Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Guoxian Yu is active.

Publication


Featured research published by Guoxian Yu.


BMC Bioinformatics | 2015

Predicting protein functions using incomplete hierarchical labels

Guoxian Yu; Hailong Zhu; Carlotta Domeniconi

Background: Protein function prediction is the task of assigning biological or biochemical functions to proteins. It is a challenging computational problem characterized by several factors: (1) the number of function labels (annotations) is large; (2) a protein may be associated with multiple labels; (3) the function labels are structured in a hierarchy; and (4) the labels are incomplete. Current predictive models often assume that the labels of the labeled proteins are complete, i.e., no label is missing. But in real scenarios, we may be aware of only some hierarchical labels of a protein, and we may not know whether additional ones are actually present. The scenario of incomplete hierarchical labels, a challenging and practical problem, is seldom studied in protein function prediction.

Results: In this paper, we propose an algorithm to Predict protein functions using Incomplete hierarchical LabeLs (PILL in short). PILL takes into account the hierarchical and the flat taxonomy similarity between function labels, and defines a Combined Similarity (ComSim) to measure the correlation between labels. PILL estimates the missing labels for a protein based on ComSim and the known labels of the protein, and uses regularization to exploit the interactions between proteins for function prediction. PILL is shown to outperform other related techniques in replenishing the missing labels and in predicting the functions of completely unlabeled proteins on publicly available PPI datasets annotated with MIPS Functional Catalogue and Gene Ontology labels.

Conclusion: The empirical study shows that it is important to account for incomplete annotations in protein function prediction. The proposed method (PILL) can serve as a valuable tool for protein function prediction using incomplete labels. The Matlab code of PILL is available upon request.
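
To make the label-replenishment idea concrete, below is a minimal Python sketch that propagates a protein's known labels through a blended label-label similarity. It is an illustration only, not the PILL implementation: the specific ingredients (ancestor-set Jaccard for the hierarchical part, column cosine for the flat part, the alpha blend, and every helper name) are assumptions.

```python
# Illustrative sketch only (not the PILL implementation): replenish a
# protein's missing hierarchical labels by propagating its known labels
# through a blended label-label similarity.
import numpy as np

def hierarchical_similarity(ancestors, a, b):
    """Jaccard overlap of ancestor sets: a simple hierarchy-aware similarity."""
    A, B = ancestors[a] | {a}, ancestors[b] | {b}
    return len(A & B) / len(A | B)

def flat_similarity(Y, a, b):
    """Cosine similarity of label columns (co-annotation pattern)."""
    ya, yb = Y[:, a], Y[:, b]
    denom = np.linalg.norm(ya) * np.linalg.norm(yb)
    return float(ya @ yb) / denom if denom > 0 else 0.0

def combined_similarity(Y, ancestors, alpha=0.5):
    """Blend hierarchical and flat similarity (the blend itself is an assumption)."""
    C = Y.shape[1]
    S = np.zeros((C, C))
    for a in range(C):
        for b in range(C):
            S[a, b] = (alpha * hierarchical_similarity(ancestors, a, b)
                       + (1 - alpha) * flat_similarity(Y, a, b))
    return S

def replenish(Y, S):
    """Score each protein's unobserved labels from its observed ones."""
    scores = Y @ S          # propagate known labels via label similarity
    scores[Y == 1] = 1.0    # keep observed annotations
    return scores

# toy example: 3 proteins, 4 labels; label 2 is a child of 0, label 3 of 1
Y = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1]], dtype=float)
ancestors = {0: set(), 1: set(), 2: {0}, 3: {1}}
print(replenish(Y, combined_similarity(Y, ancestors)))
```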


Pattern Recognition | 2012

Semi-supervised classification based on random subspace dimensionality reduction

Guoxian Yu; Guoji Zhang; Carlotta Domeniconi; Zhiwen Yu; Jane You

Graph structure is vital to graph-based semi-supervised learning. However, the problem of constructing a graph that reflects the underlying data distribution has seldom been investigated in semi-supervised learning, especially for high-dimensional data. In this paper, we focus on graph construction for semi-supervised learning and propose a novel method called Semi-Supervised Classification based on Random Subspace Dimensionality Reduction, SSC-RSDR in short. Different from traditional methods that perform graph-based dimensionality reduction and classification in the original space, SSC-RSDR performs these tasks in subspaces. More specifically, SSC-RSDR generates several random subspaces of the original space and applies graph-based semi-supervised dimensionality reduction in these random subspaces. It then constructs graphs in these processed random subspaces and trains semi-supervised classifiers on the graphs. Finally, it combines the resulting base classifiers into an ensemble classifier. Experimental results on face recognition tasks demonstrate that SSC-RSDR not only has superior recognition performance with respect to competing methods, but is also robust across a wide range of input parameter values.
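
Below is a minimal sketch of the random-subspace ensemble idea, with scikit-learn's LabelPropagation standing in for the paper's graph-based semi-supervised dimensionality reduction and classifier; that substitution, and the parameter values, are assumptions rather than the authors' setup.

```python
# Sketch only: random feature subspaces + semi-supervised base learners
# combined by majority vote (a stand-in for SSC-RSDR, not its implementation).
import numpy as np
from sklearn.semi_supervised import LabelPropagation

def ssc_rsdr_like(X, y, n_subspaces=10, subspace_dim=20, seed=0):
    """y uses -1 for unlabeled samples; returns majority-vote predictions for all samples."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_subspaces):
        dims = rng.choice(X.shape[1], size=min(subspace_dim, X.shape[1]),
                          replace=False)            # one random feature subspace
        clf = LabelPropagation(kernel="knn", n_neighbors=7)
        clf.fit(X[:, dims], y)                      # graph is built inside the subspace
        votes.append(clf.transduction_)             # predicted labels for every sample
    votes = np.stack(votes)                         # (n_subspaces, n_samples)
    # majority vote across the base classifiers (assumes non-negative integer labels)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```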


IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2013

Protein Function Prediction using Multi-label Ensemble Classification

Guoxian Yu; Huzefa Rangwala; Carlotta Domeniconi; Guoji Zhang; Zhiwen Yu

High-throughput experimental techniques produce several kinds of heterogeneous proteomic and genomic data sets. To computationally annotate proteins, it is necessary and promising to integrate these heterogeneous data sources. Some methods transform these data sources into different kernels or feature representations. Next, these kernels are linearly (or nonlinearly) combined into a composite kernel. The composite kernel is utilized to develop a predictive model to infer the function of proteins. A protein can have multiple roles and functions (or labels). Therefore, multilabel learning methods are also adapted for protein function prediction. We develop a transductive multilabel classifier (TMC) to predict multiple functions of proteins using several unlabeled proteins. We also propose a method called transductive multilabel ensemble classifier (TMEC) for integrating the different data sources using an ensemble approach. The TMEC trains a graph-based multilabel classifier on each single data source, and then combines the predictions of the individual classifiers. We use a directed birelational graph to capture the relationships between pairs of proteins, between pairs of functions, and between proteins and functions. We evaluate the effectiveness of the TMC and TMEC to predict the functions of proteins on three benchmarks. We show that our approaches perform better than recently proposed protein function prediction methods on composite and multiple kernels. The code, data sets used in this paper and supplemental material are available at https://sites.google.com/site/guoxian85/tmec.
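
The bi-relational graph can be pictured as a block matrix that stacks protein-protein, function-function, and protein-function relations, over which annotation scores are diffused. The row-normalized random-walk update below is a generic stand-in for illustration, not the exact TMC propagation rule.

```python
# Sketch only: block-structured bi-relational graph plus a simple
# diffusion of annotation scores over it.
import numpy as np

def birelational_graph(Wp, Wf, Y):
    """Wp: protein-protein weights, Wf: function-function weights,
    Y: known protein-function annotations (proteins x functions)."""
    return np.vstack([np.hstack([Wp, Y]),
                      np.hstack([Y.T, Wf])])

def propagate(B, Y, alpha=0.8, iters=50):
    """Diffuse annotation scores over the bi-relational graph B."""
    P = B / np.maximum(B.sum(axis=1, keepdims=True), 1e-12)  # row-normalize
    F0 = np.vstack([Y, np.eye(Y.shape[1])])                  # seed scores for proteins and functions
    F = F0.copy()
    for _ in range(iters):
        F = alpha * P @ F + (1 - alpha) * F0                 # random-walk style update
    return F[:Y.shape[0]]                                    # protein-function scores
```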


Knowledge Discovery and Data Mining | 2012

Transductive multi-label ensemble classification for protein function prediction

Guoxian Yu; Carlotta Domeniconi; Huzefa Rangwala; Guoji Zhang; Zhiwen Yu

Advances in biotechnology have made available multitudes of heterogeneous proteomic and genomic data. Integrating these heterogeneous data sources, to automatically infer the function of proteins, is a fundamental challenge in computational biology. Several approaches represent each data source with a kernel (similarity) function. The resulting kernels are then integrated to determine a composite kernel, which is used for developing a function prediction model. Proteins are also found to have multiple roles and functions. As such, several approaches cast the protein function prediction problem within a multi-label learning framework. In our work we develop an approach that takes advantage of several unlabeled proteins, along with multiple data sources and multiple functions of proteins. We develop a graph-based transductive multi-label classifier (TMC) that is evaluated on a composite kernel, and also propose a method for data integration using the ensemble framework, called transductive multi-label ensemble classifier (TMEC). The TMEC approach trains a graph-based multi-label classifier for each individual kernel, and then combines the predictions of the individual models. Our contribution is the use of a bi-relational directed graph that captures relationships between pairs of proteins, between pairs of functions, and between proteins and functions. We evaluate the ability of TMC and TMEC to predict the functions of proteins by using two yeast datasets. We show that our approach performs better than recently proposed protein function prediction methods on composite and multiple kernels.
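
For the ensemble (TMEC-style) combination step, one simple illustration is to fuse the protein-function score matrices produced by the per-kernel base classifiers; the plain weighted averaging shown here is an assumption, as the paper's combination rule may differ.

```python
# Sketch only: fuse per-kernel prediction score matrices by weighted averaging.
import numpy as np

def combine_predictions(score_matrices, weights=None):
    """score_matrices: list of (proteins x functions) arrays, one per kernel/base model."""
    S = np.stack(score_matrices)                 # (n_kernels, n_proteins, n_functions)
    if weights is None:
        weights = np.full(len(score_matrices), 1.0 / len(score_matrices))
    return np.tensordot(weights, S, axes=1)      # weighted average of the score matrices

# usage with two hypothetical per-kernel prediction matrices
scores_ppi = np.random.rand(5, 3)    # e.g. from a PPI-network base classifier
scores_expr = np.random.rand(5, 3)   # e.g. from a gene-expression base classifier
fused = combine_predictions([scores_ppi, scores_expr])
```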


Applied Soft Computing | 2012

Semi-supervised ensemble classification in subspaces

Guoxian Yu; Guoji Zhang; Zhiwen Yu; Carlotta Domeniconi; Jane You; Guoqiang Han

Graph-based semi-supervised classification depends on a well-structured graph. However, it is difficult to construct a graph that faithfully reflects the underlying structure of the data distribution, especially for data with a high-dimensional representation. In this paper, we focus on graph construction and propose a novel method called semi-supervised ensemble classification in subspaces, SSEC in short. Unlike traditional methods that execute graph-based semi-supervised classification in the original space, SSEC performs semi-supervised linear classification in subspaces. More specifically, SSEC first divides the original feature space into several disjoint feature subspaces. Then, it constructs a neighborhood graph in each subspace and trains a semi-supervised linear classifier on this graph, which serves as a base classifier in an ensemble. Finally, SSEC combines the obtained base classifiers into an ensemble classifier using the majority-voting rule. Experimental results on facial image classification show that SSEC not only achieves higher classification accuracy than the competing methods, but is also effective over a wide range of input parameter values.
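
Below is a minimal sketch of the disjoint-subspace partition followed by majority voting, with scikit-learn's LabelSpreading standing in for the paper's semi-supervised linear base classifier; the substitution and parameter choices are assumptions.

```python
# Sketch only: split the features into disjoint blocks, train one
# semi-supervised learner per block, and majority-vote the predictions.
import numpy as np
from sklearn.semi_supervised import LabelSpreading

def disjoint_subspaces(n_features, n_blocks, seed=0):
    """Randomly partition the feature indices into disjoint blocks."""
    perm = np.random.default_rng(seed).permutation(n_features)
    return np.array_split(perm, n_blocks)

def ssec_like(X, y, n_blocks=5):
    """y uses -1 for unlabeled samples; returns majority-vote predictions."""
    votes = []
    for dims in disjoint_subspaces(X.shape[1], n_blocks):
        clf = LabelSpreading(kernel="knn", n_neighbors=7)
        clf.fit(X[:, dims], y)              # neighborhood graph built per subspace
        votes.append(clf.transduction_)
    votes = np.stack(votes)
    # majority-voting rule over the base classifiers (assumes integer class labels)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```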


Knowledge and Information Systems | 2015

Semi-supervised classification based on subspace sparse representation

Guoxian Yu; Guoji Zhang; Zili Zhang; Zhiwen Yu; Lin Deng

The graph plays an important role in graph-based semi-supervised classification. However, due to noisy and redundant features in high-dimensional data, constructing a well-structured graph on high-dimensional samples is not a trivial job. In this paper, we take advantage of sparse representation in random subspaces for graph construction and propose a method called Semi-Supervised Classification based on Subspace Sparse Representation, SSC-SSR in short. SSC-SSR first generates several random subspaces from the original space and then seeks sparse representation coefficients in these subspaces. Next, it trains semi-supervised linear classifiers on graphs that are constructed from these coefficients. Finally, it combines these classifiers into an ensemble classifier by solving a linear regression problem. Unlike traditional graph-based semi-supervised classification methods, the graphs of SSC-SSR are data-driven instead of specified in advance. An empirical study on face image classification tasks demonstrates that SSC-SSR not only has superior recognition performance with respect to competing methods, but also remains effective over wide ranges of input parameters.
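
The data-driven ingredient is the sparse-representation graph built inside each subspace: every sample is regressed on the remaining samples with an L1 penalty, and the absolute coefficients become edge weights. The sketch below uses scikit-learn's Lasso as the sparse solver, which is an assumption; the final linear-regression combination of the base classifiers is not shown.

```python
# Sketch only: build a graph from sparse-coding coefficients within one subspace.
import numpy as np
from sklearn.linear_model import Lasso

def sparse_graph(X, alpha=0.05):
    """X: (n_samples, n_features), already projected into one random subspace.
    Returns a symmetric weight matrix whose entries are sparse-coding coefficients."""
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        others = np.delete(np.arange(n), i)
        # represent sample i as a sparse linear combination of the other samples
        coef = Lasso(alpha=alpha, max_iter=5000).fit(X[others].T, X[i]).coef_
        W[i, others] = np.abs(coef)
    return (W + W.T) / 2      # symmetrize so the graph is undirected
```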


Information Sciences | 2014

Probabilistic cluster structure ensemble

Zhiwen Yu; Le Li; Hau-San Wong; Jane You; Guoqiang Han; Yunjun Gao; Guoxian Yu

Cluster structure ensemble focuses on integrating multiple cluster structures extracted from different datasets into a unified cluster structure, instead of aligning the individual labels of clustering solutions derived from multiple homogeneous datasets as in the cluster ensemble framework. In this article, we design a novel probabilistic cluster structure ensemble framework, referred to as the Gaussian mixture model based cluster structure ensemble framework (GMMSE), to identify the most representative cluster structure of a dataset. Specifically, GMMSE first applies the bagging approach to produce a set of variant datasets. Then, a set of Gaussian mixture models is used to capture the underlying cluster structures of these datasets. GMMSE applies K-means to initialize the parameter values of each Gaussian mixture model and adopts the Expectation-Maximization (EM) approach to estimate them. Next, the components of the Gaussian mixture models are viewed as new data samples, which are used to construct a representative matrix capturing the relationships among components. The similarity between two components, each corresponding to a Gaussian distribution, is measured by the Bhattacharyya distance. Afterwards, GMMSE constructs a graph based on the new data samples and the representative matrix, and searches for the most representative cluster structure. Finally, we also design four criteria to assign the data samples to their corresponding clusters based on the unified cluster structure. The experimental results show that (i) GMMSE works well on synthetic datasets and real datasets from the UCI machine learning repository, and (ii) GMMSE outperforms most previous cluster ensemble approaches.
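
The component-to-component similarity can be made concrete with the Bhattacharyya distance between two Gaussian components; converting the distance to a similarity via exp(-d) is an illustrative assumption, not necessarily the paper's choice.

```python
# Sketch only: Bhattacharyya distance between two Gaussian components
# (mean mu, covariance Sigma), turned into a similarity.
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    cov = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    term2 = 0.5 * np.log(np.linalg.det(cov) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2

def component_similarity(mu1, cov1, mu2, cov2):
    return np.exp(-bhattacharyya_distance(mu1, cov1, mu2, cov2))

# usage: two 2-D Gaussian components as EM might estimate them (values are illustrative)
mu_a, cov_a = np.zeros(2), np.eye(2)
mu_b, cov_b = np.ones(2), 0.5 * np.eye(2)
print(component_similarity(mu_a, cov_a, mu_b, cov_b))
```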


IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2014

Protein function prediction with incomplete annotations

Guoxian Yu; Huzefa Rangwala; Carlotta Domeniconi; Guoji Zhang; Zhiwen Yu

Automated protein function prediction is one of the grand challenges in computational biology. Multi-label learning is widely used to predict the functions of proteins. Most multi-label learning methods make predictions for unlabeled proteins under the assumption that the labeled proteins are completely annotated, i.e., without any missing functions. However, in practice, we may have only a subset of the ground-truth functions for a protein, and whether the protein has other functions is unknown. To predict protein functions with incomplete annotations, we propose a Protein Function Prediction method with Weak-label Learning (ProWL) and its variant ProWL-IF. Both ProWL and ProWL-IF can replenish the missing functions of proteins. In addition, ProWL-IF makes use of the knowledge that a protein cannot have certain functions, which can further boost the performance of protein function prediction. Our experimental results on protein-protein interaction networks and gene expression benchmarks validate the effectiveness of both ProWL and ProWL-IF.
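
A rough sketch of the weak-label setting: spread the partial annotations over a protein-protein network while keeping the observed functions clamped, and, in the spirit of the ProWL-IF variant, zero out functions a protein is known not to have. The update rule below is an illustrative assumption, not the paper's optimization.

```python
# Sketch only: replenish missing functions from partial annotations and
# optional negative knowledge (functions a protein cannot have).
import numpy as np

def replenish_weak_labels(W, Y, forbidden=None, alpha=0.8, iters=50):
    """W: protein-protein weights, Y: partial annotations (1 = known function),
    forbidden: optional boolean mask of functions a protein cannot have."""
    P = W / np.maximum(W.sum(axis=1, keepdims=True), 1e-12)  # row-normalize the network
    F = Y.astype(float).copy()
    for _ in range(iters):
        F = alpha * P @ F + (1 - alpha) * Y    # spread known annotations over the network
        F[Y == 1] = 1.0                        # keep the observed functions
        if forbidden is not None:
            F[forbidden] = 0.0                 # clamp functions the protein cannot have
    return F
```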


BMC Systems Biology | 2015

Integrating multiple networks for protein function prediction

Guoxian Yu; Hailong Zhu; Carlotta Domeniconi; Maozu Guo

Background: High-throughput techniques produce multiple functional association networks. Integrating these networks can enhance the accuracy of protein function prediction. Many algorithms have been introduced to generate a composite network, which is obtained as a weighted sum of individual networks. The weight assigned to an individual network reflects its benefit towards the protein functional annotation inference. A classifier is then trained on the composite network for predicting protein functions. However, since these techniques model the optimization of the composite network and the prediction task as separate objectives, the resulting composite network is not necessarily optimal for the follow-up protein function prediction.

Results: We address this issue by modeling the optimization of the composite network and the prediction problem within a unified objective function. In particular, we use a kernel target alignment technique and the loss function of a network-based classifier to jointly adjust the weights assigned to the individual networks. We show that the proposed method, called MNet, can achieve a performance that is superior (with respect to different evaluation criteria) to related techniques using the multiple networks of four example species (yeast, human, mouse, and fly) annotated with thousands (or hundreds) of GO terms.

Conclusion: MNet can effectively integrate multiple networks for protein function prediction and is robust to the input parameters. Supplementary data is available at https://sites.google.com/site/guoxian85/home/mnet. The Matlab code of MNet is available upon request.
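
The kernel target alignment ingredient can be sketched as follows: measure how well a weighted sum of the association networks aligns with a target kernel built from the known annotations. The joint optimization of these weights with the classifier loss, which is the actual contribution of MNet, is not shown here.

```python
# Sketch only: centered kernel alignment between a weighted composite
# network and a label-derived target kernel.
import numpy as np

def centered(K):
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H

def kernel_alignment(K, T):
    Kc, Tc = centered(K), centered(T)
    return np.sum(Kc * Tc) / (np.linalg.norm(Kc) * np.linalg.norm(Tc))

def composite_alignment(networks, weights, Y):
    """networks: list of (n x n) association matrices, Y: (n x labels) annotations."""
    K = sum(w * W for w, W in zip(weights, networks))   # weighted composite network
    T = Y @ Y.T                                         # target kernel from the labels
    return kernel_alignment(K, T)
```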


Neurocomputing | 2012

Local and global structure preserving based feature selection

Yazhou Ren; Guoji Zhang; Guoxian Yu; Xuan Li

Feature selection is of great importance in data mining tasks, especially for exploring high-dimensional data. Laplacian Score, a recently proposed feature selection method, makes use of the local manifold structure of samples to select features and achieves good performance. However, it ignores the global structure of samples, and the selected features are highly redundant. To address these issues, we propose a feature selection method based on local and global structure preserving, LGFS in short. LGFS first uses two graphs, a nearest neighborhood graph and a farthest neighborhood graph, to describe the underlying local and global structure of samples, respectively. It then defines a criterion that prefers features which preserve the local and global structure well. To remove redundancy among the selected features, Extended LGFS (E-LGFS) is introduced, which uses normalized mutual information to measure the dependency between pairs of features. We conduct extensive experiments on two artificial data sets, six UCI data sets and two publicly available face databases to evaluate LGFS and E-LGFS. The experimental results show that our methods achieve higher accuracies than the other unsupervised methods in the comparison.
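
A rough sketch of the local and global structure preserving criterion: for each feature, compare its spread over a nearest neighborhood graph (which should stay small) with its spread over a farthest neighborhood graph (which should stay large). The ratio score and the graph construction below are assumptions, and the NMI-based redundancy removal of E-LGFS is not shown.

```python
# Sketch only: score features by how well they keep near samples close
# and far samples apart.
import numpy as np
from sklearn.metrics import pairwise_distances

def lgfs_like_scores(X, k=5):
    D = pairwise_distances(X)                  # Euclidean distances between samples
    order = np.argsort(D, axis=1)              # ascending; column 0 is the sample itself
    nn, fn = order[:, 1:k + 1], order[:, -k:]  # k nearest / k farthest neighbors
    scores = []
    for j in range(X.shape[1]):
        f = X[:, j]
        local = np.mean((f[:, None] - f[nn]) ** 2)   # spread over near pairs (small is good)
        glob = np.mean((f[:, None] - f[fn]) ** 2)    # spread over far pairs (large is good)
        scores.append(glob / (local + 1e-12))
    return np.array(scores)                          # higher score = more preferred feature
```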

Collaboration


Dive into Guoxian Yu's collaborations.

Top Co-Authors

Jun Wang
Southwest University

Guoji Zhang
South China University of Technology

Yazhou Ren
University of Electronic Science and Technology of China

Zhiwen Yu
South China University of Technology

Chang Lu
Southwest University