
Qingyao Wu

South China University of Technology

Publication

Featured research published by Qingyao Wu.

Pattern Recognition | 2013

Stratified sampling for feature subspace selection in random forests for high dimensional data

Yunming Ye; Qingyao Wu; Joshua Zhexue Huang; Michael K. Ng; Xutao Li

For high dimensional data, a large portion of the features are often not informative of the class of the objects. Random forest algorithms tend to use simple random sampling of features when building their decision trees and consequently select many subspaces that contain few, if any, informative features. In this paper we propose a stratified sampling method to select the feature subspaces for random forests with high dimensional data. The key idea is to stratify features into two groups: one group contains strongly informative features and the other weakly informative features. Then, for feature subspace selection, we randomly select features from each group proportionally. The advantage of stratified sampling is that each subspace is ensured to contain enough informative features for classification in high dimensional data. Tests on both synthetic data and various real data sets in gene classification, image categorization and face recognition consistently demonstrate the effectiveness of this new method. The performance is shown to be better than that of state-of-the-art algorithms including SVM, four variants of random forests (RF, ERT, enrich-RF, and oblique-RF), and nearest neighbor (NN) algorithms.
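The two-group sampling idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `stratified_subspace`, the fixed `threshold` that splits strong from weak features, and the rule that keeps at least one strong feature are all assumptions made for the example.

```python
import random


def stratified_subspace(scores, subspace_size, threshold, rng):
    """Sample feature indices from a strong and a weak stratum.

    scores        : one informativeness score per feature (hypothetical input)
    subspace_size : number of features to draw for one tree node
    threshold     : assumed cut between strong and weak features
    rng           : a random.Random instance, for reproducibility
    """
    strong = [i for i, s in enumerate(scores) if s >= threshold]
    weak = [i for i, s in enumerate(scores) if s < threshold]

    # Proportional allocation between the two strata.
    n_strong = round(subspace_size * len(strong) / len(scores))
    # Assumption: always keep at least one strong feature when one exists.
    if strong:
        n_strong = max(1, n_strong)
    n_strong = min(n_strong, len(strong), subspace_size)
    n_weak = min(subspace_size - n_strong, len(weak))

    return sorted(rng.sample(strong, n_strong) + rng.sample(weak, n_weak))


if __name__ == "__main__":
    demo_rng = random.Random(0)
    demo_scores = [0.9, 0.8] + [0.01] * 98  # 2 strong, 98 weak features
    print(stratified_subspace(demo_scores, 10, 0.5, demo_rng))
```

With plain random sampling, a subspace of 10 features drawn from these 100 would often miss both strong features; the stratified draw always includes at least one.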

BMC Systems Biology | 2015

Protein functional properties prediction in sparsely-label PPI networks through regularized non-negative matrix factorization

Qingyao Wu; Zhenyu Wang; Chunshan Li; Yunming Ye; Yueping Li; Ning Sun

Background: Predicting functional properties of proteins in protein-protein interaction (PPI) networks is a challenging problem with important implications in computational biology. Collective classification (CC), which utilizes both attribute features and relational information to jointly classify related proteins in PPI networks, has been shown to be a powerful computational method for this problem setting. Enabling CC usually increases accuracy when given a fully-labeled PPI network with a large amount of labeled data. However, such labels can be difficult to obtain in many real-world PPI networks, in which there are usually only a limited number of labeled proteins and a large number of unlabeled proteins. In this case, most of the unlabeled proteins may not be connected to the labeled ones, so the supervision knowledge cannot be obtained effectively from local network connections. As a consequence, learning a CC model in sparsely-labeled PPI networks can lead to poor performance.
Results: We investigate a latent graph approach that finds an integrated latent graph by exploiting various latent linkages and judiciously integrating them to link (separate) the proteins with similar (different) functions. We develop a regularized non-negative matrix factorization (RNMF) algorithm for CC to predict protein functional properties by utilizing the various data sources that are available in this problem setting, including attribute features, the latent graph, and unlabeled data. In RNMF, a label matrix factorization term and a network regularization term are incorporated into the non-negative matrix factorization (NMF) objective function to seek a matrix factorization that respects the network structure and label information for classification prediction.
Conclusion: Experimental results on KDD Cup tasks predicting the localization and functions of proteins for yeast genes demonstrate the effectiveness of the proposed RNMF method for predicting protein properties. In the comparison, we find that the performance of the new method is better than that of the other compared CC algorithms, especially when labeled proteins are scarce.
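The abstract does not state the RNMF objective itself. As a rough orientation only, an NMF objective extended with a label factorization term and a graph regularization term is commonly written in the following generic form; every symbol here is an assumption for illustration and the paper's exact formulation may differ.

```latex
\min_{U \ge 0,\; V \ge 0,\; B \ge 0}\;
\lVert X - U V^{\top} \rVert_F^2
\;+\; \alpha \,\lVert S\,(Y - V B) \rVert_F^2
\;+\; \beta \,\operatorname{tr}\!\left(V^{\top} L V\right)
```

In this sketch, $X \in \mathbb{R}^{d \times n}$ holds the attribute features of $n$ proteins, $U \in \mathbb{R}^{d \times k}$ and $V \in \mathbb{R}^{n \times k}$ are the non-negative factors, $Y \in \mathbb{R}^{n \times c}$ is the label matrix, $B \in \mathbb{R}^{k \times c}$ maps latent factors to labels, $S$ is a diagonal indicator matrix that selects the labeled proteins, $L = D - W$ is the Laplacian of the latent graph with affinity matrix $W$ and degree matrix $D$, and $\alpha, \beta \ge 0$ weight the label and network terms.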

IEEE Transactions on Nanobioscience | 2012

SNP Selection and Classification of Genome-Wide SNP Data Using Stratified Sampling Random Forests

Qingyao Wu; Yunming Ye; Yang Liu; Michael K. Ng

For high dimensional genome-wide association (GWA) case-control data of complex disease, a large portion of the single-nucleotide polymorphisms (SNPs) are usually irrelevant to the disease. A simple random sampling method in random forests that uses the default mtry parameter to choose feature subspaces will select too many subspaces without informative SNPs. An exhaustive search for an optimal mtry is often required in order to include useful and relevant SNPs and discard the vast number of non-informative SNPs. However, this is too time-consuming and not favorable in GWA for high-dimensional data. The main aim of this paper is to propose a stratified sampling method for feature subspace selection to generate decision trees in a random forest for GWA high-dimensional data. Our idea is to design an equal-width discretization scheme for informativeness to divide SNPs into multiple groups. In feature subspace selection, we randomly select the same number of SNPs from each group and combine them to form a subspace to generate a decision tree. The advantage of this stratified sampling procedure is that it ensures each subspace contains enough useful SNPs, while avoiding the very high computational cost of an exhaustive search for an optimal mtry and maintaining the randomness of a random forest. We employ two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) to demonstrate that the proposed stratified sampling method is effective, and that it can generate better random forests with higher accuracy and a lower error bound than those generated by Breiman's random forest method. For the Parkinson data, we also show some interesting genes identified by the method, which may be associated with neurological disorders and merit further biological investigation.
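The equal-width grouping and equal-count sampling described above can be illustrated with a short Python sketch. It is not the authors' code: the function names, the generic `scores` input standing in for an informativeness measure, and the handling of nearly empty groups are assumptions made for the example.

```python
import random


def equal_width_groups(scores, n_groups):
    """Split feature indices into n_groups bins of equal score width."""
    lo, hi = min(scores), max(scores)
    # Fall back to width 1.0 when all scores are identical.
    width = (hi - lo) / n_groups or 1.0
    groups = [[] for _ in range(n_groups)]
    for i, s in enumerate(scores):
        # The maximum score would land in bin n_groups, so clamp it.
        g = min(int((s - lo) / width), n_groups - 1)
        groups[g].append(i)
    return groups


def sample_subspace(groups, per_group, rng):
    """Draw the same number of features from every group."""
    subspace = []
    for g in groups:
        # Assumption: a group smaller than per_group contributes all it has.
        subspace.extend(rng.sample(g, min(per_group, len(g))))
    return subspace


if __name__ == "__main__":
    demo_scores = [0.0, 0.1, 0.2, 0.5, 0.6, 0.9, 1.0]
    demo_groups = equal_width_groups(demo_scores, 2)
    print(demo_groups)
    print(sample_subspace(demo_groups, 2, random.Random(1)))
```

Because every group contributes the same number of features, highly informative but rare SNPs are not crowded out by the much larger pool of weakly informative ones.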

BMC Genomics | 2015

Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests

Thanh-Tung Nguyen; Joshua Zhexue Huang; Qingyao Wu; Thuy Thi Nguyen; Mark Junjie Li

Background: Single-nucleotide polymorphism (SNP) selection and identification are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data is very high dimensional and a large portion of the SNPs in the data are irrelevant to the disease. Advanced machine learning methods have been successfully used in genome-wide association studies (GWAS) to identify genetic variants that have relatively large effects in some common, complex diseases. Among them, the most successful one is Random Forests (RF). Despite performing well in terms of prediction accuracy on some data sets of moderate size, RF still struggles in GWAS with selecting informative SNPs and building accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method in random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs into two groups. The informative SNPs group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building the trees of the forest, only SNPs from the two sub-groups are taken into account. The feature subspaces used to split a node of a tree always contain highly informative SNPs.
Results: This approach enables one to generate more accurate trees with a lower prediction error, while possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) and 10 gene data sets have demonstrated that the proposed model significantly reduces prediction errors and outperforms most existing state-of-the-art random forests. The top 25 SNPs in the Parkinson data set identified by the proposed model include four interesting genes associated with neurological disorders.
Conclusion: The presented approach has been shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases that traditional statistical approaches might fail to detect. The new RF works well for data where the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experimental results demonstrate the effectiveness of the proposed RF model, which outperforms state-of-the-art RFs including Breiman's RF, GRRF and wsRF.
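The two-stage grouping described in the Background can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the cut-off points are derived, so `cutoff` and `strong_cutoff` are passed in as hypothetical parameters, and the function name and per-group counts are likewise invented for the example.

```python
import random


def two_stage_subspace(p_values, cutoff, strong_cutoff, n_strong, n_weak, rng):
    """Draw a SNP subspace from the informative SNPs only.

    Stage 1: SNPs with p > cutoff are treated as irrelevant and never sampled.
    Stage 2: the remaining SNPs are split into highly informative
             (p <= strong_cutoff) and weakly informative SNPs.
    Both thresholds are assumed inputs, not values from the paper.
    """
    strong = [i for i, p in enumerate(p_values) if p <= strong_cutoff]
    weak = [i for i, p in enumerate(p_values) if strong_cutoff < p <= cutoff]
    # Highly informative SNPs come first, so each subspace contains some.
    picked = rng.sample(strong, min(n_strong, len(strong)))
    picked += rng.sample(weak, min(n_weak, len(weak)))
    return picked


if __name__ == "__main__":
    demo_p = [1e-8, 1e-6, 0.01, 0.03, 0.5, 0.9]
    print(two_stage_subspace(demo_p, 0.05, 1e-4, 1, 1, random.Random(0)))
```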

Knowledge and Information Systems | 2013

Markov-Miml: A Markov chain-based multi-instance multi-label learning algorithm

Qingyao Wu; Michael K. Ng; Yunming Ye

The main aim of this paper is to propose an efficient and novel Markov chain-based multi-instance multi-label (Markov-Miml) learning algorithm to evaluate the importance of a set of labels associated with objects of multiple instances. The algorithm computes a ranking of labels to indicate the importance of a set of labels to an object. Our approach is to exploit the relationships between instances and labels of objects. The rank of a class label for an object depends on (i) the affinity metric between the bag of instances of this object and the bags of instances of the other objects, and (ii) the rank of the class label for similar objects. An object whose bag of instances is highly similar to the bags of instances of other objects with a high rank for a particular class label receives a high rank for this class label. Experimental results on benchmark data have shown that the proposed algorithm is computationally efficient and effective in label ranking for MIML data. In the comparison, we find that the classification performance of the Markov-Miml algorithm is competitive with that of three popular MIML algorithms based on boosting, support vector machines, and regularization, while the computational time required by the proposed algorithm is less than that of the other three algorithms.

Proceedings of the 1st International Workshop on Cross Domain Knowledge Discovery in Web and Social Network Mining | 2012

Co-transfer learning via joint transition probability graph based method

Michael K. Ng; Qingyao Wu; Yunming Ye

This paper studies a new machine learning strategy called co-transfer learning. Unlike many previous learning problems, we focus on how to use labeled data of different feature spaces to enhance the classification of different learning spaces simultaneously. For instance, we make use of both labeled images and labeled text data to help learn models for classifying image data and text data together. An important component of co-transfer learning is to build different relations to link different feature spaces, so that knowledge can be co-transferred across different spaces. Our idea is to model the problem as a joint transition probability graph. The transition probabilities can be constructed by using the intra-relationships based on the affinity metric among instances and the inter-relationships based on co-occurrence information among instances from different spaces. The proposed algorithm computes a ranking of labels to indicate the importance of a set of labels to an instance by propagating the ranking score of labeled instances via a random walk with restart. The main contributions of this paper are to (i) propose a co-transfer learning (CT-Learn) framework that can perform learning simultaneously by co-transferring knowledge across different spaces; (ii) show the theoretical properties of the random walk on such a joint transition probability graph so that the proposed learning model can be used effectively; and (iii) develop an efficient algorithm to compute ranking scores and generate the possible labels for a given instance. Experimental results on benchmark data (image-text and English-Chinese-French classification data sets) have shown that the proposed algorithm is computationally efficient and effective in learning across different spaces. In the comparison, we find that the classification performance of the CT-Learn algorithm is better than that of the other tested transfer learning algorithms.
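The score propagation step mentioned above, a random walk with restart, can be illustrated on a toy graph. The sketch below is a generic power-iteration version of that technique, not the CT-Learn algorithm itself: the function name, the restart probability `alpha`, and the three-node transition matrix are assumptions chosen for the example.

```python
def random_walk_with_restart(P, restart, alpha=0.15, tol=1e-10, max_iter=1000):
    """Iterate x <- (1 - alpha) * P x + alpha * restart until convergence.

    P       : column-stochastic matrix as nested lists; P[i][j] is the
              probability of moving from node j to node i
    restart : restart distribution, e.g. mass on the labeled instances
    alpha   : restart probability (an assumed value, not from the paper)
    """
    n = len(restart)
    x = list(restart)
    for _ in range(max_iter):
        new = [
            (1 - alpha) * sum(P[i][j] * x[j] for j in range(n))
            + alpha * restart[i]
            for i in range(n)
        ]
        if max(abs(a - b) for a, b in zip(new, x)) < tol:
            return new
        x = new
    return x


if __name__ == "__main__":
    # Path graph 0 - 1 - 2; node 0 is the only labeled instance.
    P = [
        [0.0, 0.5, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 0.5, 0.0],
    ]
    print(random_walk_with_restart(P, [1.0, 0.0, 0.0]))
```

Running one such walk per class label, with the restart vector concentrated on the instances labeled with that class, gives each unlabeled instance a score per label that can then be ranked.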

IEEE Intelligent Systems | 2014

Cotransfer Learning Using Coupled Markov Chains with Restart

Qingyao Wu; Michael K. Ng; Yunming Ye

This article studies co-transfer learning, a machine learning strategy that uses labeled data to enhance the classification of different learning spaces simultaneously. The authors model the problem as a coupled Markov chain with restart. The transition probabilities in the coupled Markov chain can be constructed using the intra-relationships based on the affinity metric among instances in the same space, and the inter-relationships based on co-occurrence information among instances from different spaces. The learning algorithm computes a ranking of labels to indicate the importance of a set of labels to an instance by propagating the ranking score of labeled instances via the coupled Markov chain with restart. Experimental results on benchmark data (multiclass image-text and English-Spanish-French classification datasets) have shown that the learning algorithm is computationally efficient and effective in learning across different spaces.

Knowledge Based Systems | 2014

Multi-label collective classification via Markov chain based learning method

Qingyao Wu; Michael K. Ng; Yunming Ye; Xutao Li; Ruichao Shi; Yan Li

In this paper, we study the problem of multi-label collective classification (MLCC), where instances are related and associated with multiple class labels. Such correlation of class labels among interrelated instances exists in a wide variety of data; e.g., a web page can belong to multiple categories since its semantics can be recognized in different ways, and linked web pages are more likely to have the same classes than unlinked pages. We propose an effective and novel Markov chain based learning method for MLCC problems. Our idea is to model the problem as a Markov chain with restart on transition probability graphs, and to propagate the ranking score of labeled instances to unlabeled instances based on the affinity among instances. The affinity among instances is set up by explicitly using the attribute features derived from the content of instances as well as the correlation features constructed from the links of instances. Intuitively, an instance whose linked neighbors are highly similar to other instances with a high rank for a particular class label has a high chance of receiving this class label. Extensive experiments have been conducted on two DBLP datasets to demonstrate the effectiveness of the proposed algorithm. The performance of the proposed algorithm is shown to be better than that of the binary relevance multi-label algorithm, collective classification algorithms (wvRN, ICA and Gibbs), and the ICML algorithm for the tested MLCC problems.

IEEE Transactions on Neural Networks | 2015

ML-TREE: A Tree-Structure-Based Approach to Multilabel Learning

Qingyao Wu; Yunming Ye; Haijun Zhang; Tommy W. S. Chow; Shen-Shyang Ho

Multilabel learning aims to predict labels of unseen instances by learning from training samples that are associated with a set of known labels. In this paper, we propose to use a hierarchical tree model for multilabel learning, and develop the ML-Tree algorithm for finding the tree structure. ML-Tree considers a tree as a hierarchy of data and constructs the tree using the induction of one-against-all SVM classifiers at each node to recursively partition the data into child nodes. For each node, we define a predictive label vector to represent the predictive label transmission in the tree model for multilabel prediction and automatic discovery of the label relationships. If two labels co-occur frequently as predictive labels at leaf nodes, these labels are assumed to be related. The amount of predictive label co-occurrence provides an estimate of the label relationships. We examine the ML-Tree method on 11 real data sets from different domains and compare it with six well-established multilabel learning algorithms. The performance of these approaches is evaluated by 16 commonly used measures. We also conduct Friedman and Nemenyi tests to assess the statistical significance of the differences in performance. Experimental results demonstrate the effectiveness of our method.
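The label relationship estimate described above rests on counting how often two labels appear together as predictive labels at leaf nodes. A minimal sketch of that counting step is shown here; the function name and the representation of each leaf as a set of label strings are assumptions for illustration, and the tree construction itself is not covered.

```python
from collections import Counter
from itertools import combinations


def label_cooccurrence(leaf_label_sets):
    """Count, for each label pair, the leaves where both are predictive.

    leaf_label_sets : iterable of sets, one set of predictive labels per
                      leaf node (hypothetical input format)
    """
    counts = Counter()
    for labels in leaf_label_sets:
        # Sort so that each pair is always stored in the same order.
        for a, b in combinations(sorted(labels), 2):
            counts[(a, b)] += 1
    return counts


if __name__ == "__main__":
    leaves = [{"beach", "sea"}, {"beach", "sea", "sunset"}, {"mountain"}]
    for pair, n in label_cooccurrence(leaves).most_common():
        print(pair, n)
```

Pairs with high counts, such as ("beach", "sea") in this toy input, would be treated as related labels.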

BMC Bioinformatics | 2014

Collective prediction of protein functions from protein-protein interaction networks

Qingyao Wu; Yunming Ye; Michael K. Ng; Shen-Shyang Ho; Ruichao Shi

Background: Automated assignment of functions to unknown proteins is one of the most important tasks in computational biology. The development of experimental methods for genome-scale analysis of molecular interaction networks offers new ways to infer protein function from protein-protein interaction (PPI) network data. Existing techniques for collective classification (CC) usually increase accuracy for network data, wherein instances are interlinked with each other, using a large amount of labeled data for training. However, labeled data are time-consuming and expensive to obtain. On the other hand, one can easily obtain a large amount of unlabeled data. Thus, more sophisticated methods are needed to exploit the unlabeled data to increase accuracy in protein function prediction.
Results: In this paper, we propose an effective Markov chain based CC algorithm (ICAM) to tackle the label deficiency problem in CC for interrelated proteins from PPI networks. Our idea is to model the problem using two distinct Markov chain classifiers that make separate predictions with regard to attribute features from protein data and relational features from relational information. The ICAM learning algorithm combines the results of the two classifiers to compute the ranks of labels, which indicate the importance of a set of labels to an instance, and uses an ICA framework to iteratively refine the learning models to improve the performance of protein function prediction from PPI networks when labeled data are scarce.
Conclusion: Experimental results on real-world yeast protein-protein interaction datasets show that our proposed ICAM method is better than the other ICA-type methods given limited labeled training data. This approach can serve as a valuable tool for the study of protein function prediction from PPI networks.
