Publication


Featured research published by Pengyi Yang.


Current Bioinformatics | 2010

A Review of Ensemble Methods in Bioinformatics

Pengyi Yang; Yee Hwa Yang; Bing Bing Zhou; Albert Y. Zomaya

Ensemble learning is an intensively studied technique in machine learning and pattern recognition. Recent work in computational biology has seen an increasing use of ensemble learning methods due to their unique advantages in dealing with small sample sizes, high dimensionality, and complex data structures. The aim of this article is two-fold. First, it provides a review of the most widely used ensemble learning methods and their application to various bioinformatics problems, including the main topics of gene expression, mass spectrometry-based proteomics, gene-gene interaction identification from genome-wide association studies, and prediction of regulatory elements from DNA and protein sequences. Second, we try to identify and summarize future trends of ensemble methods in bioinformatics. Promising directions such as ensembles of support vector machines, meta-ensembles, and ensemble-based feature selection are discussed.


Cell Metabolism | 2013

Dynamic adipocyte phosphoproteome reveals that Akt directly regulates mTORC2.

Sean J. Humphrey; Guang Yang; Pengyi Yang; Daniel J. Fazakerley; Jacqueline Stöckli; Jean Y. Yang; David E. James

Summary: A major challenge of the post-genomics era is to define the connectivity of protein phosphorylation networks. Here, we quantitatively delineate the insulin signaling network in adipocytes by high-resolution mass spectrometry-based proteomics. These data reveal the complexity of intracellular protein phosphorylation. We identified 37,248 phosphorylation sites on 5,705 proteins in this single cell type, with approximately 15% responding to insulin. We integrated these large-scale phosphoproteomics data using a machine learning approach to predict physiological substrates of several diverse insulin-regulated kinases. This led to the identification of an Akt substrate, SIN1, a core component of the mTORC2 complex. The phosphorylation of SIN1 by Akt was found to regulate mTORC2 activity in response to growth factors, revealing topological insights into the Akt/mTOR signaling network. The dynamic phosphoproteome described here contains numerous phosphorylation sites on proteins involved in diverse molecular functions and should serve as a useful functional resource for cell biologists.


Nature Communications | 2015

DNMT1 is essential for mammary and cancer stem cell maintenance and tumorigenesis

Rajneesh Pathania; Selvakumar Elangovan; Ravi Padia; Pengyi Yang; Senthilkumar Cinghu; Rajalakshmi Veeranan-Karmegam; Pachiappan Arjunan; Jaya P. Gnana-Prakasam; Fulzele Sadanand; Lirong Pei; Chang Sheng Chang; Jeong Hyeon Choi; Huidong Shi; Santhakumar Manicassamy; Puttur D. Prasad; Suash Sharma; Vadivel Ganapathy; Raja Jothi; Muthusamy Thangaraju

Mammary stem/progenitor cells (MaSCs) maintain self-renewal of the mammary epithelium during puberty and pregnancy. DNA methylation provides a potential epigenetic mechanism for maintaining cellular memory during self-renewal. Although DNA methyltransferases (DNMTs) are dispensable for embryonic stem cell maintenance, their role in maintaining MaSCs and cancer stem cells (CSCs) in the constantly replenishing mammary epithelium is unclear. Here we show that DNMT1 is indispensable for MaSC maintenance. Furthermore, we find that DNMT1 expression is elevated in mammary tumors, and mammary gland-specific DNMT1 deletion protects mice from mammary tumorigenesis by limiting the CSC pool. Through genome-scale methylation studies, we identify ISL1 as a direct DNMT1 target, hypermethylated and downregulated in mammary tumors and CSCs. DNMT inhibition or ISL1 expression in breast cancer cells limits the CSC population. Altogether, our studies uncover an essential role for DNMT1 in MaSC and CSC maintenance and identify the DNMT1-ISL1 axis as a potential therapeutic target for breast cancer treatment.


BMC Genomics | 2009

A particle swarm based hybrid system for imbalanced medical data sampling

Pengyi Yang; Liang Xu; Bing Bing Zhou; Zili Zhang; Albert Y. Zomaya

Background: Medical and biological data commonly have small sample sizes, missing values, and, most importantly, imbalanced class distributions. In this study we propose a particle swarm based hybrid system for remedying the class imbalance problem in medical and biological data mining. This hybrid system combines the particle swarm optimization (PSO) algorithm with multiple classifiers and evaluation metrics for evaluation fusion. Samples from the majority class are ranked using multiple objectives according to their merit in compensating for the class imbalance, and then combined with the minority class to form a balanced dataset.

Results: One important finding of this study is that different classifiers and metrics often provide different evaluation results. Nevertheless, the proposed hybrid system demonstrates consistent improvements over several alternative methods with three different metrics. The sampling results also demonstrate good generalization on different types of classification algorithms, indicating the advantage of information fusion applied in the hybrid system.

Conclusion: The experimental results demonstrate that, unlike many currently available methods, which often perform unevenly on different datasets, the proposed hybrid system has a better generalization property, which alleviates the method-data dependency problem. From the biological perspective, the system provides an indication for further investigation of the highly ranked samples, which may result in the discovery of new conditions or disease subtypes.
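
The ranking-and-balancing step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the PSO search and the classifiers are left out, the per-sample scores in `metric_scores` are assumed to be given, and fusing metrics by average rank is an assumption made here for illustration. The function names are hypothetical.

```python
from statistics import mean

def rank_positions(scores):
    """Return each item's rank for one metric (0 = best, higher score is better)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks

def balance_by_fused_rank(majority, minority, metric_scores):
    """Keep the best-ranked majority samples so both classes end up equal in size.

    metric_scores holds one list of per-sample scores per metric. These inputs
    are hypothetical; in the paper they would come from classifier evaluations.
    """
    per_metric_ranks = [rank_positions(scores) for scores in metric_scores]
    # Assumption: fuse the metrics by averaging each sample's rank.
    fused = [mean(ranks[i] for ranks in per_metric_ranks) for i in range(len(majority))]
    keep = sorted(range(len(majority)), key=lambda i: fused[i])[:len(minority)]
    return [majority[i] for i in sorted(keep)] + list(minority)
```

With five majority samples and two minority samples, the function keeps the two majority samples with the best average rank and appends the minority class unchanged.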


Molecular Cell | 2014

Histone-Fold Domain Protein NF-Y Promotes Chromatin Accessibility for Cell Type-Specific Master Transcription Factors

Andrew Oldfield; Pengyi Yang; Amanda E. Conway; Senthilkumar Cinghu; Johannes M. Freudenberg; Sailu Yellaboina; Raja Jothi

Cell type-specific master transcription factors (TFs) play vital roles in defining cell identity and function. However, the roles ubiquitous factors play in the specification of cell identity remain underappreciated. Here we show that the ubiquitous CCAAT-binding NF-Y complex is required for the maintenance of embryonic stem cell (ESC) identity and is an essential component of the core pluripotency network. Genome-wide studies in ESCs and neurons reveal that NF-Y regulates not only genes with housekeeping functions through cell type-invariant promoter-proximal binding, but also genes required for cell identity by binding to cell type-specific enhancers with master TFs. Mechanistically, NF-Y's distinct DNA-binding mode promotes master/pioneer TF binding at enhancers by facilitating a permissive chromatin conformation. Our studies unearth a conceptually unique function for histone-fold domain (HFD) protein NF-Y in promoting chromatin accessibility and suggest that other HFD proteins with analogous structural and DNA-binding properties may function in similar ways.


The EMBO Journal | 2014

Fip1 regulates mRNA alternative polyadenylation to promote stem cell self‐renewal

Brad Lackford; Chengguo Yao; Georgette M Charles; Lingjie Weng; Xiaofeng Zheng; Eun-A Choi; Xiaohui Xie; Ji Wan; Yi Xing; Johannes M. Freudenberg; Pengyi Yang; Raja Jothi; Guang Hu; Yongsheng Shi

mRNA alternative polyadenylation (APA) plays a critical role in post‐transcriptional gene control and is highly regulated during development and disease. However, the regulatory mechanisms and functional consequences of APA remain poorly understood. Here, we show that an mRNA 3′ processing factor, Fip1, is essential for embryonic stem cell (ESC) self‐renewal and somatic cell reprogramming. Fip1 promotes stem cell maintenance, in part, by activating the ESC‐specific APA profiles to ensure the optimal expression of a specific set of genes, including critical self‐renewal factors. Fip1 expression and the Fip1‐dependent APA program change during ESC differentiation and are restored to an ESC‐like state during somatic reprogramming. Mechanistically, we provide evidence that the specificity of Fip1‐mediated APA regulation depends on multiple factors, including Fip1‐RNA interactions and the distance between APA sites. Together, our data highlight the role for post‐transcriptional control in stem cell self‐renewal, provide mechanistic insight on APA regulation in development, and establish an important function for APA in cell fate specification.


BMC Bioinformatics | 2010

A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data

Pengyi Yang; Bing Bing Zhou; Zili Zhang; Albert Y. Zomaya

Background: Feature selection techniques are critical to the analysis of high-dimensional datasets. This is especially true in gene selection from microarray data, which commonly have an extremely high feature-to-sample ratio. In addition to essential objectives such as reducing data noise, reducing data redundancy, improving sample classification accuracy, and improving model generalization, feature selection also helps biologists to focus on the selected genes to further validate their biological hypotheses.

Results: In this paper we describe an improved hybrid system for gene selection. It is based on a recently proposed genetic ensemble (GE) system. To enhance the generalization property of the selected genes or gene subsets and to overcome the overfitting problem of the GE system, we devised a mapping strategy to fuse the goodness information of each gene provided by multiple filtering algorithms. This information is then used for the initialization and mutation operations of the genetic ensemble system.

Conclusion: We used four benchmark microarray datasets (including both binary-class and multi-class classification problems) for proof of concept and model evaluation. The experimental results indicate that the proposed multi-filter enhanced genetic ensemble (MF-GE) system is able to improve sample classification accuracy, generate more compact gene subsets, and converge to the selection results more quickly. The MF-GE system is very flexible, as various combinations of multiple filters and classifiers can be incorporated based on the data characteristics and the user preferences.
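
As a rough illustration of fusing gene scores from several filters and using the result to bias the initialization of a genetic algorithm, consider the sketch below. The abstract does not specify the mapping strategy, so min-max normalization followed by averaging is an assumption, as are the function names `fuse_filter_scores` and `init_chromosome`. The genetic ensemble itself is not shown.

```python
import random

def fuse_filter_scores(filter_scores):
    """Min-max normalize each filter's gene scores, then average across filters.

    Assumption: this simple fusion stands in for the paper's mapping strategy.
    """
    n_genes = len(filter_scores[0])
    fused = [0.0] * n_genes
    for scores in filter_scores:
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero for constant scores
        for gene, score in enumerate(scores):
            fused[gene] += (score - lo) / span / len(filter_scores)
    return fused

def init_chromosome(fused, size, rng):
    """Draw `size` distinct genes, favouring genes with higher fused scores."""
    candidates = list(range(len(fused)))
    weights = [score + 1e-9 for score in fused]  # keep every weight positive
    chosen = []
    for _ in range(size):
        pick = rng.choices(range(len(candidates)), weights=weights)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return sorted(chosen)
```

Calling `init_chromosome(fused, size, random.Random(0))` repeatedly would yield an initial population in which highly scored genes appear more often. The same weights could bias a mutation operator in a similar way.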


Pacific-Asia Conference on Knowledge Discovery and Data Mining | 2013

Ensemble-Based Wrapper Methods for Feature Selection and Class Imbalance Learning

Pengyi Yang; Wei Liu; Bing Bing Zhou; Sanjay Chawla; Albert Y. Zomaya

The wrapper feature selection approach is useful in identifying informative feature subsets from high-dimensional datasets. Typically, an inductive algorithm “wrapped” in a search algorithm is used to evaluate the merit of the selected features. However, significant bias may be introduced when dealing with highly imbalanced datasets. That is, the selected features may favour one class while being less useful to the other class. In this paper, we propose an ensemble-based wrapper approach for feature selection from data with highly imbalanced class distribution. The key idea is to create multiple balanced datasets from the original imbalanced dataset via sampling, and subsequently evaluate feature subsets using an ensemble of base classifiers each trained on a balanced dataset. The proposed approach provides a unified framework that incorporates ensemble feature selection and multiple sampling in a mutually beneficial way. The experimental results indicate that, overall, features selected by the ensemble-based wrapper are significantly better than those selected by wrappers with a single inductive algorithm in imbalanced data classification.
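
The key idea above (several balanced resamples, one base classifier per resample, and a feature subset scored by the ensemble) can be sketched as follows. This is an illustrative sketch, not the paper's method: the nearest-centroid base classifier, random undersampling, majority voting, balanced accuracy as the score, and a separate held-out test set are all assumptions made here, and every function name is hypothetical.

```python
import random

def centroid(rows, features):
    """Mean of each selected feature over the given rows."""
    return [sum(row[f] for row in rows) / len(rows) for f in features]

def train_centroid_classifier(pos, neg, features):
    """Nearest-centroid base classifier restricted to the chosen features."""
    cp, cn = centroid(pos, features), centroid(neg, features)
    def predict(row):
        dp = sum((row[f] - c) ** 2 for f, c in zip(features, cp))
        dn = sum((row[f] - c) ** 2 for f, c in zip(features, cn))
        return 1 if dp <= dn else 0
    return predict

def ensemble_wrapper_score(features, minority, majority, test_rows, test_labels,
                           n_members=5, seed=0):
    """Score a feature subset with an ensemble trained on balanced resamples.

    The minority class is labelled 1 and the majority class 0. The test set
    must contain at least one sample of each class.
    """
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        # Undersample the majority class down to the minority class size.
        balanced_majority = rng.sample(majority, len(minority))
        members.append(train_centroid_classifier(minority, balanced_majority, features))
    correct = {0: 0, 1: 0}
    total = {0: 0, 1: 0}
    for row, label in zip(test_rows, test_labels):
        votes = sum(member(row) for member in members)
        prediction = 1 if 2 * votes > len(members) else 0  # majority vote
        total[label] += 1
        correct[label] += int(prediction == label)
    # Balanced accuracy: the mean of the per-class recalls.
    return sum(correct[c] / total[c] for c in (0, 1)) / 2
```

A search procedure (for example, forward selection or a genetic algorithm) would call `ensemble_wrapper_score` on candidate feature subsets and keep the highest-scoring one.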


BMC Bioinformatics | 2011

Gene-gene interaction filtering with ensemble of filters.

Pengyi Yang; Joshua W. K. Ho; Yee Hwa Yang; Bing Bing Zhou

Background: Complex diseases are commonly caused by multiple genes and their interactions with each other. Genome-wide association (GWA) studies provide us the opportunity to capture those disease-associated genes and gene-gene interactions through panels of SNP markers. However, a proper filtering procedure is critical to reduce the search space prior to the computationally intensive gene-gene interaction identification step. In this study, we show that two commonly used SNP-SNP interaction filtering algorithms, ReliefF and tuned ReliefF (TuRF), are sensitive to the order of the samples in the dataset, giving rise to unstable and suboptimal results. However, we observe that the ‘unstable’ results from multiple runs of these algorithms can provide valuable information about the dataset. We therefore hypothesize that aggregating results from multiple runs of the algorithm may improve the filtering performance.

Results: We propose a simple and effective ensemble approach in which the results from multiple runs of an unstable filter are aggregated based on the general theory of ensemble learning. The ensemble versions of the ReliefF and TuRF algorithms, referred to as ReliefF-E and TuRF-E, are robust to sample order dependency and enable a more informative investigation of data characteristics. Using simulated and real datasets, we demonstrate that both the ensemble of ReliefF and the ensemble of TuRF can generate a much more stable SNP ranking than the original algorithms. Furthermore, the ensemble of TuRF achieved the highest success rate in comparison to many state-of-the-art algorithms as well as traditional χ²-test and odds ratio methods in terms of retaining gene-gene interactions.
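
The aggregation idea in this abstract can be sketched generically: run an order-sensitive filter on several shuffled copies of the dataset and combine the per-feature scores. The sketch assumes the filter is supplied as a function `score_fn` that maps a list of samples to one score per feature, and it assumes simple averaging as the aggregation rule. Neither ReliefF nor TuRF is implemented here, and `ensemble_filter` is a hypothetical name.

```python
import random
from statistics import mean

def ensemble_filter(score_fn, samples, n_runs=10, seed=0):
    """Average feature scores over runs on shuffled copies of the samples.

    Returns the averaged scores and the feature indices ranked best-first.
    """
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        shuffled = samples[:]      # copy so the caller's order is preserved
        rng.shuffle(shuffled)
        runs.append(score_fn(shuffled))
    n_features = len(runs[0])
    # Assumption: aggregate by the mean score of each feature across runs.
    averaged = [mean(run[j] for run in runs) for j in range(n_features)]
    ranking = sorted(range(n_features), key=lambda j: -averaged[j])
    return averaged, ranking
```

For a filter that does not depend on sample order, the averaged scores equal the single-run scores. For an order-sensitive filter, the averaging smooths out run-to-run variation, which is the effect the abstract attributes to the ensemble versions.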


IEEE Transactions on Systems, Man, and Cybernetics | 2014

Sample Subset Optimization Techniques for Imbalanced and Ensemble Learning Problems in Bioinformatics Applications

Pengyi Yang; Paul D. Yoo; Juanita Fernando; Bing Bing Zhou; Zili Zhang; Albert Y. Zomaya

Data sampling is a widely used technique in a broad range of machine learning problems. Traditional sampling approaches generally rely on random resampling from a given dataset. However, these approaches do not take into consideration additional information, such as sample quality and usefulness. We recently proposed a data sampling technique, called sample subset optimization (SSO). The SSO technique relies on a cross-validation procedure for identifying and selecting the most useful samples as subsets. In this paper, we describe the application of SSO techniques to imbalanced and ensemble learning problems, respectively. For imbalanced learning, the SSO technique is employed as an under-sampling technique for identifying a subset of highly discriminative samples in the majority class. In ensemble learning, the SSO technique is utilized as a generic ensemble technique where multiple optimized subsets of samples from each class are selected for building an ensemble classifier. We demonstrate the utilities and advantages of the proposed techniques on a variety of bioinformatics applications where class imbalance, small sample size, and noisy data are prevalent.

Collaboration


Dive into Pengyi Yang's collaborations.

Top Co-Authors

Raja Jothi

National Institutes of Health

Andrew Oldfield

National Institutes of Health

Senthilkumar Cinghu

National Institutes of Health
