
Chengbin Peng

King Abdullah University of Science and Technology

Publication

Featured research published by Chengbin Peng.

Information Sciences | 2012

Evolutionary multimodal optimization using the principle of locality

Ka-Chun Wong; Chun-Ho Wu; Ricky K. P. Mok; Chengbin Peng; Zhaolei Zhang

The principle of locality is one of the most widely used concepts in designing computing systems. To explore the principle in evolutionary computation, crowding differential evolution is incorporated with locality for multimodal optimization. Instead of generating trial vectors randomly, the first method proposed takes advantage of spatial locality to generate trial vectors. Temporal locality is also adopted to help generate offspring in the second method proposed. Temporal and spatial locality are then applied together in the third method proposed. Numerical experiments are conducted to compare the proposed methods with the state-of-the-art methods on benchmark functions. Experimental analysis is undertaken to observe the effect of locality and the synergy between temporal locality and spatial locality. Further experiments are also conducted on two application problems. One is the varied-line-spacing holographic grating design problem, while the other is the protein structure prediction problem. The numerical results demonstrate the effectiveness of the methods proposed.
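For orientation, the baseline that this abstract builds on can be sketched as follows. This is a minimal sketch of plain crowding differential evolution (DE/rand/1/bin with nearest-neighbour replacement), not the locality-based variants proposed in the paper. The function name `crowding_de`, the parameter defaults, and the test function `five_peaks` are assumptions made for illustration only.

```python
import math
import random


def crowding_de(fitness, bounds, pop_size=30, generations=100, f=0.5, cr=0.9, seed=0):
    """Search for several maxima of `fitness` with DE/rand/1/bin plus crowding."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    for _ in range(generations):
        for i in range(pop_size):
            # Baseline behaviour: pick three distinct individuals (not i) at random.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate comes from the mutant
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                else:
                    value = pop[i][d]
                trial.append(min(max(value, lo), hi))  # clip to the search box
            # Crowding: the trial competes with its nearest neighbour, not its parent.
            nearest = min(range(pop_size), key=lambda j: math.dist(trial, pop[j]))
            if fitness(trial) > fitness(pop[nearest]):
                pop[nearest] = trial
    return pop


def five_peaks(x):
    """Hypothetical 1-D test function with five equal peaks on [0, 1]."""
    return math.sin(5 * math.pi * x[0]) ** 6


final_pop = crowding_de(five_peaks, bounds=[(0.0, 1.0)])
```

Because a trial vector only ever replaces its nearest neighbour, individuals sitting on different peaks do not compete with each other, which is what lets the population keep several optima at once.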

Nucleic Acids Research | 2013

DNA motif elucidation using belief propagation

Ka-Chun Wong; Tak-Ming Chan; Chengbin Peng; Yue Li; Zhaolei Zhang

Protein-binding microarray (PBM) is a high-throughput platform that can measure the DNA-binding preference of a protein in a comprehensive and unbiased manner. A typical PBM experiment can measure binding signal intensities of a protein to all the possible DNA k-mers (k = 8 to 10); such comprehensive binding affinity data usually need to be reduced and represented as motif models before they can be further analyzed and applied. Since proteins can often bind to DNA in multiple modes, one of the major challenges is to decompose the comprehensive affinity data into multimodal motif representations. Here, we describe a new algorithm that uses Hidden Markov Models (HMMs) and can derive precise and multimodal motifs using belief propagations. We describe an HMM-based approach using belief propagations (kmerHMM), which accepts and preprocesses PBM probe raw data into median-binding intensities of individual k-mers. The k-mers are ranked and aligned for training an HMM as the underlying motif representation. Multiple motifs are then extracted from the HMM using belief propagations. Comparisons of kmerHMM with other leading methods on several data sets demonstrated its effectiveness and uniqueness. In particular, it achieved the best performance on more than half of the data sets. In addition, the multiple binding modes derived by kmerHMM are biologically meaningful and will be useful in interpreting other genome-wide data such as those generated from ChIP-seq. The executables and source code are available at the authors' websites, e.g. http://www.cs.toronto.edu/~wkc/kmerHMM.
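The preprocessing step mentioned in this abstract (reducing raw probe data to median binding intensities of individual k-mers, then ranking them) can be sketched roughly as below. This is a simplified illustration, not the kmerHMM implementation: the function names, the toy probes, and the choice to count a k-mer once per probe are assumptions, and reverse complements and signal normalization are ignored.

```python
from collections import defaultdict
from statistics import median


def kmer_median_intensities(probes, k):
    """Return {k-mer: median intensity over all probes that contain it}."""
    signals = defaultdict(list)
    for sequence, intensity in probes:
        # Assumption: a k-mer counts once per probe, even if it occurs repeatedly.
        kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
        for kmer in kmers:
            signals[kmer].append(intensity)
    return {kmer: median(values) for kmer, values in signals.items()}


def rank_kmers(medians):
    """Sort k-mers from strongest to weakest median intensity (ties alphabetical)."""
    return sorted(medians, key=lambda kmer: (-medians[kmer], kmer))


# Hypothetical toy probes: (probe sequence, raw signal intensity).
toy_probes = [("ACGTA", 10.0), ("CGTAC", 30.0), ("TTTTT", 2.0)]
medians = kmer_median_intensities(toy_probes, k=3)
ranking = rank_kmers(medians)
```

In a real PBM data set k would be in the 8 to 10 range stated above; k=3 is used here only to keep the toy example readable.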

Soft Computing | 2011

Generalizing and learning protein-DNA binding sequence representations by an evolutionary algorithm

Ka-Chun Wong; Chengbin Peng; Man Hon Wong; Kwong-Sak Leung

Protein-DNA bindings are essential activities. Understanding them forms the basis for further deciphering of biological and genetic systems. In particular, the protein-DNA bindings between transcription factors (TFs) and transcription factor binding sites (TFBSs) play a central role in gene transcription. Comprehensive TF-TFBS binding sequence pairs have been found in a recent study. However, they are in one-to-one mappings, which cannot fully reflect the many-to-many mappings within the bindings. An evolutionary algorithm is proposed to learn generalized representations (many-to-many mappings) from the TF-TFBS binding sequence pairs (one-to-one mappings). The generalized pairs are shown to be more meaningful than the original TF-TFBS binding sequence pairs. Some representative examples have been analyzed in this study. In particular, the analysis shows that the TF-TFBS binding sequence pairs are not necessarily one-to-one mappings; they can also exhibit many-to-many mappings. The proposed method can help us extract such many-to-many information from the one-to-one TF-TFBS binding sequence pairs found in the previous study, providing further knowledge in understanding the bindings between TFs and TFBSs.

Nucleic Acids Research | 2015

Computational learning on specificity-determining residue-nucleotide interactions

Ka-Chun Wong; Yue Li; Chengbin Peng; Alan M. Moses; Zhaolei Zhang

The protein–DNA interactions between transcription factors and transcription factor binding sites are essential activities in gene regulation. To decipher the binding codes, it is a long-standing challenge to understand the binding mechanism across different transcription factor DNA binding families. Past computational learning studies usually focus on learning and predicting the DNA binding residues on the protein side. Taking into account both sides (protein and DNA), we propose and describe a computational study for learning the specificity-determining residue-nucleotide interactions of different known DNA-binding domain families. The proposed learning models are compared to state-of-the-art models comprehensively, demonstrating their competitive learning performance. In addition, we describe and propose two applications which demonstrate how the learnt models can provide meaningful insights into protein–DNA interactions across different DNA binding families.

Applied Soft Computing | 2014

Herd Clustering: A synergistic data clustering approach using collective intelligence

Ka-Chun Wong; Chengbin Peng; Yue Li; Tak-Ming Chan

Traditional data mining methods emphasize analytical abilities to decipher data, assuming that data are static during a mining process. We challenge this assumption, arguing that we can improve the analysis by vitalizing data. In this paper, this principle is used to develop a new clustering algorithm. Inspired by herd behavior, the clustering method is a synergistic approach using collective intelligence called Herd Clustering (HC). The novelty lies in its first stage, where data instances are represented by moving particles. Particles attract each other locally and form clusters by themselves, as shown in the case studies reported. To demonstrate its effectiveness, the performance of HC is compared to other state-of-the-art clustering methods on more than thirty datasets using four performance metrics. An application for DNA motif discovery is also conducted. The results support the effectiveness of HC and thus the underlying philosophy.

Bioinformatics | 2016

Identification of coupling DNA motif pairs on long-range chromatin interactions in human K562 cells

Ka-Chun Wong; Yue Li; Chengbin Peng

MOTIVATION: The protein-DNA interactions between transcription factors (TFs) and transcription factor binding sites (TFBSs, also known as DNA motifs) are critical activities in gene transcription. The identification of the DNA motifs is a vital task for downstream analysis. Unfortunately, the long-range coupling information between different DNA motifs is still lacking. To fill the void, as the first-of-its-kind study, we have identified the coupling DNA motif pairs on long-range chromatin interactions in human cells. RESULTS: The coupling DNA motif pairs exhibit substantially higher DNase accessibility than the background sequences. Half of the DNA motifs involved are matched to the existing motif databases, although nearly all of them are enriched with at least one gene ontology term. Their motif instances are also found statistically enriched on the promoter and enhancer regions. In particular, we introduce a novel measurement called motif pairing multiplicity, which is defined as the number of motifs that are paired with a given motif on chromatin interactions. Interestingly, we observe that motif pairing multiplicity is linked to several characteristics such as regulatory region type, motif sequence degeneracy, DNase accessibility and pairing genomic distance. Taken together, we believe the coupling DNA motif pairs identified in this study can shed light on the gene transcription mechanism under long-range chromatin interactions. AVAILABILITY AND IMPLEMENTATION: The identified motif pair data is compressed and available in the supplementary materials associated with this manuscript. CONTACT: [email protected] SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
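The motif pairing multiplicity defined in this abstract (the number of motifs paired with a given motif) amounts to counting distinct partners per motif. A minimal sketch, assuming the motif pairs are available as a list of unordered pairs; the motif names below are hypothetical, and how self-pairs or repeated pairs are handled in the study is not stated here, so this version simply treats repeated pairs as one.

```python
from collections import defaultdict


def motif_pairing_multiplicity(pairs):
    """Map each motif to the number of distinct motifs it is paired with."""
    partners = defaultdict(set)
    for a, b in pairs:
        # Pairs are treated as unordered, so record the partner in both directions.
        partners[a].add(b)
        partners[b].add(a)
    return {motif: len(others) for motif, others in partners.items()}


# Hypothetical motif pairs; the repeated ("M1", "M2") pair is counted once.
example_pairs = [("M1", "M2"), ("M1", "M3"), ("M2", "M3"), ("M1", "M4"), ("M1", "M2")]
multiplicity = motif_pairing_multiplicity(example_pairs)
```

Under these assumptions, `multiplicity["M1"]` is 3 because M1 is paired with M2, M3 and M4.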

International Conference on Data Mining | 2012

Multiplicative Algorithms for Constrained Non-negative Matrix Factorization

Chengbin Peng; Ka-Chun Wong; Alyn Rockwood; Xiangliang Zhang; Jinling Jiang; David E. Keyes

Non-negative matrix factorization (NMF) provides the advantage of parts-based data representation through additive-only combinations. It has been widely adopted in areas such as item recommendation, text mining, data clustering, and speech denoising. In this paper, we provide an algorithm that allows the factorization to have linear or approximately linear constraints with respect to each factor. We prove that if the constraint function is linear, algorithms within our multiplicative framework will converge. This theory supports a large variety of equality and inequality constraints, and can facilitate application of NMF to a much larger domain. Taking the recommender system as an example, we demonstrate how a specialized weighted and constrained NMF algorithm can be developed to fit the problem exactly, and the tests show that our constraints improve the performance for both weighted and unweighted NMF algorithms under several different metrics. In particular, on the MovieLens data with 94% of items, the Constrained NMF improves the recall rate by 3% compared to SVD50 and by 45% compared to SVD150, which were reported as the best two in the top-N metric.
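As background for the multiplicative framework mentioned in this abstract, the classic unconstrained multiplicative updates for NMF (Lee and Seung, Frobenius-norm objective) can be sketched as follows. This is not the constrained algorithm of the paper; the helper names, the small `eps` guard against division by zero, and the toy matrix are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))


def nmf(v, rank, iterations=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return w, h


# Hypothetical toy input: a rank-1 non-negative matrix.
v = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
w, h = nmf(v, rank=1)
```

Each update multiplies the current factor by a ratio of non-negative terms, so W and H stay non-negative without any projection step.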

IEEE Transactions on Systems, Man, and Cybernetics | 2017

Evolving Transcription Factor Binding Site Models From Protein Binding Microarray Data

Ka-Chun Wong; Chengbin Peng; Yue Li

Protein binding microarray (PBM) is a high-throughput platform that can measure the DNA binding preference of a protein in a comprehensive and unbiased manner. In this paper, we describe the PBM motif model building problem. We apply several evolutionary computation methods and compare their performance with the interior point method, demonstrating their performance advantages. In addition, given the PBM domain knowledge, we propose and describe a novel method called kmerGA, which makes domain-specific assumptions to exploit PBM data properties and build more accurate models than the other methods. The effectiveness and robustness of kmerGA are supported by comprehensive performance benchmarking on more than 200 datasets, time complexity analysis, convergence analysis, parameter analysis, and case studies. To demonstrate its utility further, kmerGA is applied to two real world applications: 1) PBM rotation testing and 2) ChIP-Seq peak sequence prediction. The results support the biological relevance of the models learned by kmerGA, and thus its real world applicability.

Archive | 2011

Urgent Epidemic Control Mechanism for Aviation Networks

Chengbin Peng; Shengbin Wang; Meixia Shi; Xiaogang Jin

In the current century, the highly developed transportation system can not only boost the economy, but also greatly accelerate the spreading of epidemics. While some epidemic diseases may infect quite a number of people ahead of our awareness, the health care resources such as vaccines and the medical staff are usually locally or even globally insufficient. In this research, with the network of major aviation routes as an example, we present a method to determine the optimal locations to allocate the medical service in order to minimize the impact of the infectious disease with limited resources. Specifically, we demonstrate that when the medical resources are insufficient, we should concentrate our efforts on the travelers with the objective of effectively controlling the spreading rate of the epidemic diseases.

IEEE Transactions on Nanobioscience | 2017

Probabilistic Inference on Multiple Normalized Genome-Wide Signal Profiles With Model Regularization

Ka-Chun Wong; Chengbin Peng; Shankai Yan; Cheng Liang

Understanding genome-wide protein-DNA interaction signals forms the basis for further focused studies in gene regulation. In particular, the chromatin immunoprecipitation with massively parallel DNA sequencing technology (ChIP-Seq) can enable us to measure the in vivo genome-wide occupancy of the DNA-binding protein of interest in a single run. Multiple ChIP-Seq runs thus hold the potential for us to decipher the combinatorial occupancies of multiple DNA-binding proteins. To handle the genome-wide signal profiles from those multiple runs, we propose to integrate regularized regression functions (i.e., LASSO, Elastic Net, and Ridge Regression) into the well-established SignalRanker and FullSignalRanker frameworks, resulting in six additional probabilistic models for inference on multiple normalized genome-wide signal profiles. The corresponding model training algorithms are devised with computational complexity analysis. Comprehensive benchmarking is conducted to demonstrate and compare the performance of nine related probabilistic models on the ENCODE ChIP-Seq datasets. The results indicate that the regularized SignalRanker models, in contrast to the original SignalRanker models, can demonstrate excellent inference performance comparable to the FullSignalRanker models with low model complexities and time complexities. Such a feature is especially valuable in the context of the rapidly growing genome-wide signal profile data in recent years.
