Minghua Deng | Researchain

Publication

Featured research published by Minghua Deng.

Proceedings of the National Academy of Sciences of the United States of America | 2002

A dynamic programming algorithm for haplotype block partitioning

Kui Zhang; Minghua Deng; Ting Chen; Michael S. Waterman; Fengzhu Sun

We develop a dynamic programming algorithm for haplotype block partitioning to minimize the number of representative single nucleotide polymorphisms (SNPs) required to account for most of the common haplotypes in each block. Any measure of haplotype quality can be used in the algorithm and of course the measure should depend on the specific application. The dynamic programming algorithm is applied to analyze the chromosome 21 haplotype data of Patil et al. [Patil, N., Berno, A. J., Hinds, D. A., Barrett, W. A., Doshi, J. M., Hacker, C. R., Kautzer, C. R., Lee, D. H., Marjoribanks, C., McDonough, D. P., et al. (2001) Science 294, 1719–1723], who searched for blocks of limited haplotype diversity. Using the same criteria as in Patil et al., we identify a total of 3,582 representative SNPs and 2,575 blocks that are 21.5% and 37.7% smaller, respectively, than those identified using a greedy algorithm of Patil et al. We also apply the dynamic programming algorithm to the same data set based on haplotype diversity. A total of 3,982 representative SNPs and 1,884 blocks are identified to account for 95% of the haplotype diversity in each block.
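
As a rough illustration of the dynamic programming idea in the abstract above, the sketch below finds the partition of consecutive SNPs into blocks that minimizes a summed per-block cost. It is a minimal sketch, not the authors' implementation: `partition_blocks`, `block_cost`, and `toy_cost` are hypothetical names, and the toy cost table merely stands in for "number of representative SNPs needed for this block", which in practice would be computed from haplotype data.

```python
from math import inf


def partition_blocks(n_snps, block_cost):
    """Split SNPs 0..n_snps-1 into consecutive blocks with minimal total cost.

    block_cost(i, j) must return the cost of treating SNPs i..j (inclusive)
    as one block, or None when that range is not an acceptable block.
    """
    best = [0] + [inf] * n_snps   # best[j]: minimal cost for the first j SNPs
    start = [0] * (n_snps + 1)    # start[j]: first SNP of the last block in that optimum
    for j in range(1, n_snps + 1):
        for i in range(j):
            cost = block_cost(i, j - 1)
            if cost is None:
                continue
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                start[j] = i
    if best[n_snps] == inf:
        raise ValueError("no valid partition exists for this cost function")
    # Trace back the block boundaries from the end of the sequence.
    blocks = []
    j = n_snps
    while j > 0:
        i = start[j]
        blocks.append((i, j - 1))
        j = i
    blocks.reverse()
    return best[n_snps], blocks


def toy_cost(i, j):
    """Hypothetical cost: blocks of at most 3 SNPs, costing 1, 1, or 2."""
    length = j - i + 1
    if length > 3:
        return None
    return {1: 1, 2: 1, 3: 2}[length]


total, blocks = partition_blocks(7, toy_cost)
print(total, blocks)
```

The recursion considers every possible start of the last block, so it runs in quadratic time in the number of SNPs when each cost lookup is constant time.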

Pacific Symposium on Biocomputing | 2003

Kernel-based data fusion and its application to protein function prediction in yeast.

Gert R. G. Lanckriet; Minghua Deng; Nello Cristianini; Michael I. Jordan; William Stafford Noble

Kernel methods provide a principled framework in which to represent many types of data, including vectors, strings, trees and graphs. As such, these methods are useful for drawing inferences about biological phenomena. We describe a method for combining multiple kernel representations in an optimal fashion, by formulating the problem as a convex optimization problem that can be solved using semidefinite programming techniques. The method is applied to the problem of predicting yeast protein functional classifications using a support vector machine (SVM) trained on five types of data. For this problem, the new method performs better than a previously-described Markov random field method, and better than the SVM trained on any single type of data.

Genome Biology | 2008

A critical assessment of Mus musculus gene function prediction using integrated genomic evidence

Lourdes Peña-Castillo; Murat Tasan; Chad L. Myers; Hyunju Lee; Trupti Joshi; Chao Zhang; Yuanfang Guan; Michele Leone; Andrea Pagnani; Wan-Kyu Kim; Chase Krumpelman; Weidong Tian; Guillaume Obozinski; Yanjun Qi; Guan Ning Lin; Gabriel F. Berriz; Francis D. Gibbons; Gert R. G. Lanckriet; Jian-Ge Qiu; Charles E. Grant; Zafer Barutcuoglu; David P. Hill; David Warde-Farley; Chris Grouios; Debajyoti Ray; Judith A. Blake; Minghua Deng; Michael I. Jordan; William Stafford Noble; Quaid Morris

Background: Several years after sequencing the human genome and the mouse genome, much remains to be discovered about the functions of most human and mouse genes. Computational prediction of gene function promises to help focus limited experimental resources on the most likely hypotheses. Several algorithms using diverse genomic data have been applied to this task in model organisms; however, the performance of such approaches in mammals has not yet been evaluated. Results: In this study, a standardized collection of mouse functional genomic data was assembled; nine bioinformatics teams used this data set to independently train classifiers and generate predictions of function, as defined by Gene Ontology (GO) terms, for 21,603 mouse genes; and the best performing submissions were combined in a single set of predictions. We identified strengths and weaknesses of current functional genomic data sets and compared the performance of function prediction algorithms. This analysis inferred functions for 76% of mouse genes, including 5,000 currently uncharacterized genes. At a recall rate of 20%, a unified set of predictions averaged 41% precision, with 26% of GO terms achieving a precision better than 90%. Conclusion: We performed a systematic evaluation of diverse, independently developed computational approaches for predicting gene function from heterogeneous data sources in mammals. The results show that currently available data for mammals allows predictions with both breadth and accuracy. Importantly, many highly novel predictions emerge for the 38% of mouse genes that remain uncharacterized.

Journal of Computational Biology | 2004

An integrated probabilistic model for functional prediction of proteins.

Minghua Deng; Ting Chen; Fengzhu Sun

We develop an integrated probabilistic model to combine protein physical interactions, genetic interactions, highly correlated gene expression networks, protein complex data, and domain structures of individual proteins to predict protein functions. The model is an extension of our previous model for protein function prediction based on Markovian random field theory. The model is flexible in that other protein pairwise relationship information and features of individual proteins can be easily incorporated. Two features distinguish the integrated approach from other available methods for protein function prediction. One is that the integrated approach uses all available sources of information with different weights for different sources of data. It is a global approach that takes the whole network into consideration. The second feature is that the posterior probability that a protein has the function of interest is assigned. The posterior probability indicates how confident we are about assigning the function to the protein. We apply our integrated approach to predict functions of yeast proteins based upon MIPS protein function classifications and upon the interaction networks based on MIPS physical and genetic interactions, gene expression profiles, tandem affinity purification (TAP) protein complex data, and protein domain information. We study the recall and precision of the integrated approach using different sources of information by the leave-one-out approach. In contrast to using MIPS physical interactions only, the integrated approach combining all of the information increases the recall from 57% to 87% when the precision is set at 57%, an increase of 30 percentage points.

Bioinformatics | 2004

Mapping gene ontology to proteins based on protein-protein interaction data

Minghua Deng; Zhidong Tu; Fengzhu Sun; Ting Chen

MOTIVATION: The Gene Ontology (GO) consortium provides a structural description of protein function that is used as a common language for gene annotation in many organisms. Large-scale techniques have generated many valuable protein-protein interaction datasets that are useful for the study of protein function. Combining both GO and protein-protein interaction data allows the prediction of function for unknown proteins. RESULTS: We apply a Markov random field method to the prediction of yeast protein function based on multiple protein-protein interaction datasets. We assign function to unknown proteins with a probability representing the confidence of this prediction. The functions are based on three general categories of cellular component, molecular function and biological process defined in GO. The yeast proteins are defined in the Saccharomyces Genome Database (SGD). The protein-protein interaction datasets are obtained from the Munich Information Center for Protein Sequences (MIPS), including physical interactions and genetic interactions. The efficiency of our prediction is measured by applying the leave-one-out validation procedure to a functional path matching scheme, which compares the prediction with the GO description of a protein's function from the abstract level to the detailed level along the GO structure. For biological process, the leave-one-out validation procedure shows 52% precision and recall of our method, much better than that of the simple guilt-by-association methods.

BMC Bioinformatics | 2006

An integrated approach to the prediction of domain-domain interactions

Hyunju Lee; Minghua Deng; Fengzhu Sun; Ting Chen

Background: The development of high-throughput technologies has produced several large scale protein interaction data sets for multiple species, and significant efforts have been made to analyze the data sets in order to understand protein activities. Considering that the basic units of protein interactions are domain interactions, it is crucial to understand protein interactions at the level of the domains. The availability of many diverse biological data sets provides an opportunity to discover the underlying domain interactions within protein interactions through an integration of these biological data sets. Results: We combine protein interaction data sets from multiple species, molecular sequences, and gene ontology to construct a set of high-confidence domain-domain interactions. First, we propose a new measure, the expected number of interactions for each pair of domains, to score domain interactions based on protein interaction data in one species and show that its performance is similar to that of the E-value defined by Riley et al. [1]. Our new measure is applied to the protein interaction data sets from yeast, worm, fruitfly and humans. Second, information on pairs of domains that coexist in known proteins and on pairs of domains with the same gene ontology function annotations is incorporated to construct a high-confidence set of domain-domain interactions using a Bayesian approach. Finally, we evaluate the set of domain-domain interactions by comparing predicted domain interactions with those defined in the iPfam database [2, 3], which were derived based on protein structures. The accuracy of predicted domain interactions is also confirmed by comparing with experimentally obtained domain interactions from H. pylori [4]. As a result, a total of 2,391 high-confidence domain interactions are obtained and these domain interactions are used to unravel detailed protein and domain interactions in several protein complexes. Conclusion: Our study shows that integration of multiple biological data sets based on the Bayesian approach provides a reliable framework to predict domain interactions. By integrating multiple data sources, the coverage and accuracy of predicted domain interactions can be significantly increased.

Physica D: Nonlinear Phenomena | 2006

Stochastic model of yeast cell-cycle network

Yuping Zhang; Minping Qian; Qi Ouyang; Minghua Deng; Fangting Li; Chao Tang

Biological functions in living cells are controlled by protein interaction and genetic networks. These molecular networks should be dynamically stable against various fluctuations which are inevitable in the living world. In this paper, we propose and study a stochastic model for the network regulating the cell cycle of the budding yeast. The stochasticity in the model is controlled by a temperature-like parameter β. Our simulation results show that both the biological stationary state and the biological pathway are stable for a wide range of “temperature”. There is, however, a sharp transition-like behavior at β_c, below which the dynamics are dominated by noise. We also define a pseudo energy landscape for the system in which the biological pathway can be seen as a deep valley.
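
To make the role of a temperature-like noise parameter concrete, here is a minimal sketch of a stochastic update for a binary regulatory network. The abstract does not give the update rule, so everything here is an assumption for illustration: the logistic switching probability, the names `on_probability` and `step`, the parameter `beta`, and the two-node weight matrix are all hypothetical and are not taken from the paper.

```python
import math
import random


def on_probability(net_input, beta):
    """Assumed rule: P(on) = e^(beta*T) / (e^(beta*T) + e^(-beta*T)).

    Written with tanh so that large |beta * net_input| cannot overflow.
    """
    return 0.5 * (1.0 + math.tanh(beta * net_input))


def step(state, weights, beta, rng):
    """One synchronous stochastic update of a 0/1 network state."""
    n = len(state)
    new_state = []
    for i in range(n):
        # Summed regulatory input to node i (positive activates, negative represses).
        net_input = sum(weights[i][j] * state[j] for j in range(n))
        new_state.append(1 if rng.random() < on_probability(net_input, beta) else 0)
    return new_state


# Hypothetical two-node network, for illustration only.
weights = [[1, -1],
           [1, 1]]
rng = random.Random(0)
print(step([1, 0], weights, beta=100.0, rng=rng))  # large beta: nearly deterministic
print(on_probability(1, beta=0.0))                 # beta = 0: pure coin flip
```

Under this assumed rule, a large `beta` makes each node follow its regulatory input almost deterministically, while a small `beta` lets noise dominate, which mirrors the qualitative behavior the abstract describes.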

Bioinformatics | 2011

A Lasso regression model for the construction of microRNA-target regulatory networks

Yiming Lu; Yang Zhou; Wubin Qu; Minghua Deng; Chenggang Zhang

MOTIVATION: MicroRNAs have recently emerged as a major class of regulatory molecules involved in a broad range of biological processes and complex diseases. Construction of miRNA-target regulatory networks can provide useful information for the study and diagnosis of complex diseases. Many sequence-based and evolutionary information-based methods have been developed to identify miRNA-mRNA targeting relationships. However, as the amount of available miRNA and gene expression data grows, a more statistical and systematic method combining sequence-based binding predictions and expression-based correlation data becomes necessary for the accurate identification of miRNA-mRNA pairs. RESULTS: We propose a Lasso regression model for the identification of miRNA-mRNA targeting relationships that combines sequence-based prediction information, miRNA co-regulation, RISC availability and miRNA/mRNA abundance data. By comparing this modelling approach with two other known methods applied to three different datasets, we found that the Lasso regression model has considerable advantages in both sensitivity and specificity. The regression coefficients in the model can be used to determine the true regulatory efficacies in tissues, as was demonstrated using the miRNA target site type data. Finally, by constructing the miRNA regulatory networks in two stages of prostate cancer (PCa), we found several significant miRNA-hubbed network modules associated with PCa metastasis. In conclusion, the Lasso regression model is a robust and informative tool for constructing the miRNA regulatory networks for diagnosis and treatment of complex diseases. AVAILABILITY: The R program for predicting miRNA-mRNA targeting relationships using the Lasso regression model is freely available, along with the described datasets and resulting regulatory network, at http://biocompute.bmi.ac.cn/CZlab/alarmnet/. The source code is open for modification and application to other miRNA/mRNA expression datasets. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
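
For readers unfamiliar with Lasso regression, the sketch below fits a generic L1-penalized least-squares model by coordinate descent. It is not the authors' R program and does not include the sequence, co-regulation, or RISC terms mentioned in the abstract; the function names, the objective scaling (1/(2n) times the squared error plus `lam` times the L1 norm), and the toy data are assumptions chosen for illustration.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/(2n)) * ||y - Xw||^2 + lam * ||w||_1 by cyclic updates."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # squared norm of feature j
            for i in range(n):
                pred_without_j = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - pred_without_j)
                z += X[i][j] ** 2
            if z == 0.0:
                w[j] = 0.0  # an all-zero feature carries no information
                continue
            w[j] = soft_threshold(rho / n, lam) / (z / n)
    return w


# Toy data (hypothetical): feature 0 has a strong effect, feature 1 a weak one.
X = [[1, 0], [0, 1], [1, 0], [0, 1]]
y = [2.0, 0.1, 2.0, 0.1]
print(lasso_coordinate_descent(X, y, lam=0.1))
```

The L1 penalty sets the weak coefficient exactly to zero while shrinking the strong one, which is the sparsity property that makes Lasso attractive for selecting a small set of candidate regulatory relationships.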

Briefings in Bioinformatics | 2014

New developments of alignment-free sequence comparison: measures, statistics and next-generation sequencing

Kai Song; Jie Ren; Gesine Reinert; Minghua Deng; Michael S. Waterman; Fengzhu Sun

With the development of next-generation sequencing (NGS) technologies, a large amount of short read data has been generated. Assembly of these short reads can be challenging for genomes and metagenomes without template sequences, making alignment-based genome sequence comparison difficult. In addition, sequence reads from NGS can come from different regions of various genomes and they may not be alignable. Sequence signature-based methods for genome comparison based on the frequencies of word patterns in genomes and metagenomes can potentially be useful for the analysis of short reads data from NGS. Here we review the recent development of alignment-free genome and metagenome comparison based on the frequencies of word patterns with emphasis on the dissimilarity measures between sequences, the statistical power of these measures when two sequences are related and the applications of these measures to NGS data.

BMC Genomics | 2012

Comparison of metagenomic samples using sequence signatures.

Bai Jiang; Kai Song; Jie Ren; Minghua Deng; Fengzhu Sun; Xuegong Zhang

Background: Sequence signatures, as defined by the frequencies of k-tuples (or k-mers, k-grams), have been used extensively to compare genomic sequences of individual organisms, to identify cis-regulatory modules, and to study the evolution of regulatory sequences. Recently many next-generation sequencing (NGS) read data sets of metagenomic samples from a variety of different environments have been generated. The assembly of these reads can be difficult and analysis methods based on mapping reads to genes or pathways are also restricted by the availability and completeness of existing databases. Sequence-signature-based methods, however, do not need the complete genomes or existing databases and thus can potentially be very useful for the comparison of metagenomic samples using NGS read data. Still, the applications of sequence signature methods for the comparison of metagenomic samples have not been well studied. Results: We studied several dissimilarity measures, including d2, d2* and d2S recently developed by our group, a measure (hereinafter noted as Hao) used in CVTree developed by Hao’s group (Qi et al., 2004), measures based on relative di-, tri-, and tetra-nucleotide frequencies as in Willner et al. (2009), as well as standard lp measures between the frequency vectors, for the comparison of metagenomic samples using sequence signatures. We compared their performance using a series of extensive simulations and three real NGS metagenomic datasets: 39 fecal samples from 33 mammalian host species, 56 marine samples across the world, and 13 fecal samples from human individuals. Results showed that the dissimilarity measure d2S can achieve superior performance when comparing metagenomic samples by clustering them into different groups as well as recovering environmental gradients affecting microbial samples. New insights into the environmental factors affecting microbial compositions in metagenomic samples are obtained through the analyses. Our results show that sequence signatures of the mammalian gut are closely associated with diet and gut physiology of the mammals, and that sequence signatures of marine communities are closely related to location and temperature. Conclusions: Sequence signatures can successfully reveal major group and gradient relationships among metagenomic samples from NGS reads without alignment to reference databases. The d2S dissimilarity measure is a good choice in all application scenarios. The optimal choice of tuple size depends on sequencing depth, but it is quite robust within a range of choices for moderate sequencing depths.
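
As a small illustration of what a sequence-signature comparison looks like, the sketch below counts k-tuples in two sequences and computes a d2-style dissimilarity from the count vectors. The normalization used here (half of one minus the cosine of the angle between the count vectors) is one common way to write d2 and is stated as an assumption; the d2* and d2S measures additionally require a background model for expected k-tuple counts and are not shown. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping k-tuple in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2_dissimilarity(seq_a, seq_b, k):
    """Assumed form: 0.5 * (1 - <X, Y> / (|X| * |Y|)) on k-tuple count vectors."""
    a = kmer_counts(seq_a, k)
    b = kmer_counts(seq_b, k)
    dot = sum(a[word] * b[word] for word in a)  # missing words count as 0
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must contain at least one k-tuple")
    return 0.5 * (1.0 - dot / (norm_a * norm_b))


print(d2_dissimilarity("ACGTACGT", "ACGTACGT", 2))  # identical signatures
print(d2_dissimilarity("AAAA", "CCCC", 2))          # no shared k-tuples
```

For read data, the same counts would be accumulated over all reads in a sample rather than over a single sequence, so no assembly or reference database is needed.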
