Mingfu Shao
École Polytechnique Fédérale de Lausanne
Publication
Featured research published by Mingfu Shao.
Research in Computational Molecular Biology | 2012
Mingfu Shao; Yu Lin
Computing the edit distance between two genomes under certain operations is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be easily computed for genomes without duplicate genes. In this paper, we study the edit distance for genomes with duplicate genes under a model that includes DCJ operations, insertions and deletions. We prove that computing the edit distance is equivalent to finding the optimal cycle decomposition of the corresponding adjacency graph, and give an approximation algorithm with an approximation ratio of 1.5 + ε.
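For the duplicate-free case that the abstract above calls easy, the DCJ distance reduces to counting cycles. The sketch below is illustrative only and is not taken from the paper: it assumes both genomes consist solely of circular chromosomes over the same set of signed genes, in which case the standard formula is d = n - c, where n is the number of genes and c is the number of cycles formed by alternating adjacencies of the two genomes. The function names and the list-of-lists genome encoding are hypothetical choices made for this example.

```python
def extremities(gene):
    """Return (left, right) extremities of a signed gene as read along the chromosome."""
    g = abs(gene)
    return ((g, "t"), (g, "h")) if gene > 0 else ((g, "h"), (g, "t"))


def adjacency_map(genome):
    """Map each gene extremity to the extremity it is adjacent to.

    Assumption: `genome` is a list of circular chromosomes, each a list of
    signed integers, and no gene occurs more than once.
    """
    partner = {}
    for chrom in genome:
        for i, gene in enumerate(chrom):
            nxt = chrom[(i + 1) % len(chrom)]  # wrap around: chromosomes are circular
            right = extremities(gene)[1]
            left = extremities(nxt)[0]
            partner[right] = left
            partner[left] = right
    if len(partner) != 2 * sum(len(c) for c in genome):
        raise ValueError("duplicate genes are not supported in this sketch")
    return partner


def dcj_distance(genome_a, genome_b):
    """DCJ distance n - c for circular genomes with identical, duplicate-free gene content."""
    pa, pb = adjacency_map(genome_a), adjacency_map(genome_b)
    if pa.keys() != pb.keys():
        raise ValueError("genomes must contain the same genes")
    seen, cycles = set(), 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        while x not in seen:  # alternate one adjacency of A, then one of B
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    return len(pa) // 2 - cycles


print(dcj_distance([[1, 2, 3]], [[1, -2, 3]]))  # a single inversion
```

Each extremity is visited once, so the running time is linear in the number of genes. With duplicate genes the matching between copies is no longer fixed, which is where the cycle-decomposition problem studied in the paper arises.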
Research in Computational Molecular Biology | 2014
Mingfu Shao; Yu Lin; Bernard M. E. Moret
Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this paper, we propose an integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse and rat genomes, where once again our method outperforms MSOAR.
Journal of Computational Biology | 2015
Mingfu Shao; Yu Lin; Bernard M. E. Moret
Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this article, we propose an integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse, and rat genomes, where once again our method outperforms MSOAR.
Research in Computational Molecular Biology | 2013
Mingfu Shao; Yu Lin; Bernard M. E. Moret
We study the problem of sorting genomes under an evolutionary model that includes genomic rearrangements and segmental duplications. We propose an iterative algorithm to improve any initial evolutionary trajectory between two genomes in terms of parsimony. Our algorithm is based on a new graphical model, the trajectory graph, which models not only the final states of two genomes but also an existing evolutionary trajectory between them. We show that redundant rearrangements in the trajectory correspond to certain cycles in the trajectory graph, and prove that our algorithm converges to an optimal trajectory for any initial trajectory involving only rearrangements.
Bioinformatics | 2015
Mingfu Shao; Bernard M. E. Moret
Motivation: Large-scale evolutionary events such as genomic rearrangements and segmental duplications form an important part of the evolution of genomes and are widely studied from both biological and computational perspectives. A basic computational problem is to infer these events in the evolutionary history for given modern genomes, a task for which many algorithms have been proposed under various constraints. Algorithms that can handle both rearrangements and content-modifying events such as duplications and losses remain few and limited in their applicability. Results: We study the comparison of two genomes under a model including general rearrangements (through double-cut-and-join) and segmental duplications. We formulate the comparison as an optimization problem and describe an exact algorithm to solve it by using an integer linear program. We also devise a sufficient condition and an efficient algorithm to identify optimal substructures, which can simplify the problem while preserving optimality. Using the optimal substructures with the integer linear program (ILP) formulation yields a practical and exact algorithm to solve the problem. We then apply our algorithm to assign in-paralogs and orthologs (a necessary step in handling duplications) and compare its performance with that of the state-of-the-art method MSOAR, using both simulations and real data. On simulated datasets, our method outperforms MSOAR by a significant margin, and on five well-annotated species, MSOAR achieves high accuracy, yet our method performs slightly better on each of the 10 pairwise comparisons. Availability and implementation: http://lcbb.epfl.ch/softwares/coser.
BMC Bioinformatics | 2011
Mingfu Shao; Sheng Wang; Chao Wang; Xiongying Yuan; Shuai Cheng Li; Wei-Mou Zheng; Dongbo Bu
Background: Native structures of proteins are formed essentially due to the combined effects of local and distant (in the sense of sequence) interactions among residues. This interaction information is, explicitly or implicitly, encoded into the scoring function in protein structure prediction approaches: threading approaches usually measure an alignment by how well a sequence adopts an existing structure, while the energy functions in Ab Initio methods are designed to measure how likely a conformation is to be near-native. Encouraging progress has been observed in structure refinement, where knowledge-based or physics-based potentials are designed to capture distant interactions. Thus, it is interesting to investigate whether distant interaction information captured by the Ab Initio energy function can be used to improve threading, especially for weakly/distantly homologous templates. Results: In this paper, we investigate the possibility of improving alignment generation by incorporating distant interaction information into the alignment scoring function in a nontrivial way. Specifically, the distant interaction information is introduced by employing an Ab Initio energy function to evaluate the "partial" decoy built from an alignment. Subsequently, a local search algorithm is used to optimize the scoring function. Experimental results demonstrate that with distant interaction terms, the quality of generated alignments is improved on 68 out of 127 query-template pairs in the Prosup benchmark. In addition, compared with state-of-the-art threading methods, our method achieves better alignment accuracy. Conclusions: Incorporating Ab Initio energy functions into threading can greatly improve alignment accuracy.
Nature Biotechnology | 2017
Mingfu Shao; Carl Kingsford
We introduce Scallop, an accurate reference-based transcript assembler that improves reconstruction of multi-exon and lowly expressed transcripts. Scallop preserves long-range phasing paths extracted from reads, while producing a parsimonious set of transcripts and minimizing coverage deviation. On 10 human RNA-seq samples, Scallop produces 34.5% and 36.3% more correct multi-exon transcripts than StringTie and TransComb, and respectively identifies 67.5% and 52.3% more lowly expressed transcripts. Scallop achieves higher sensitivity and precision than previous approaches over a wide range of coverage thresholds.
Journal of Computational Biology | 2016
Mingfu Shao; Bernard M. E. Moret
A fundamental problem in comparative genomics is to compute the distance between two genomes. For two genomes without duplicate genes, we can easily compute a variety of distance measures in linear time, but the problem is NP-hard under most models when genomes contain duplicate genes. Sankoff proposed the use of exemplars to tackle the problem of duplicate genes and gene families: each gene family is represented by a single gene (the exemplar for that family), chosen so as to optimize some metric. Unfortunately, choosing exemplars is itself an NP-hard problem. In this article, we propose a very fast and exact algorithm to compute the exemplar breakpoint distance, based on new insights in the underlying structure of genome rearrangements and exemplars. We evaluate the performance of our algorithm on simulation data and compare its performance to the best effort to date (a divide-and-conquer approach), showing that our algorithm runs much faster and scales much better. We also devise a new algorithm for the intermediate breakpoint distance problem, which can then be applied to assign orthologs. We compare our algorithm with the state-of-the-art method MSOAR by assigning orthologs among five well annotated mammalian genomes, showing that our algorithm runs much faster and is slightly more accurate than MSOAR.
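To make the exemplar idea in the abstract above concrete, the following is a brute-force sketch, not the fast exact algorithm the article proposes. It assumes each genome is a single linear chromosome written as a list of signed integers, that a gene family is identified by the absolute value of the integer, and that the breakpoint distance counts adjacencies of the first genome that are absent from the second, ignoring chromosome ends. All function names are hypothetical.

```python
from itertools import product


def adjacencies(genome):
    """Set of adjacencies of a linear signed genome; (a, b) and (-b, -a) are the same adjacency."""
    return {min((a, b), (-b, -a)) for a, b in zip(genome, genome[1:])}


def breakpoint_distance(g1, g2):
    """Number of adjacencies of g1 not conserved in g2 (chromosome ends ignored)."""
    return len(adjacencies(g1) - adjacencies(g2))


def exemplars(genome):
    """Yield every genome obtained by keeping exactly one copy of each gene family."""
    positions = {}
    for i, g in enumerate(genome):
        positions.setdefault(abs(g), []).append(i)
    for choice in product(*positions.values()):
        keep = set(choice)
        yield [g for i, g in enumerate(genome) if i in keep]


def exemplar_breakpoint_distance(g1, g2):
    """Minimum breakpoint distance over all pairs of exemplar choices (exponential time)."""
    if {abs(g) for g in g1} != {abs(g) for g in g2}:
        raise ValueError("genomes must contain the same gene families")
    return min(
        breakpoint_distance(a, b) for a in exemplars(g1) for b in exemplars(g2)
    )


# Family 1 has two copies in the first genome; keeping the first copy gives [1, 2, 3].
print(exemplar_breakpoint_distance([1, 2, 1, 3], [1, 2, 3]))
```

The enumeration grows as the product of the family sizes in both genomes, which illustrates why choosing exemplars is hard in general and why a dedicated exact algorithm is of interest.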
Research in Computational Molecular Biology | 2015
Mingfu Shao; Bernard M. E. Moret
A fundamental problem in comparative genomics is to compute the distance between two genomes. For two genomes without duplicate genes, we can easily compute a variety of distance measures in linear time, but the problem is NP-hard under most models when genomes contain duplicate genes. Sankoff proposed the use of exemplars to tackle the problem of duplicate genes and gene families: each gene family is represented by a single gene (the exemplar for that family), chosen so as to optimize some metric. Unfortunately, choosing exemplars is itself an NP-hard problem. In this paper, we propose a very fast and exact algorithm to compute the exemplar breakpoint distance, based on new insights in the underlying structure of genome rearrangements and exemplars. We evaluate the performance of our algorithm on simulation data and compare its performance to the best effort to date (a divide-and-conquer approach), showing that our algorithm runs much faster and scales much better. We also devise a new algorithm for the intermediate breakpoint distance problem, which can then be applied to assign orthologs. We compare our algorithm with the state-of-the-art method MSOAR by assigning orthologs among five well annotated mammalian genomes, showing that our algorithm runs much faster and is slightly more accurate than MSOAR.
Asia Pacific Bioinformatics Conference | 2016
Nishanth Ulhas Nair; Laura Hunter; Mingfu Shao; Paulina Grnarova; Yu Lin; Philipp Bucher; Bernard M. E. Moret
Background: In cell differentiation, a less specialized cell differentiates into a more specialized one, even though all cells in one organism have (almost) the same genome. Epigenetic factors such as histone modifications are known to play a significant role in cell differentiation. We previously introduced cell-type trees to represent the differentiation of cells into more specialized types, a representation that partakes of both ontogeny and phylogeny. Results: We propose a maximum-likelihood (ML) approach to build cell-type trees and show that this ML approach outperforms our earlier distance-based and parsimony-based approaches. We then study the reconstruction of ancestral cell types; since both ancestral and derived cell types can coexist in adult organisms, we propose a lifting algorithm to infer internal nodes. We present results on our lifting algorithm obtained both through simulations and on real datasets. Conclusions: We show that our ML-based approach outperforms previously proposed techniques such as distance-based and parsimony-based methods. We show that our lifting-based approach works well on both simulated and real data.