Network


Latest external collaboration at the country level.

Hotspot


Dive into the research topics where Guohua Jin is active.

Publication


Featured research published by Guohua Jin.


Bioinformatics | 2006

Maximum likelihood of phylogenetic networks

Guohua Jin; Luay Nakhleh; Sagi Snir; Tamir Tuller

MOTIVATION Horizontal gene transfer (HGT) is believed to be ubiquitous among bacteria, and plays a major role in their genome diversification as well as their ability to develop resistance to antibiotics. In light of its evolutionary significance and implications for human health, developing accurate and efficient methods for detecting and reconstructing HGT is imperative. RESULTS In this article we provide a new HGT-oriented likelihood framework for many problems that involve phylogeny-based HGT detection and reconstruction. Besides the formulation of various likelihood criteria, we show that most of these problems are NP-hard, and offer heuristics for efficient and accurate reconstruction of HGT under these criteria. We implemented our heuristics and used them to analyze biological as well as synthetic data. In both cases, our criteria and heuristics exhibited very good performance with respect to identifying the correct number of HGT events as well as inferring their correct location on the species tree. AVAILABILITY Implementation of the criteria as well as heuristics and hardness proofs are available from the authors upon request. Hardness proofs can also be downloaded at http://www.cs.tau.ac.il/~tamirtul/MLNET/Supp-ML.pdf
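As a rough illustration of the kind of criterion involved: a network N induces a set of trees, one for each way of resolving its HGT edges, and a likelihood for the network can be defined by letting each alignment site pick its best induced tree. The notation below is an illustrative sketch, not the paper's exact formulation (the authors define and compare several related criteria).

```latex
% Illustrative sketch only: S = (s_1, \dots, s_m) is the alignment,
% \mathcal{T}(N) the set of trees induced by the network N, and
% \theta_T the parameters (e.g. branch lengths) associated with tree T.
L(N \mid S) \;=\; \prod_{i=1}^{m} \; \max_{T \in \mathcal{T}(N)} \; P\!\left(s_i \mid T, \theta_T\right)
```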


Proceedings of the Third Conference on Partitioned Global Address Space Programing Models | 2009

A new vision for coarray Fortran

John M. Mellor-Crummey; Laksono Adhianto; William N. Scherer; Guohua Jin

In 1998, Numrich and Reid proposed Coarray Fortran as a simple set of extensions to Fortran 95 [7]. Their principal extension to Fortran was support for shared data known as coarrays. In 2005, the Fortran Standards Committee began exploring the addition of coarrays to Fortran 2008, which is now being finalized. Careful review of drafts of the emerging Fortran 2008 standard led us to identify several shortcomings with the proposed coarray extensions. In this paper, we briefly critique the coarray extensions proposed for Fortran 2008, outline a new vision for coarrays in the Fortran language that is far more expressive, and briefly describe our strategy for implementing the language extensions that we propose.


computational systems bioinformatics | 2005

Reconstructing phylogenetic networks using maximum parsimony

Luay Nakhleh; Guohua Jin; Fengmei Zhao; John M. Mellor-Crummey

Phylogenies, the evolutionary histories of groups of organisms, are one of the most widely used tools throughout the life sciences, as well as objects of research within systematics, evolutionary biology, epidemiology, etc. Almost every tool devised to date to reconstruct phylogenies produces trees; yet it is widely understood and accepted that trees oversimplify the evolutionary histories of many groups of organisms, most prominently bacteria (because of horizontal gene transfer) and plants (because of hybrid speciation). Various methods and criteria have been introduced for phylogenetic tree reconstruction. Parsimony is one of the most widely used and studied criteria, and various accurate and efficient heuristics for reconstructing trees based on parsimony have been devised. Jotun Hein suggested a straightforward extension of the parsimony criterion to phylogenetic networks. In this paper we formalize this concept, and provide the first experimental study of the quality of parsimony as a criterion for constructing and evaluating phylogenetic networks. Our results show that, when extended to phylogenetic networks, the parsimony criterion produces promising results. In a great majority of the cases in our experiments, the parsimony criterion accurately predicts the numbers and placements of non-tree events.
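In rough terms, Hein's extension charges each alignment site the parsimony cost of the best tree contained in the network and sums those costs over sites. The formula below is an illustrative sketch of that idea in generic notation, not a quotation of the paper's definition.

```latex
% Sketch: \mathcal{T}(N) is the set of trees induced by the network N,
% S = (s_1, \dots, s_m) the alignment, and PS(T, s_i) the ordinary
% (small) parsimony cost of site s_i on tree T.
\mathrm{PS}(N, S) \;=\; \sum_{i=1}^{m} \; \min_{T \in \mathcal{T}(N)} \mathrm{PS}(T, s_i)
```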


Bioinformatics | 2007

Efficient parsimony-based methods for phylogenetic network reconstruction

Guohua Jin; Luay Nakhleh; Sagi Snir; Tamir Tuller

MOTIVATION Phylogenies, the evolutionary histories of groups of organisms, play a major role in representing relationships among biological entities. Although many biological processes can be effectively modeled as tree-like relationships, others, such as hybrid speciation and horizontal gene transfer (HGT), result in networks, rather than trees, of relationships. Hybrid speciation is a significant evolutionary mechanism in plants, fish and other groups of species. HGT plays a major role in bacterial genome diversification and is a significant mechanism by which bacteria develop resistance to antibiotics. Maximum parsimony is one of the most commonly used criteria for phylogenetic tree inference. Roughly speaking, inference based on this criterion seeks the tree that minimizes the amount of evolution. In 1990, Jotun Hein proposed using this criterion for inferring the evolution of sequences subject to recombination. Preliminary results on small synthetic datasets (Nakhleh et al., 2005) demonstrated the criterion's application to phylogenetic network reconstruction in general and HGT detection in particular. However, the naive algorithms used by the authors are inapplicable to large datasets due to their demanding computational requirements. Further, no rigorous theoretical analysis of computing the criterion was given, nor was it tested on biological data. RESULTS In the present work we prove that the problem of scoring the parsimony of a phylogenetic network is NP-hard and provide an improved fixed parameter tractable algorithm for it. Further, we devise efficient heuristics for parsimony-based reconstruction of phylogenetic networks. We test our methods on both synthetic and biological data (rbcL gene in bacteria) and obtain very promising results.
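A minimal sketch of how such a score can be computed in the naive case is given below: Fitch's algorithm scores each site on each tree induced by the network, and the site is charged the cheapest tree. The tuple-based tree encoding, the convention of passing the induced trees in explicitly, and the toy data are assumptions made for illustration; the paper's fixed-parameter tractable algorithm and heuristics avoid this naive enumeration.

```python
# Illustrative sketch of parsimony scoring on a phylogenetic network,
# following the "minimum over induced trees, summed over sites" idea.
# The tree encoding, the explicit list of induced trees, and the toy
# data are hypothetical; they are not the paper's data structures.

def fitch_site(tree, states):
    """Fitch small-parsimony cost of one site on a rooted binary tree.

    tree   : a leaf name (str) or a pair (left_subtree, right_subtree)
    states : dict mapping leaf name -> character state at this site
    Returns (candidate_state_set, cost).
    """
    if isinstance(tree, str):                       # leaf
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:                                      # agreement: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def network_parsimony(induced_trees, alignment):
    """Sum over sites of the minimum Fitch cost over the induced trees."""
    taxa = list(alignment)
    length = len(next(iter(alignment.values())))
    total = 0
    for i in range(length):
        site = {t: alignment[t][i] for t in taxa}
        total += min(fitch_site(tree, site)[1] for tree in induced_trees)
    return total

# Toy example: a 4-taxon species tree plus the alternative tree that a
# single HGT edge would induce.
species_tree = (("A", "B"), ("C", "D"))
hgt_tree = (("A", "C"), ("B", "D"))
alignment = {"A": "AAG", "B": "AAG", "C": "ATG", "D": "TTG"}
print(network_parsimony([species_tree, hgt_tree], alignment))  # prints 2
```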


ACM Transactions on Mathematical Software | 2005

SFCGen: A framework for efficient generation of multi-dimensional space-filling curves by recursion

Guohua Jin; John M. Mellor-Crummey

Because they are continuous and self-similar, space-filling curves have been widely used in mathematics to transform multi-dimensional problems into one-dimensional forms. For scientific applications, reordering computation by certain space-filling curves can significantly improve data reuse because of the locality properties of these curves. However, when space-filling curves are used in programs for reordering data, traversal or indexing of the curves must be efficient. To address this problem, we present the table-driven framework SFCGen to efficiently generate multi-dimensional space-filling curves on the fly. The framework is general and easy enough to be used in any application that can be partitioned recursively in multiple dimensions. We describe a movement specification table, a universal turtle algorithm to enumerate points along a space-filling curve, a table-based indexing algorithm to transform coordinates of a point into its position along the curve and an algorithm to pregenerate the table automatically. As examples, we show how high-dimensional Hilbert, Morton, and Peano curves and a two-dimensional Sierpiński curve can be generated with our algorithms. We present performance results for Hilbert, Morton, and Peano curves and compare the efficiency of our curve generation algorithm with the most recent work on generating Hilbert curves. Our experimental results on three modern microprocessor-based platforms show that SFCGen performs up to 63% faster than the most recent recursive algorithm on 2D curve generation and up to a factor of 132 faster than two previous byte-oriented non-recursive implementations. On curve indexing, SFCGen performs as much as a factor of three faster than the byte-oriented implementation. Our results on 4D space-filling curves also show that SFCGen scales very well with curve level for higher dimensional spaces.
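For readers unfamiliar with curve indexing, the sketch below computes a Morton (Z-order) index by plain bit interleaving; it is only meant to illustrate what transforming the coordinates of a point into its position along a curve means. It is not SFCGen's table-driven scheme, and the bit-ordering convention (x as the least significant dimension) is an arbitrary choice for the example.

```python
# Morton (Z-order) index by bit interleaving -- a simple stand-in to
# illustrate curve indexing; SFCGen itself uses a pregenerated
# movement/indexing table rather than this per-bit loop.

def morton_index(coords, bits):
    """Interleave the bits of a point's coordinates.

    coords : tuple of non-negative ints, one per dimension
    bits   : bits per coordinate (the curve level)
    """
    index = 0
    ndims = len(coords)
    for b in range(bits):
        for d, c in enumerate(coords):
            index |= ((c >> b) & 1) << (b * ndims + d)
    return index

# The four cells of a 2x2 grid, visited in Z order: indices 0..3.
for y in range(2):
    for x in range(2):
        print((x, y), morton_index((x, y), bits=1))
```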


conference on high performance computing (supercomputing) | 1998

High Performance Fortran Compilation Techniques for Parallelizing Scientific Codes

Vikram S. Adve; Guohua Jin; John M. Mellor-Crummey; Qing Yi

With current compilers for High Performance Fortran (HPF), substantial restructuring and hand-optimization may be required to obtain acceptable performance from an HPF port of an existing Fortran application. A key goal of the Rice dHPF compiler project is to develop optimization techniques that can provide consistently high performance for a broad spectrum of scientific applications with minimal restructuring of existing Fortran 77 or Fortran 90 applications. This paper presents four new optimization techniques we developed to support efficient parallelization of codes with minimal restructuring. These optimizations include computation partition selection for loop nests that use privatizable arrays, along with partial replication of boundary computations to reduce communication overhead; communication-sensitive loop distribution to eliminate inner-loop communications; interprocedural selection of computation partitions; and data availability analysis to eliminate redundant communications. We studied the effectiveness of the dHPF compiler, which incorporates these optimizations, in parallelizing serial versions of the NAS SP and BT application benchmarks. We present experimental results comparing the performance of hand-written MPI code for the benchmarks against code generated from HPF using the dHPF compiler and the Portland Group's pghpf compiler. Using the compilation techniques described in this paper we achieve performance within 15% of hand-written MPI code on 25 processors for BT and within 33% for SP. Furthermore, these results are obtained with HPF versions of the benchmarks that were created with minimal restructuring of the serial code (modifying only approximately 5% of the code).


IEEE/ACM Transactions on Computational Biology and Bioinformatics | 2009

Parsimony Score of Phylogenetic Networks: Hardness Results and a Linear-Time Heuristic

Guohua Jin; Luay Nakhleh; Sagi Snir; Tamir Tuller

Phylogenies, the evolutionary histories of groups of organisms, play a major role in representing the interrelationships among biological entities. Many methods for reconstructing and studying such phylogenies have been proposed, almost all of which assume that the underlying history of a given set of species can be represented by a binary tree. Although many biological processes can be effectively modeled and summarized in this fashion, others cannot: recombination, hybrid speciation, and horizontal gene transfer result in networks of relationships rather than trees of relationships. In previous works, we formulated a maximum parsimony (MP) criterion for reconstructing and evaluating phylogenetic networks, and demonstrated its quality on biological as well as synthetic data sets. In this paper, we provide further theoretical results as well as a very fast heuristic algorithm for the MP criterion of phylogenetic networks. In particular, we provide a novel combinatorial definition of phylogenetic networks in terms of "forbidden cycles," and provide detailed hardness and hardness of approximation proofs for the "small" MP problem. We demonstrate the performance of our heuristic in terms of time and accuracy on both biological and synthetic data sets. Finally, we explain the difference between our model and a similar one formulated by Nguyen et al., and describe the implications of this difference on the hardness and approximation results.


Concurrency and Computation: Practice and Experience | 2002

Advanced Optimization Strategies in the Rice dHPF Compiler

John M. Mellor-Crummey; Vikram S. Adve; Bradley Broom; Daniel G. Chavarría-Miranda; Robert J. Fowler; Guohua Jin; Ken Kennedy; Qing Yi

High-Performance Fortran (HPF) was envisioned as a vehicle for modernizing legacy Fortran codes to achieve scalable parallel performance. To a large extent, today's commercially available HPF compilers have failed to deliver scalable parallel performance for a broad spectrum of applications because of insufficiently powerful compiler analysis and optimization. Substantial restructuring and hand-optimization can be required to achieve acceptable performance with an HPF port of an existing Fortran application, even for regular data-parallel applications. A key goal of the Rice dHPF compiler project has been to develop optimization techniques that enable a wide range of existing scientific applications to be ported easily to efficient HPF with minimal restructuring. This paper describes the challenges to effective parallelization presented by complex (but regular) data-parallel applications, and then describes how the novel analysis and optimization technologies in the dHPF compiler address these challenges effectively, without major rewriting of the applications. We illustrate the techniques by describing their use for parallelizing the NAS SP and BT benchmarks. The dHPF compiler generates multipartitioned parallelizations of these codes that are approaching the scalability and efficiency of sophisticated hand-coded parallelizations.


international parallel and distributed processing symposium | 2011

Implementation and Performance Evaluation of the HPC Challenge Benchmarks in Coarray Fortran 2.0

Guohua Jin; John M. Mellor-Crummey; Laksono Adhianto; William N. Scherer; Chaoran Yang

Today's largest supercomputers have over two hundred thousand CPU cores and even larger systems are under development. Typically, these systems are programmed using message passing. Over the past decade, there has been considerable interest in developing simpler and more expressive programming models for them. Partitioned global address space (PGAS) languages are viewed as perhaps the most promising alternative. In this paper, we report on our experience developing a set of PGAS extensions to Fortran that we call Coarray Fortran 2.0 (CAF 2.0). Our design for CAF 2.0 goes well beyond the original 1998 design of Coarray Fortran (CAF) by Numrich and Reid. CAF 2.0 includes language support for many features including teams, collective communication, asynchronous communication, function shipping, and synchronization. We describe the implementation of these features and our experiences using them to implement the High Performance Computing Challenge (HPCC) benchmarks, including High Performance Linpack (HPL), Random Access, Fast Fourier Transform (FFT), and STREAM triad. On 4096 CPU cores of a Cray XT with 2.3 GHz single-socket quad-core Opteron processors, we achieved 18.3 TFLOP/s with HPL, 2.01 GUP/s with Random Access, 125 GFLOP/s with FFT, and a bandwidth of 8.73 TByte/s with STREAM triad.


BMC Evolutionary Biology | 2010

Bootstrap-based Support of HGT Inferred by Maximum Parsimony

Hyun Jung Park; Guohua Jin; Luay Nakhleh

BACKGROUND Maximum parsimony is one of the most commonly used criteria for reconstructing phylogenetic trees. Recently, Nakhleh and co-workers extended this criterion to enable reconstruction of phylogenetic networks, and demonstrated its application to detecting reticulate evolutionary relationships. However, one of the major problems with this extension has been that it favors more complex evolutionary relationships over simpler ones, thus having the potential for overestimating the amount of reticulation in the data. An ad hoc solution to this problem that has been used entails inspecting the improvement in the parsimony length as more reticulation events are added to the model, and stopping when the improvement is below a certain threshold. RESULTS In this paper, we address this problem in a more systematic way, by proposing a nonparametric bootstrap-based measure of support of inferred reticulation events, and using it to determine the number of those events, as well as their placements. A number of samples is generated from the given sequence alignment, and reticulation events are inferred based on each sample. Finally, the support of each reticulation event is quantified based on the inferences made over all samples. CONCLUSIONS We have implemented our method in the NEPAL software tool (available publicly at http://bioinfo.cs.rice.edu/), and studied its performance on both biological and simulated data sets. While our studies show very promising results, they also highlight issues that are inherently challenging when applying the maximum parsimony criterion to detect reticulate evolution.
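The bootstrap procedure itself is simple to sketch. In the snippet below, infer_hgt_events is a hypothetical stand-in for the parsimony-based network inference (NEPAL's actual interface is not reproduced here); only the column resampling and the support bookkeeping are illustrated.

```python
# Sketch of bootstrap support for inferred HGT/reticulation events.
# infer_hgt_events() is a caller-supplied, hypothetical stand-in for the
# parsimony-based inference step; it should return a collection of
# hashable event descriptors (e.g. (donor_edge, recipient_edge) pairs).
import random
from collections import Counter

def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement (nonparametric bootstrap)."""
    taxa = list(alignment)
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {t: "".join(alignment[t][i] for i in columns) for t in taxa}

def hgt_support(alignment, infer_hgt_events, n_replicates=100, seed=0):
    """Fraction of bootstrap replicates in which each event is inferred."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_replicates):
        sample = bootstrap_alignment(alignment, rng)
        for event in infer_hgt_events(sample):
            counts[event] += 1
    return {event: c / n_replicates for event, c in counts.items()}
```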

Collaboration


Dive into Guohua Jin's collaborations.

Top Co-Authors

Daniel G. Chavarría-Miranda

Pacific Northwest National Laboratory
