Publication


Featured research published by Seung-Jin Sul.


Asia-Pacific Bioinformatics Conference | 2007

A Randomized Algorithm for Comparing Sets of Phylogenetic Trees

Seung-Jin Sul; Tiffani L. Williams

Phylogenetic analyses often produce a large number of candidate evolutionary trees, each a hypothesis of the “true” tree. Post-processing techniques such as strict consensus trees are widely used to summarize the evolutionary relationships into a single tree. However, valuable information is lost during the summarization process. A more elementary step is to produce estimates of the topological differences that exist among all pairs of trees. We design a new randomized algorithm, called Hash-RF, that computes the all-to-all Robinson-Foulds (RF) distance, the most common distance metric for comparing two phylogenetic trees. Our approach uses a hash table to organize the bipartitions of a tree, and a universal hashing function makes our algorithm randomized. We compare the performance of our Hash-RF algorithm to PAUP*’s implementation of computing the all-to-all RF distance matrix. Our experiments focus on the algorithmic performance of comparing sets of biological trees, where the size of each tree ranged from 500 to 2,000 taxa and the collection of trees varied from 200 to 1,000 trees. Our experimental results clearly show that our Hash-RF algorithm is up to 500 times faster than PAUP*’s approach. Thus, Hash-RF provides an efficient alternative to a single tree summary of a collection of trees and potentially gives researchers the ability to explore their data in new and interesting ways.
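The hash-table idea in this abstract can be illustrated with a minimal sketch. It is not the authors' Hash-RF implementation: it assumes trees are given as nested tuples of taxon names, it uses Python's built-in dictionary with exact `frozenset` keys instead of a universal hash function over bit strings (so it is not randomized), and it reports the RF distance as the unhalved size of the symmetric difference of the two bipartition sets (some papers divide this count by two). The names `leaf_set`, `bipartitions`, and `rf_matrix` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def leaf_set(tree):
    """Return the set of taxon names below a nested-tuple tree node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def bipartitions(tree):
    """Collect the non-trivial bipartitions of a tree.

    Each bipartition is stored as the side that does NOT contain the
    alphabetically smallest taxon, so equal splits get equal keys.
    """
    taxa = leaf_set(tree)
    anchor = min(taxa)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        clade = frozenset().union(*(visit(child) for child in node))
        side = taxa - clade if anchor in clade else clade
        # Skip trivial splits (single leaves, the root, and their complements).
        if 2 <= len(side) <= len(taxa) - 2:
            splits.add(side)
        return clade

    visit(tree)
    return splits


def rf_matrix(trees):
    """All-to-all RF distances via a hash table of bipartitions."""
    split_sets = [bipartitions(t) for t in trees]

    # Hash table: bipartition -> indices of the trees that contain it.
    table = defaultdict(list)
    for idx, splits in enumerate(split_sets):
        for split in splits:
            table[split].append(idx)

    # Every pair of trees in the same bucket shares that bipartition.
    n = len(trees)
    shared = [[0] * n for _ in range(n)]
    for owners in table.values():
        for i, j in combinations(owners, 2):
            shared[i][j] += 1
            shared[j][i] += 1

    return [
        [
            0 if i == j
            else len(split_sets[i]) + len(split_sets[j]) - 2 * shared[i][j]
            for j in range(n)
        ]
        for i in range(n)
    ]


t1 = (("A", "B"), ("C", ("D", "E")))
t2 = (("A", "C"), ("B", ("D", "E")))
t3 = (("A", "B"), ("D", ("C", "E")))
print(rf_matrix([t1, t2, t3]))  # [[0, 2, 2], [2, 0, 4], [2, 4, 0]]
```

The point of the table is that only trees sharing a bipartition are ever paired, so no explicit tree-by-tree comparison of bipartition lists is needed.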


European Symposium on Algorithms | 2008

An Experimental Analysis of Robinson-Foulds Distance Matrix Algorithms

Seung-Jin Sul; Tiffani L. Williams

In this paper, we study two fast algorithms--HashRF and PGM-Hashed--for computing the Robinson-Foulds (RF) distance matrix between a collection of evolutionary trees. The RF distance matrix represents a tremendous data-mining opportunity for helping biologists understand the evolutionary relationships depicted among their trees. The novelty of our work results from using a variety of different architecture- and implementation-independent measures (i.e., percentage of bipartition sharing, number of bipartition comparisons, and memory usage) in addition to CPU time to explore practical algorithmic performance. Overall, our study concludes that HashRF performs better across the various performance measures than its competitor, PGM-Hashed. Thus, the HashRF algorithm provides scientists with a fast approach for understanding the evolutionary relationships among a set of trees.


International Symposium on Bioinformatics Research and Applications | 2009

An Experimental Analysis of Consensus Tree Algorithms for Large-Scale Tree Collections

Seung-Jin Sul; Tiffani L. Williams

Consensus trees are a popular approach for summarizing the shared evolutionary relationships in a collection of trees. Many popular techniques such as Bayesian analyses produce results that can contain tens of thousands of trees to summarize. We develop a fast consensus algorithm called HashCS to construct large-scale consensus trees. We perform an extensive empirical study comparing the performance of several consensus tree algorithms implemented in widely used phylogenetic software such as PAUP* and MrBayes. Our collections of biological and artificial trees range from 128 to 16,384 trees on 128 to 1,024 taxa. Experimental results show that our HashCS approach is up to 100 times faster than MrBayes and up to 9 times faster than PAUP*. Fast consensus algorithms such as HashCS can be used in a variety of ways, such as in real time to detect whether a phylogenetic search has converged.
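As a rough illustration of how consensus relationships can be found by counting, the sketch below tallies bipartitions across a collection and keeps those that meet a strict or majority threshold. It is not the HashCS algorithm itself: it assumes each tree has already been reduced to a set of bipartitions (each a `frozenset` of taxon names), it returns the selected bipartitions instead of building the consensus tree, and the name `consensus_splits` is hypothetical.

```python
from collections import Counter


def consensus_splits(tree_splits, strict=False):
    """Select the bipartitions that belong in a consensus tree.

    tree_splits: one collection of bipartitions per input tree.
    strict=True keeps bipartitions present in every tree;
    strict=False keeps those present in more than half of the trees.
    """
    n = len(tree_splits)
    # set(...) ensures a bipartition is counted at most once per tree.
    counts = Counter(split for splits in tree_splits for split in set(splits))
    if strict:
        return {split for split, c in counts.items() if c == n}
    return {split for split, c in counts.items() if c > n / 2}


trees = [
    {frozenset("DE"), frozenset("CDE")},
    {frozenset("DE"), frozenset("BDE")},
    {frozenset("CDE"), frozenset("CE")},
]
majority = consensus_splits(trees)
strict = consensus_splits(trees, strict=True)
```

Here `majority` contains the two bipartitions that appear in two of the three trees, and `strict` is empty because no bipartition appears in all three.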


BMC Bioinformatics | 2009

Using tree diversity to compare phylogenetic heuristics

Seung-Jin Sul; Suzanne J. Matthews; Tiffani L. Williams

Background: Evolutionary trees are family trees that represent the relationships between a group of organisms. Phylogenetic heuristics are used to search stochastically for the best-scoring trees in tree space. Given that better tree scores are believed to be better approximations of the true phylogeny, traditional evaluation techniques have used tree scores to determine the heuristics that find the best scores in the fastest time. We develop new techniques to evaluate phylogenetic heuristics based on both tree scores and topologies to compare Pauprat and Rec-I-DCM3, two popular Maximum Parsimony search algorithms. Results: Our results show that although Pauprat and Rec-I-DCM3 find the trees with the same best scores, topologically these trees are quite different. Furthermore, the Rec-I-DCM3 trees cluster distinctly from the Pauprat trees. In addition to our heatmap visualizations, which use parsimony scores and the Robinson-Foulds distance to compare the best-scoring trees found by the two heuristics, we also develop entropy-based methods to show the diversity of the trees found. Overall, Pauprat identifies more diverse trees than Rec-I-DCM3. Conclusion: Overall, our work shows that there is value in comparing heuristics beyond the parsimony scores that they find. Pauprat is a slower heuristic than Rec-I-DCM3. However, our work shows that there is tremendous value in using Pauprat to reconstruct trees, especially since it finds identically scoring but topologically distinct trees. Hence, instead of discounting Pauprat, effort should go into improving its implementation. Ultimately, improved performance measures lead to better phylogenetic heuristics and will result in better approximations of the true evolutionary history of the organisms of interest.


Workshop on Algorithms in Bioinformatics | 2008

Efficiently Computing Arbitrarily-Sized Robinson-Foulds Distance Matrices

Seung-Jin Sul; Grant R. Brammer; Tiffani L. Williams

In this paper, we introduce the HashRF(p,q) algorithm for computing RF matrices of large collections of binary evolutionary trees. The novelty of our algorithm is that it can be used to compute arbitrarily-sized (p×q) RF matrices without running into physical memory limitations. We explore the performance of our HashRF(p,q) approach on collections of 20,000 and 33,306 biological trees with 150 and 567 taxa, respectively, collected from a Bayesian analysis. When computing the all-to-all RF matrix, HashRF(p,q) is up to 200 times faster than PAUP* and around 40% faster than HashRF, one of the fastest all-to-all RF algorithms. We show an application of our approach by clustering large RF matrices to improve the resolution rate of consensus trees, a popular approach used by biologists to summarize the results of their phylogenetic analysis. Thus, our HashRF(p,q) algorithm provides scientists with a fast and efficient alternative for understanding the evolutionary relationships among a set of trees.
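One way to picture a p×q RF computation that never holds the full matrix in memory is to hash the q column trees once and then emit one row at a time. The sketch below is an illustration under stated assumptions, not the HashRF(p,q) algorithm from the paper: each tree is assumed to be given as a set of bipartitions (`frozenset` objects), distances are the unhalved symmetric-difference counts, and the name `rf_rows` is hypothetical.

```python
from collections import defaultdict


def rf_rows(row_trees, col_trees):
    """Yield the rows of a p x q RF matrix one at a time.

    row_trees, col_trees: sequences of bipartition sets, one per tree.
    Only the hash table for the q column trees and a single row of
    length q are held in memory at any moment.
    """
    col_sets = [set(splits) for splits in col_trees]

    # Hash table: bipartition -> indices of column trees containing it.
    table = defaultdict(list)
    for j, splits in enumerate(col_sets):
        for split in splits:
            table[split].append(j)

    for splits in row_trees:
        splits = set(splits)
        shared = [0] * len(col_sets)
        for split in splits:
            for j in table.get(split, ()):
                shared[j] += 1
        yield [
            len(splits) + len(col_sets[j]) - 2 * shared[j]
            for j in range(len(col_sets))
        ]


a = {frozenset("CDE"), frozenset("DE")}
b = {frozenset("BDE"), frozenset("DE")}
c = {frozenset("CDE"), frozenset("CE")}
block = list(rf_rows([a, b], [a, b, c]))  # a 2 x 3 block of the matrix
```

Because rows are produced lazily, a caller could write each row to disk or feed it to a clustering step before the next one is computed.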


International Symposium on Bioinformatics Research and Applications | 2010

A novel approach for compressing phylogenetic trees

Suzanne J. Matthews; Seung-Jin Sul; Tiffani L. Williams

Phylogenetic trees are tree structures that depict relationships between organisms. Popular analysis techniques often produce large collections of candidate trees, which are expensive to store. We introduce TreeZip, a novel algorithm to compress phylogenetic trees based on their shared evolutionary relationships. We evaluate TreeZip's performance on fourteen tree collections ranging from 2,505 trees on 328 taxa to 150,000 trees on 525 taxa, corresponding to 0.6 MB to 434 MB in storage. Our results show that TreeZip is very effective, typically compressing a tree file to less than 2% of its original size. When coupled with standard compression methods such as 7zip, TreeZip can compress a file to less than 1% of its original size. Our results strongly suggest that TreeZip is very effective at compressing phylogenetic trees, which allows for easier exchange of data with colleagues around the world.


Bioinformatics and Biomedicine | 2008

New Approaches to Compare Phylogenetic Search Heuristics

Seung-Jin Sul; Suzanne J. Matthews; Tiffani L. Williams

We present new insights into the behavior of two maximum parsimony heuristics for building evolutionary trees of different sizes. First, our results show that the heuristics find different classes of good-scoring trees, where the different classes of trees may have significant evolutionary implications. Second, we develop a new entropy-based measure to quantify the diversity among the evolutionary trees found by the heuristics. Overall, topological distance measures such as the Robinson-Foulds distance identify more diversity among a collection of trees than parsimony scores, which implies that more powerful heuristics could be designed that use a combination of parsimony scores and topological distances. Thus, by understanding phylogenetic heuristic behavior, better heuristics could be designed, which ultimately leads to more accurate evolutionary trees.
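The abstract does not define its entropy-based measure, so the following is only one plausible reading, offered as a sketch: the Shannon entropy of the distribution of distinct topologies in a collection. It assumes each tree is represented by a hashable topology key (for example, a `frozenset` of its bipartitions), and the name `topology_entropy` is hypothetical.

```python
import math
from collections import Counter


def topology_entropy(topologies):
    """Shannon entropy (in bits) of the topologies in a tree collection.

    topologies: iterable of hashable keys, one per tree; trees with the
    same topology must map to equal keys. The result is zero when every
    tree is identical and log2(n) when all n trees are distinct.
    """
    counts = Counter(topologies)
    total = sum(counts.values())
    return -sum(
        (c / total) * math.log2(c / total) for c in counts.values()
    )


uniform_pair = topology_entropy(["t1", "t1", "t2", "t2"])
all_distinct = topology_entropy(["t1", "t2", "t3", "t4"])
all_same = topology_entropy(["t1"] * 5)
```

Under this reading, a heuristic that returns many copies of a few topologies scores lower than one that returns many different topologies, even if all trees have the same parsimony score.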


Journal of Computational Biology | 2011

Big Cat Phylogenies, Consensus Trees, and Computational Thinking

Seung-Jin Sul; Tiffani L. Williams

Phylogenetics seeks to deduce the pattern of relatedness between organisms by using a phylogeny or evolutionary tree. For a given set of organisms or taxa, there may be many evolutionary trees depicting how these organisms evolved from a common ancestor. As a result, consensus trees are a popular approach for summarizing the shared evolutionary relationships in a group of trees. We examine these consensus techniques by studying how the pantherine lineage of cats (clouded leopard, jaguar, leopard, lion, snow leopard, and tiger) evolved, a question that is hotly debated. While there are many phylogenetic resources that describe consensus trees, there is very little information, written for biologists, regarding the underlying computational techniques for building them. The pantherine cats provide us with a small, relevant example to explore the computational techniques (such as sorting numbers, hashing functions, and traversing trees) for constructing consensus trees. Our hope is that life scientists enjoy peeking under the computational hood of consensus tree construction and share their positive experiences with others in their community.


Advances in Experimental Medicine and Biology | 2010

Large-Scale Analysis of Phylogenetic Search Behavior

Hyun Jung Park; Seung-Jin Sul; Tiffani L. Williams

Phylogenetic analysis is used in all branches of biology with applications ranging from studies on the origin of human populations to investigations of the transmission patterns of HIV. Most phylogenetic analyses rely on effective heuristics for obtaining accurate trees. However, relatively little work has been done to analyze quantitatively the behavior of phylogenetic heuristics in tree space. A better understanding of local search behavior can facilitate the design of better heuristics, which ultimately leads to more accurate depictions of the true evolutionary relationships. In this paper, we present new insights into local search behavior for maximum parsimony on three biological datasets consisting of 44, 60, and 174 taxa. By analyzing all trees from a search, we find that, as the search algorithm climbs the hill to local optima, the trees in the neighborhood surrounding the current solution improve as well. Furthermore, the search is quite robust to a small number of randomly selected neighbors. Thus, our work shows how to gain insights into the behavior of local search algorithms by exploring a large, diverse collection of trees.


Evolutionary Bioinformatics | 2011

A New Support Measure to Quantify the Impact of Local Optima in Phylogenetic Analyses

Grant R. Brammer; Seung-Jin Sul; Tiffani L. Williams

Phylogenetic analyses are often incorrectly assumed to have stabilized to a single optimum. However, a set of trees from a phylogenetic analysis may contain multiple distinct local optima, with each optimum providing different levels of support for each clade. For situations with multiple local optima, we propose p-support, a clade support measure that shows the impact optima have on a final consensus tree. Our p-support measure is implemented in our PeakMapper software package. We study our approach on two published, large-scale biological tree collections. PeakMapper shows that each dataset contains multiple local optima, and p-support shows that both datasets contain clades in the majority consensus tree that are only supported by a subset of the local optima. Clades with low p-support are most likely to benefit from further investigation. These tools provide researchers with new information regarding phylogenetic analyses beyond what is provided by other support measures alone.

Collaboration


Dive into Seung-Jin Sul's collaborations.

Top Co-Authors

Suzanne J. Matthews

United States Military Academy

Hyun Jung Park

Baylor College of Medicine
