Rob Egan
Lawrence Berkeley National Laboratory
Publications
Featured research published by Rob Egan.
Science | 2011
Matthias Hess; Alexander Sczyrba; Rob Egan; Tae Wan Kim; Harshal A. Chokhawala; Gary P. Schroth; Shujun Luo; Douglas S. Clark; Feng Chen; Tao Zhang; Roderick I. Mackie; Len A. Pennacchio; Susannah G. Tringe; Axel Visel; Tanja Woyke; Zhong Wang; Edward M. Rubin
Metagenomic sequencing of biomass-degrading microbes from cow rumen reveals new carbohydrate-active enzymes.
The paucity of enzymes that efficiently deconstruct plant polysaccharides represents a major bottleneck for industrial-scale conversion of cellulosic biomass into biofuels. Cow rumen microbes specialize in degradation of cellulosic plant material, but most members of this complex community resist cultivation. To characterize biomass-degrading genes and genomes, we sequenced and analyzed 268 gigabases of metagenomic DNA from microbes adherent to plant fiber incubated in cow rumen. From these data, we identified 27,755 putative carbohydrate-active genes and expressed 90 candidate proteins, of which 57% were enzymatically active against cellulosic substrates. We also assembled 15 uncultured microbial genomes, which were validated by complementary methods including single-cell genome sequencing. These data sets provide a substantially expanded catalog of genes and genomes participating in the deconstruction of cellulosic biomass.
PeerJ | 2015
Dongwan D. Kang; Jeff Froula; Rob Egan; Zhong Wang
Grouping large genomic fragments assembled from shotgun metagenomic sequences to deconvolute complex microbial communities, or metagenome binning, enables the study of individual organisms and their interactions. Because of the complex nature of these communities, existing metagenome binning methods often miss a large number of microbial species. In addition, most of the tools are not scalable to large datasets. Here we introduce automated software called MetaBAT that integrates empirical probabilistic distances of genome abundance and tetranucleotide frequency for accurate metagenome binning. MetaBAT outperforms alternative methods in accuracy and computational efficiency on both synthetic and real metagenome datasets. It automatically forms hundreds of high-quality genome bins on a very large assembly consisting of millions of contigs in a matter of hours on a single node. MetaBAT is open source software and available at https://bitbucket.org/berkeleylab/metabat.
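The tetranucleotide frequency signal that MetaBAT combines with abundance can be illustrated with a minimal sketch. This is not MetaBAT's implementation (the tool uses empirical probabilistic distances over these vectors); it only shows how a contig's composition profile is computed.

```python
from collections import Counter
from itertools import product

def tetranucleotide_frequencies(seq):
    """Return the normalized 256-dimensional tetranucleotide frequency
    vector of a contig. Simplified composition signal; MetaBAT layers
    empirical probabilistic distance models on top of vectors like this."""
    seq = seq.upper()
    kmers = (seq[i:i + 4] for i in range(len(seq) - 3))
    counts = Counter(k for k in kmers if set(k) <= set("ACGT"))
    total = sum(counts.values()) or 1
    # Fixed ordering over all 256 tetranucleotides so vectors are comparable
    alphabet = ["".join(p) for p in product("ACGT", repeat=4)]
    return [counts[k] / total for k in alphabet]
```

Distances between such vectors (together with per-sample coverage) let contigs from the same genome cluster into a bin.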
ieee international conference on high performance computing data and analytics | 2015
Evangelos Georganas; Aydin Buluç; Jarrod Chapman; Steven A. Hofmeyr; Chaitanya Aluru; Rob Egan; Leonid Oliker; Daniel Rokhsar; Katherine A. Yelick
De novo whole genome assembly reconstructs genomic sequences from short, overlapping, and potentially erroneous DNA segments and is one of the most important computations in modern genomics. This work presents HipMer, the first high-quality end-to-end de novo assembler designed for extreme-scale analysis, via efficient parallelization of the Meraculous code. First, we significantly improve scalability of parallel k-mer analysis for complex repetitive genomes that exhibit skewed frequency distributions. Next, we optimize the traversal of the de Bruijn graph of k-mers by employing a novel communication-avoiding parallel algorithm in a variety of use-case scenarios. Finally, we parallelize the Meraculous scaffolding modules by leveraging the one-sided communication capabilities of Unified Parallel C (UPC) while effectively mitigating load imbalance. Large-scale results on a Cray XC30 using grand-challenge genomes demonstrate efficient performance and scalability on thousands of cores. Overall, our pipeline accelerates Meraculous performance by orders of magnitude, enabling the complete assembly of the human genome in just 8.4 minutes on 15K cores of the Cray XC30, and creating unprecedented capability for extreme-scale genomic analysis.
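The de Bruijn graph traversal that HipMer parallelizes can be sketched serially. This toy version assumes error-free reads and omits the distributed hash table and communication-avoiding traversal that are the paper's actual contributions; it only shows the underlying graph structure and the walk along unique extensions.

```python
from collections import defaultdict

def build_de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that
    extend it. Serial, error-free sketch of the graph HipMer builds with
    a distributed hash table across nodes."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unitig(graph, start):
    """Follow unique extensions from `start` to emit one contig (unitig);
    traversal stops at a branch, a dead end, or a cycle."""
    contig, node = start, start
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        contig += nxt[-1]
        node = nxt
        if node == start:  # guard against walking a cycle forever
            break
    return contig
```

In HipMer, the interesting part is that `graph` is sharded across thousands of cores, so each `graph[node]` lookup is potentially a remote operation, which is why communication-avoiding traversal matters.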
bioRxiv | 2016
Marcus H. Stoiber; Joshua Quick; Rob Egan; Ji Eun Lee; Susan E. Celniker; Robert K. Neely; Nicholas J. Loman; Len A. Pennacchio; James B. Brown
Advances in single molecule sequencing technology have enabled the investigation of the full catalogue of covalent DNA modifications. We present an assay, Modified DNA sequencing (MoD-seq), that leverages raw nanopore data processing, visualization and statistical testing to directly survey DNA modifications without the need for a large prior training dataset. We present case studies applying MoD-seq to identify three distinct marks, 4mC, 5mC, and 6mA, and demonstrate quantitative reproducibility across biological replicates processed in different labs. In a ground-truth dataset created via in vitro treatment of synthetic DNA with selected methylases, we show that modifications can be detected in a variety of distinct sequence contexts. We recapitulated known methylation patterns and frequencies in E. coli, and propose a pipeline for the comprehensive discovery of DNA modifications in a genome without a priori knowledge of their chemical identities.
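The per-position statistical testing idea behind MoD-seq can be illustrated with a simple two-sample statistic on raw current levels. This is a hypothetical stand-in, not the assay's actual test: it compares native reads against an unmodified control at one genomic position, whereas the published pipeline adds signal processing and genome-wide multiple-testing control.

```python
import math
from statistics import mean, stdev

def signal_shift_zscore(native, control):
    """Welch-style z statistic comparing raw nanopore current levels at a
    single genomic position between a native sample and an unmodified
    control. A large |z| suggests a base modification shifted the signal."""
    se = math.sqrt(stdev(native) ** 2 / len(native)
                   + stdev(control) ** 2 / len(control))
    return (mean(native) - mean(control)) / se
```

Repeating such a test at every position, then correcting for multiple testing, yields candidate modified sites without any prior training data, which is the training-free property the abstract highlights.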
bioRxiv | 2014
Dongwan Don Kang; Jeff Froula; Rob Egan; Zhong Wang
We present software that reconstructs genomes from shotgun metagenomic sequences using a reference-independent approach. This method permits the identification of OTUs in large complex communities where many species are unknown. Binning reduces the complexity of a metagenomic dataset, enabling many downstream analyses previously unavailable. In this study we developed MetaBAT, a robust statistical framework that integrates probabilistic distances of genome abundance with sequence composition for automatic binning. Applying MetaBAT to a human gut microbiome dataset identified 173 highly specific genome bins, including many representing previously unidentified species.
european conference on parallel processing | 2017
Marquita Ellis; Evangelos Georganas; Rob Egan; Steven A. Hofmeyr; Aydin Buluç; Brandon Cook; Leonid Oliker; Katherine A. Yelick
De novo genome assembly is one of the most important and challenging computational problems in modern genomics; further, it shares algorithms and communication patterns important to other graph analytic and irregular applications. Unlike simulations, it has no floating point arithmetic and is dominated by small memory transactions within and between computing nodes. In this work, we focus on the highly scalable HipMer assembler and identify the dominant algorithms and communication patterns, also using microbenchmarks to capture the workload. We evaluate HipMer on a variety of platforms from the latest HPC systems to ethernet clusters. HipMer performs well on all single node systems, including the Xeon Phi manycore architecture. Given large enough problems, it also demonstrates excellent scaling across nodes in an HPC system, but requires a high speed network with low overhead and high injection rates. Our results shed light on the architectural features that are most important for achieving good parallel efficiency on this and related problems.
Proceedings of the Second Annual PGAS Applications Workshop on | 2017
Evangelos Georganas; Marquita Ellis; Rob Egan; Steven A. Hofmeyr; Aydin Buluç; Brandon Cook; Leonid Oliker; Katherine A. Yelick
De novo genome assembly is one of the most important and challenging computational problems in modern genomics; further, it shares algorithms and communication patterns important to other graph analytic and irregular applications. Unlike simulations, it has no floating point arithmetic and is dominated by small memory transactions within and between computing nodes. In this work, we introduce MerBench, a compact set of PGAS benchmarks that capture the communication patterns of the parallel algorithms throughout HipMer, a parallel genome assembler pipeline that has been shown to scale to massive concurrencies. We also present results of these microbenchmarks on the Edison supercomputer and illustrate how these empirical results can be used to assess the scaling behavior of the pipeline.
international conference on e-science | 2013
Taghrid Samak; Rob Egan; Brian Bushnell; Daniel K. Gunter; Alex Copeland; Zhong Wang
In this work we describe a method to automatically detect errors in de novo assembled genomes. The method extends a Bayesian assembly quality evaluation framework, ALE, which computes the likelihood of an assembly given a set of unassembled data. Starting from ALE output, this method applies outlier detection algorithms to identify the precise locations of assembly errors. We show results from a microbial genome with manually curated assembly errors. Our method detects all deletions, 82.3% of insertions, and 88.8% of single base substitutions. It was also able to detect an inversion error that spans more than 400 bases.
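The outlier-detection step layered on ALE's per-base likelihood scores can be sketched with a robust z-score. This is an illustrative stand-in (the paper evaluates several detection algorithms and thresholds, not necessarily this one): positions whose likelihood drops far below the genome-wide typical value are flagged as candidate assembly errors.

```python
from statistics import median

def flag_error_regions(scores, threshold=3.5):
    """Flag positions whose per-base likelihood score is an outlier under
    the modified z-score (median absolute deviation). Illustrative sketch
    of outlier detection over ALE output; not the paper's exact method."""
    med = median(scores)
    mad = median(abs(s - med) for s in scores) or 1e-9
    return [i for i, s in enumerate(scores)
            if abs(0.6745 * (s - med) / mad) > threshold]
```

Using median and MAD rather than mean and standard deviation keeps the baseline stable even when the assembly contains a handful of severe errors, which is exactly the regime in which these regions must be localized.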
Bioinformatics | 2013
Scott C. Clark; Rob Egan; Peter I. Frazier; Zhong Wang
Archive | 2014
Dongwan D. Kang; Jeff Froula; Rob Egan; Zhong Wang