
Publication


Featured research published by Travis Harrison.


BMC Bioinformatics | 2012

The M5nr: a novel non-redundant database containing protein sequences and annotations from multiple sources and associated tools.

Andreas Wilke; Travis Harrison; Jared Wilkening; Dawn Field; Elizabeth M. Glass; Nikos C. Kyrpides; Konstantinos Mavrommatis; Folker Meyer

Background: Computing sequence similarity results is becoming a limiting factor in metagenome analysis. Sequence similarity search results encoded in an open, exchangeable format have the potential to limit the need for computational reanalysis of these data sets. A prerequisite for sharing similarity results is a common reference.

Description: We introduce a mechanism for automatically maintaining a comprehensive, non-redundant protein database and for creating a quarterly release of this resource. In addition, we present tools for translating similarity searches into many annotation namespaces, e.g. KEGG or NCBI's GenBank.

Conclusions: The data and tools we present allow the creation of multiple result sets using a single computation, permitting computational results to be shared between groups for large sequence data sets.
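The M5nr's key design point is that each non-redundant protein sequence is identified by a checksum of the sequence itself, so one similarity computation can be translated into any supported annotation namespace. A minimal sketch of that lookup pattern, using MD5 as the checksum; the sequences and annotation entries below are invented for illustration, not real M5nr records:

```python
import hashlib

def seq_md5(seq: str) -> str:
    """Checksum of a protein sequence, used as the namespace-neutral key."""
    return hashlib.md5(seq.upper().encode()).hexdigest()

# Toy annotation store: one checksum -> entries in several namespaces.
# All sequences and identifiers here are invented examples.
ANNOTATIONS = {
    seq_md5("MKTAYIAKQR"): {
        "KEGG": ["K00001"],
        "GenBank": ["ABC12345.1"],
    },
}

def translate_hit(seq: str, namespace: str):
    """Translate a similarity hit into a chosen annotation namespace."""
    return ANNOTATIONS.get(seq_md5(seq), {}).get(namespace, [])
```

Because the similarity search is keyed to the checksum rather than to one database's identifiers, the same result set can be re-interpreted in a different namespace without recomputing anything.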


PLOS Computational Biology | 2012

A Platform-Independent Method for Detecting Errors in Metagenomic Sequencing Data: DRISEE

Kevin P. Keegan; William L. Trimble; Jared Wilkening; Andreas Wilke; Travis Harrison; Mark D'Souza; Folker Meyer

We provide a novel method, DRISEE (duplicate read inferred sequencing error estimation), to assess sequencing quality (alternatively referred to as “noise” or “error”) within and/or between sequencing samples. DRISEE provides positional error estimates that can be used to inform read trimming within a sample. It also provides global (whole sample) error estimates that can be used to identify samples with high or varying levels of sequencing error that may confound downstream analyses, particularly in the case of studies that utilize data from multiple sequencing samples. For shotgun metagenomic data, we believe that DRISEE provides estimates of sequencing error that are more accurate and less constrained by technical limitations than existing methods that rely on reference genomes or the use of quality scores (e.g. Phred). Here, DRISEE is applied to (non-amplicon) data sets from both the 454 and Illumina platforms. The DRISEE error estimate is obtained by analyzing sets of artifactual duplicate reads (ADRs), a known by-product of both sequencing platforms. We present DRISEE as an open-source, platform-independent method to assess sequencing error in shotgun metagenomic data, and utilize it to discover previously uncharacterized error in de novo sequence data from the 454 and Illumina sequencing platforms.
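The core idea, estimating error from artifactual duplicate reads, can be sketched as: group reads sharing an identical prefix, take the per-position majority base within each group, and count disagreements as errors. A simplified illustration (the prefix length and reads are invented; this is not the actual DRISEE implementation):

```python
from collections import Counter, defaultdict

def drisee_sketch(reads, prefix_len=5):
    """Rough positional error rates from duplicate-read groups.

    Reads sharing an identical prefix are treated as artifactual
    duplicates; disagreement with the per-position majority base
    within a group is counted as an error.
    """
    groups = defaultdict(list)
    for r in reads:
        groups[r[:prefix_len]].append(r)

    pos_err, pos_tot = Counter(), Counter()
    for members in groups.values():
        if len(members) < 2:
            continue  # need duplicates to infer error
        length = min(len(r) for r in members)
        for i in range(length):
            bases = [r[i] for r in members]
            consensus, _ = Counter(bases).most_common(1)[0]
            pos_err[i] += sum(b != consensus for b in bases)
            pos_tot[i] += len(bases)
    return {i: pos_err[i] / pos_tot[i] for i in pos_tot}
```

The real method is considerably more careful about how ADR sets are identified per platform, but the grouping-and-consensus pattern is the heart of it.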


Nucleic Acids Research | 2016

The MG-RAST metagenomics database and portal in 2015

Andreas Wilke; Jared Bischof; Wolfgang Gerlach; Elizabeth M. Glass; Travis Harrison; Kevin P. Keegan; Tobias Paczian; William L. Trimble; Saurabh Bagchi; Somali Chaterji; Folker Meyer

MG-RAST (http://metagenomics.anl.gov) is an open-submission data portal for processing, analyzing, sharing and disseminating metagenomic datasets. The system currently hosts over 200 000 datasets and is continuously updated. The volume of submissions has increased 4-fold over the past 24 months, now averaging 4 terabasepairs per month. In addition to several new features, we report changes to the analysis workflow and the technologies used to scale the pipeline up to the required throughput levels. To show possible uses for the data from MG-RAST, we present several examples integrating data and analyses from MG-RAST into popular third-party analysis tools or sequence alignment tools.


Proceedings of the 5th International Workshop on Data-Intensive Computing in the Clouds | 2014

Skyport: container-based execution environment management for multi-cloud scientific workflows

Wolfgang Gerlach; Wei Tang; Kevin P. Keegan; Travis Harrison; Andreas Wilke; Jared Bischof; Mark D'Souza; Scott Devoid; Daniel Murphy-Olson; Narayan Desai; Folker Meyer

Recently, Linux container technology has been gaining attention as it promises to transform the way software is developed and deployed. The portability and ease of deployment makes Linux containers an ideal technology to be used in scientific workflow platforms. Skyport utilizes Docker Linux containers to solve software deployment problems and resource utilization inefficiencies inherent to all existing scientific workflow platforms. As an extension to AWE/Shock, our data analysis platform that provides scalable workflow execution environments for scientific data in the cloud, Skyport greatly reduces the complexity associated with providing the environment necessary to execute complex workflows.
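The execution pattern Skyport describes, packaging each workflow step's software environment in a container, can be sketched by wrapping a step in a `docker run` invocation with the job directory mounted in. A minimal, hypothetical sketch (image name, paths, and command are invented; this is not the AWE/Shock code):

```python
import shlex

def containerized_step(image, command, workdir="/data", host_dir="/tmp/job"):
    """Build a `docker run` command that executes one workflow step
    inside a container, mounting the host job directory for I/O.
    The image name, paths, and command here are illustrative only."""
    return [
        "docker", "run", "--rm",       # remove the container when done
        "-v", f"{host_dir}:{workdir}",  # mount job data into the container
        "-w", workdir,                  # run the step from the data directory
        image,
    ] + shlex.split(command)
```

A worker would then hand the resulting argument list to `subprocess.run`; because the tool and its dependencies live in the image, the worker host needs only Docker itself.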


PLOS Computational Biology | 2015

A RESTful API for accessing microbial community data for MG-RAST

Andreas Wilke; Jared Bischof; Travis Harrison; Tom Brettin; Mark D'Souza; Wolfgang Gerlach; Hunter Matthews; Tobias Paczian; Jared Wilkening; Elizabeth M. Glass; Narayan Desai; Folker Meyer

Metagenomic sequencing has produced significant amounts of data in recent years. For example, as of summer 2013, MG-RAST has been used to annotate over 110,000 data sets totaling over 43 Terabases. With metagenomic sequencing finding even wider adoption in the scientific community, the existing web-based analysis tools and infrastructure in MG-RAST provide limited capability for data retrieval and analysis, such as comparative analysis between multiple data sets. Moreover, although the system provides many analysis tools, it is not comprehensive. By opening MG-RAST up via a web services API (application programming interface) we have greatly expanded access to MG-RAST data, as well as provided a mechanism for the use of third-party analysis tools with MG-RAST data. This RESTful API makes all data and data objects created by the MG-RAST pipeline accessible as JSON objects. As part of the DOE Systems Biology Knowledgebase project (KBase, http://kbase.us) we have implemented a web services API for MG-RAST. This API complements the existing MG-RAST web interface and constitutes the basis of KBase's microbial community capabilities. In addition, the API exposes a comprehensive collection of data to programmers. This API, which uses a RESTful (Representational State Transfer) implementation, is compatible with most programming environments and should be easy to use for end users and third parties. It provides comprehensive access to sequence data, quality control results, annotations, and many other data types. Where feasible, we have used standards to expose data and metadata. Code examples are provided in a number of languages both to show the versatility of the API and to provide a starting point for users. We present an API that exposes the data in MG-RAST for consumption by our users, greatly enhancing the utility of the MG-RAST service.
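Since the API exposes pipeline data as JSON over REST, a client interaction reduces to building a resource URL and parsing the response. A hedged sketch of that pattern; the base URL, endpoint shape, and metagenome identifier below are assumptions modeled on MG-RAST conventions and should be checked against the current API documentation:

```python
from urllib.parse import urlencode

API_BASE = "https://api.mg-rast.org/1"  # assumed base URL; verify against current docs

def metagenome_url(mgid, verbosity="minimal"):
    """Construct a REST URL for one metagenome resource as JSON.
    Endpoint path and the `verbosity` parameter are illustrative."""
    return f"{API_BASE}/metagenome/{mgid}?{urlencode({'verbosity': verbosity})}"

# Fetching would then be, e.g.:
#   import json, urllib.request
#   data = json.load(urllib.request.urlopen(metagenome_url("mgm4440026.3")))
```

Because the response is plain JSON, the same call works from any language with an HTTP client, which is the portability argument the abstract makes.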


Methods in Enzymology | 2013

A Metagenomics Portal for a Democratized Sequencing World

Andreas Wilke; Elizabeth M. Glass; Daniela Bartels; Jared Bischof; Daniel Braithwaite; Mark D’Souza; Wolfgang Gerlach; Travis Harrison; Kevin P. Keegan; Hunter Matthews; Renzo Kottmann; Tobias Paczian; Wei Tang; William L. Trimble; Pelin Yilmaz; Jared Wilkening; Narayan Desai; Folker Meyer

The democratized world of sequencing is leading to numerous data analysis challenges; MG-RAST addresses many of these challenges for diverse datasets, including amplicon datasets, shotgun metagenomes, and metatranscriptomes. The changes from version 2 to version 3 include the addition of a dedicated gene calling stage using FragGeneScan, clustering of predicted proteins at 90% identity, and the use of BLAT for the computation of similarities. Together with changes in the underlying software infrastructure, this has enabled the dramatic scaling up of pipeline throughput while remaining on a limited hardware budget. The Web-based service allows upload, fully automated analysis, and visualization of results. As a result of the plummeting cost of sequencing and the readily available analytical power of MG-RAST, over 78,000 metagenomic datasets have been analyzed, with over 12,000 of them publicly available in MG-RAST.
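The protein-clustering step (90% identity) can be illustrated with a greedy sketch: each sequence joins the first existing cluster whose representative it matches at or above the threshold, otherwise it seeds a new cluster. The identity measure here (position-wise matches over the shorter length) is a toy stand-in, not the metric the actual pipeline uses:

```python
def identity(a, b):
    """Fraction of matching positions over the shorter sequence (toy metric)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n if n else 0.0

def greedy_cluster(seqs, threshold=0.9):
    """Greedy clustering: join the first representative within the
    threshold, otherwise start a new cluster. Illustrative only,
    not the MG-RAST implementation."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return clusters
```

The payoff of clustering before similarity search is that only one representative per cluster needs to be compared against the reference database, which is how the pipeline trades a small loss of resolution for a large throughput gain.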


International Conference on Big Data | 2014

Workload characterization for MG-RAST metagenomic data analytics service in the cloud

Wei Tang; Jared Bischof; Narayan Desai; Kanak Mahadik; Wolfgang Gerlach; Travis Harrison; Andreas Wilke; Folker Meyer

The cost of DNA sequencing has plummeted in recent years. The consequent data deluge has imposed heavy burdens on data analysis applications. For example, MG-RAST, a production, openly accessible metagenome annotation service, has experienced increasingly large volumes of data submissions and demands scalable resources to meet its computational needs. To address this problem, we have developed a scalable platform to port MG-RAST workloads into the cloud, where elastic computing resources can be used on demand. To efficiently utilize such resources, however, one must understand the characteristics of the application workloads. In this paper, we characterize the MG-RAST workloads running in the cloud from the perspectives of computation, I/O, and data transfer. Insights from this work will help guide application enhancement, service operation, and resource management for MG-RAST and similar big data applications demanding elastic computing resources.


Standards in Genomic Sciences | 2014

Metazen - metadata capture for metagenomes.

Jared Bischof; Travis Harrison; Tobias Paczian; Elizabeth M. Glass; Andreas Wilke; Folker Meyer

Background: As the impact and prevalence of large-scale metagenomic surveys grow, so does the acute need for more complete and standards-compliant metadata. Metadata (data describing data) provides an essential complement to experimental data, helping to answer questions about its source, mode of collection, and reliability. Metadata collection and interpretation have become vital to the genomics and metagenomics communities, but considerable challenges remain, including exchange, curation, and distribution. Currently, tools are available for capturing basic field metadata during sampling, and for storing, updating and viewing it. Unfortunately, these tools are not specifically designed for metagenomic surveys; in particular, they lack the appropriate metadata collection templates, a centralized storage repository, and a unique ID linking system that can be used to easily port complete and compatible metagenomic metadata into widely used assembly and sequence analysis tools.

Results: Metazen was developed as a comprehensive framework designed to enable metadata capture for metagenomic sequencing projects. Specifically, Metazen provides a rapid, easy-to-use portal to encourage early deposition of project and sample metadata.

Conclusions: Metazen is an interactive tool that aids users in recording their metadata in a complete and valid format. A defined set of mandatory fields captures vital information, while the option to add fields provides flexibility.
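The "mandatory fields plus optional extensions" design described in the conclusions can be sketched as a small validator. The field names below are hypothetical, not Metazen's actual template (its real templates follow community metadata standards):

```python
# Hypothetical mandatory metadata fields for a metagenome sample.
MANDATORY = {"project_name", "sample_name", "collection_date",
             "latitude", "longitude"}

def validate_metadata(record: dict):
    """Return (is_valid, missing_fields). Extra fields are allowed,
    mirroring the mandatory-plus-optional design described above."""
    missing = sorted(MANDATORY - record.keys())
    return (not missing, missing)
```

Rejecting a record only for missing mandatory fields, while passing arbitrary extra fields through, is what lets a single template serve both minimal and richly annotated submissions.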


Briefings in Bioinformatics | 2017

MG-RAST version 4—lessons learned from a decade of low-budget ultra-high-throughput metagenome analysis

Folker Meyer; Saurabh Bagchi; Somali Chaterji; Wolfgang Gerlach; Travis Harrison; Tobias Paczian; William L. Trimble; Andreas Wilke

As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large-volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [Glass EM, Dribinsky Y, Yilmaz P, et al. ISME J 2014;8:1-3] is an example; we use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value, as these data sets do not present the same challenges as real-world data sets. In addition, the MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners; they will enable double barcoding, support stronger inferences from longer-read technologies, and increase throughput while maintaining sensitivity by using Diamond and SortMeRNA. On the technical platform side, the MG-RAST team intends to support the Common Workflow Language as a standard to specify bioinformatics workflows, both to facilitate development and efficient high-performance implementation of the community's data analysis tasks.


Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference | 2017

Rafiki: a middleware for parameter tuning of NoSQL datastores for dynamic metagenomics workloads

Ashraf Mahgoub; Paul Wood; Sachandhan Ganesh; Subrata Mitra; Wolfgang Gerlach; Travis Harrison; Folker Meyer; Saurabh Bagchi; Somali Chaterji

High performance computing (HPC) applications, such as metagenomics and other big data systems, need to store and analyze huge volumes of semi-structured data. Such applications often rely on NoSQL-based datastores, and optimizing these databases is a challenging endeavor, with over 50 configuration parameters in Cassandra alone. As the application executes, database workloads can change rapidly from read-heavy to write-heavy ones, and a system tuned with a read-optimized configuration becomes suboptimal when the workload becomes write-heavy. In this paper, we present a method and a system for optimizing NoSQL configurations for Cassandra and ScyllaDB when running HPC and metagenomics workloads. First, we identify the significance of configuration parameters using ANOVA. Next, we apply neural networks using the most significant parameters and their workload-dependent mapping to predict database throughput, as a surrogate model. Then, we optimize the configuration using genetic algorithms on the surrogate to maximize the workload-dependent performance. Using the proposed methodology in our system (Rafiki), we can predict the throughput for unseen workloads and configuration values with an error of 7.5% for Cassandra and 6.9-7.8% for ScyllaDB. Searching the configuration spaces using the trained surrogate models, we achieve performance improvements of 41% for Cassandra and 9% for ScyllaDB over the default configuration with respect to a read-heavy workload, and also significant improvement for mixed workloads. In terms of searching speed, Rafiki, using only 1/10000-th of the searching time of exhaustive search, reaches within 15% and 9.5% of the theoretically best achievable performances for Cassandra and ScyllaDB, respectively, supporting optimizations for highly dynamic workloads.
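The two-stage pipeline the abstract describes (a learned throughput surrogate, then genetic search over configurations) can be shown in miniature. Here a toy quadratic stands in for the neural-network surrogate, and a tiny genetic algorithm searches a single integer parameter; everything is invented for illustration and is not the Rafiki implementation:

```python
import random

def surrogate_throughput(cfg):
    """Toy stand-in for the learned throughput model: peaks at cfg == 32."""
    return -(cfg - 32) ** 2

def genetic_search(fitness, lo=1, hi=128, pop_size=20, generations=60, seed=0):
    """Minimal genetic algorithm: keep the fitter half (elitism),
    refill the population with mutated copies of the survivors."""
    rng = random.Random(seed)
    pop = [rng.randint(lo, hi) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = [min(hi, max(lo, p + rng.randint(-4, 4))) for p in parents]
        pop = parents + children
    return max(pop, key=fitness)
```

Searching the surrogate instead of the live database is what makes the 1/10000-th search-time figure possible: each fitness evaluation is a model prediction, not a benchmark run against Cassandra or ScyllaDB.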

Collaboration

Dive into Travis Harrison's collaborations.

Top Co-Authors (all at Argonne National Laboratory):

- Andreas Wilke
- Folker Meyer
- Jared Bischof
- Wolfgang Gerlach
- Tobias Paczian
- Elizabeth M. Glass
- Jared Wilkening
- William L. Trimble
- Kevin P. Keegan