Wolfgang Gerlach
Argonne National Laboratory
Publication
Featured research published by Wolfgang Gerlach.
Nucleic Acids Research | 2016
Andreas Wilke; Jared Bischof; Wolfgang Gerlach; Elizabeth M. Glass; Travis Harrison; Kevin P. Keegan; Tobias Paczian; William L. Trimble; Saurabh Bagchi; Somali Chaterji; Folker Meyer
MG-RAST (http://metagenomics.anl.gov) is an open-submission data portal for processing, analyzing, sharing and disseminating metagenomic datasets. The system currently hosts over 200 000 datasets and is continuously updated. The volume of submissions has increased 4-fold over the past 24 months, now averaging 4 terabasepairs per month. In addition to several new features, we report changes to the analysis workflow and the technologies used to scale the pipeline up to the required throughput levels. To show possible uses for the data from MG-RAST, we present several examples integrating data and analyses from MG-RAST into popular third-party analysis tools or sequence alignment tools.
Proceedings of the 5th International Workshop on Data-Intensive Computing in the Clouds | 2014
Wolfgang Gerlach; Wei Tang; Kevin P. Keegan; Travis Harrison; Andreas Wilke; Jared Bischof; Mark D'Souza; Scott Devoid; Daniel Murphy-Olson; Narayan Desai; Folker Meyer
Recently, Linux container technology has been gaining attention as it promises to transform the way software is developed and deployed. The portability and ease of deployment make Linux containers an ideal technology to be used in scientific workflow platforms. Skyport utilizes Docker Linux containers to solve software deployment problems and resource utilization inefficiencies inherent to all existing scientific workflow platforms. As an extension to AWE/Shock, our data analysis platform that provides scalable workflow execution environments for scientific data in the cloud, Skyport greatly reduces the complexity associated with providing the environment necessary to execute complex workflows.
PLOS Computational Biology | 2015
Andreas Wilke; Jared Bischof; Travis Harrison; Tom Brettin; Mark D'Souza; Wolfgang Gerlach; Hunter Matthews; Tobias Paczian; Jared Wilkening; Elizabeth M. Glass; Narayan Desai; Folker Meyer
Metagenomic sequencing has produced significant amounts of data in recent years. For example, as of summer 2013, MG-RAST has been used to annotate over 110,000 data sets totaling over 43 Terabases. With metagenomic sequencing finding even wider adoption in the scientific community, the existing web-based analysis tools and infrastructure in MG-RAST provide limited capability for data retrieval and analysis, such as comparative analysis between multiple data sets. Moreover, although the system provides many analysis tools, it is not comprehensive. By opening MG-RAST up via a web services API (application programmers interface) we have greatly expanded access to MG-RAST data, as well as provided a mechanism for the use of third-party analysis tools with MG-RAST data. This RESTful API makes all data and data objects created by the MG-RAST pipeline accessible as JSON objects. As part of the DOE Systems Biology Knowledgebase project (KBase, http://kbase.us) we have implemented a web services API for MG-RAST. This API complements the existing MG-RAST web interface and constitutes the basis of KBase's microbial community capabilities. In addition, the API exposes a comprehensive collection of data to programmers. This API, which uses a RESTful (Representational State Transfer) implementation, is compatible with most programming environments and should be easy to use for end users and third parties. It provides comprehensive access to sequence data, quality control results, annotations, and many other data types. Where feasible, we have used standards to expose data and metadata. Code examples are provided in a number of languages both to show the versatility of the API and to provide a starting point for users. We present an API that exposes the data in MG-RAST for consumption by our users, greatly enhancing the utility of the MG-RAST service.
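The request-and-decode pattern described in this abstract (a RESTful URL per resource, with the result returned as a JSON object) can be sketched in a few lines of Python. This is only an illustrative sketch, not the MG-RAST client code: the base URL, the resource name, the identifier, and the query parameter below are all hypothetical placeholders, and the response body is made up so that the example runs without network access.

```python
import json
from urllib.parse import quote, urlencode

# Hypothetical base URL for illustration only; the real endpoints are
# listed in the MG-RAST API documentation.
BASE_URL = "https://api.example.org"


def build_url(resource, identifier=None, **params):
    """Compose a REST-style URL: base/resource[/identifier][?query]."""
    parts = [BASE_URL.rstrip("/"), quote(resource, safe="")]
    if identifier is not None:
        parts.append(quote(str(identifier), safe=""))
    url = "/".join(parts)
    if params:
        # Sort the parameters so the resulting URL is deterministic.
        url += "?" + urlencode(sorted(params.items()))
    return url


def parse_object(body):
    """Decode a JSON response body and require a JSON object (dict)."""
    obj = json.loads(body)
    if not isinstance(obj, dict):
        raise ValueError("expected a JSON object")
    return obj


# Offline usage: the resource, identifier, and body are invented examples.
url = build_url("metagenome", "example-id", verbosity="minimal")
sample_body = '{"id": "example-id", "name": "demo sample"}'
record = parse_object(sample_body)
print(url)
print(record["name"])
```

In practice the URL would be fetched with an HTTP client and the response body passed to `parse_object`; keeping URL construction and decoding separate makes both steps easy to test without contacting a server.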
bioRxiv | 2016
Adam P. Arkin; Rick Stevens; Robert W. Cottingham; Sergei Maslov; Christopher S. Henry; Paramvir Dehal; Doreen Ware; Fernando Perez; Nomi L. Harris; Shane Canon; Michael W Sneddon; Matthew L Henderson; William J Riehl; Dan Gunter; Dan Murphy-Olson; Stephen Chan; Roy T Kamimura; Thomas S Brettin; Folker Meyer; Dylan Chivian; David J. Weston; Elizabeth M. Glass; Brian H. Davison; Sunita Kumari; Benjamin H Allen; Jason K. Baumohl; Aaron A. Best; Ben Bowen; Steven E. Brenner; Christopher C Bun
The U.S. Department of Energy Systems Biology Knowledgebase (KBase) is an open-source software and data platform designed to meet the grand challenge of systems biology — predicting and designing biological function from the biomolecular (small scale) to the ecological (large scale). KBase is available for anyone to use, and enables researchers to collaboratively generate, test, compare, and share hypotheses about biological functions; perform large-scale analyses on scalable computing infrastructure; and combine experimental evidence and conclusions that lead to accurate models of plant and microbial physiology and community dynamics. The KBase platform has (1) extensible analytical capabilities that currently include genome assembly, annotation, ontology assignment, comparative genomics, transcriptomics, and metabolic modeling; (2) a web-browser-based user interface that supports building, sharing, and publishing reproducible and well-annotated analyses with integrated data; (3) access to extensive computational resources; and (4) a software development kit allowing the community to add functionality to the system.
International Conference on Big Data | 2013
Wei Tang; Jared Wilkening; Narayan Desai; Wolfgang Gerlach; Andreas Wilke; Folker Meyer
With the advent of high-throughput DNA sequencing technology, the analysis and management of the increasing amount of biological sequence data has become a bottleneck for scientific progress. For example, MG-RAST, a metagenome annotation system serving a large scientific community worldwide, has experienced a sustained, exponential growth in data submissions for several years; and this trend is expected to continue. To address the computational challenges posed by this workload, we developed a new data analysis platform, including a data management system (Shock) for biological sequence data and a workflow management system (AWE) supporting scalable, fault-tolerant task and resource management. Shock and AWE can be used to build a scalable and reproducible data analysis infrastructure for upper-level biological data analysis services.
Methods in Enzymology | 2013
Andreas Wilke; Elizabeth M. Glass; Daniela Bartels; Jared Bischof; Daniel Braithwaite; Mark D’Souza; Wolfgang Gerlach; Travis Harrison; Kevin P. Keegan; Hunter Matthews; Renzo Kottmann; Tobias Paczian; Wei Tang; William L. Trimble; Pelin Yilmaz; Jared Wilkening; Narayan Desai; Folker Meyer
The democratized world of sequencing is leading to numerous data analysis challenges; MG-RAST addresses many of these challenges for diverse datasets, including amplicon datasets, shotgun metagenomes, and metatranscriptomes. The changes from version 2 to version 3 include the addition of a dedicated gene calling stage using FragGeneScan, clustering of predicted proteins at 90% identity, and the use of BLAT for the computation of similarities. Together with changes in the underlying software infrastructure, this has enabled the dramatic scaling up of pipeline throughput while remaining on a limited hardware budget. The Web-based service allows upload, fully automated analysis, and visualization of results. As a result of the plummeting cost of sequencing and the readily available analytical power of MG-RAST, over 78,000 metagenomic datasets have been analyzed, with over 12,000 of them publicly available in MG-RAST.
International Conference on Big Data | 2014
Wei Tang; Jared Bischof; Narayan Desai; Kanak Mahadik; Wolfgang Gerlach; Travis Harrison; Andreas Wilke; Folker Meyer
The cost of DNA sequencing has plummeted in recent years. The consequent data deluge has imposed heavy burdens on data analysis applications. For example, MG-RAST, a production open-public metagenome annotation service, has experienced an increasingly large volume of data submissions and has demanded scalable resources for its computational needs. To address this problem, we have developed a scalable platform to port MG-RAST workloads into the cloud, where elastic computing resources can be used on demand. To efficiently utilize such resources, however, one must understand the characteristics of the application workloads. In this paper, we characterize the MG-RAST workloads running in the cloud, from the perspectives of computation, I/O, and data transfer. Insights from this work will help guide application enhancement, service operation, and resource management for MG-RAST and similar big data applications demanding elastic computing resources.
IEEE International Conference on Cloud Engineering | 2015
Wolfgang Gerlach; Wei Tang; Andreas Wilke; Dan Olson; Folker Meyer
Recently, Linux container technology has been gaining attention as it promises to transform the way software is developed and deployed. The portability and ease of deployment make Linux containers an ideal technology to be used in scientific workflow platforms. AWE/Shock is a scalable data analysis platform designed to execute data-intensive scientific workflows. Recently we introduced Skyport, an extension to AWE/Shock that uses Docker container technology to orchestrate and automate the deployment of individual workflow tasks onto the worker machines. The installation of software in independent execution environments for each task reduces complexity and offers an elegant solution to installation problems such as library version conflicts. The systematic use of isolated execution environments for workflow tasks also offers a convenient and simple mechanism to reproduce scientific results.
Nature Biotechnology | 2018
Adam P. Arkin; Robert W. Cottingham; Christopher S. Henry; Nomi L. Harris; Rick Stevens; Sergei Maslov; Paramvir Dehal; Doreen Ware; Fernando Perez; Shane Canon; Michael W Sneddon; Matthew L Henderson; William J Riehl; Dan Murphy-Olson; Stephen Chan; Roy T Kamimura; Sunita Kumari; Meghan M Drake; Thomas Brettin; Elizabeth M. Glass; Dylan Chivian; Dan Gunter; David J. Weston; Benjamin H Allen; Jason K. Baumohl; Aaron A. Best; Ben Bowen; Steven E. Brenner; Christopher C Bun; John-Marc Chandonia
Author(s): Arkin, Adam P; Cottingham, Robert W; Henry, Christopher S; Harris, Nomi L; Stevens, Rick L; Maslov, Sergei; Dehal, Paramvir; Ware, Doreen; Perez, Fernando; Canon, Shane; Sneddon, Michael W; Henderson, Matthew L; Riehl, William J; Murphy-Olson, Dan; Chan, Stephen Y; Kamimura, Roy T; Kumari, Sunita; Drake, Meghan M; Brettin, Thomas S; Glass, Elizabeth M; Chivian, Dylan; Gunter, Dan; Weston, David J; Allen, Benjamin H; Baumohl, Jason; Best, Aaron A; Bowen, Ben; Brenner, Steven E; Bun, Christopher C; Chandonia, John-Marc; Chia, Jer-Ming; Colasanti, Ric; Conrad, Neal; Davis, James J; Davison, Brian H; DeJongh, Matthew; Devoid, Scott; Dietrich, Emily; Dubchak, Inna; Edirisinghe, Janaka N; Fang, Gang; Faria, Jose P; Frybarger, Paul M; Gerlach, Wolfgang; Gerstein, Mark; Greiner, Annette; Gurtowski, James; Haun, Holly L; He, Fei; Jain, Rashmi; Joachimiak, Marcin P; Keegan, Kevin P; Kondo, Shinnosuke; Kumar, Vivek; Land, Miriam L; Meyer, Folker; Mills, Marissa; Novichkov, Pavel S; Oh, Taeyun; Olsen, Gary J; Olson, Robert; Parrello, Bruce; Pasternak, Shiran; Pearson, Erik; Poon, Sarah S; Price, Gavin A; Ramakrishnan, Srividya; Ranjan, Priya; Ronald, Pamela C; Schatz, Michael C; Seaver, Samuel MD; Shukla, Maulik; Sutormin, Roman A; Syed, Mustafa H; Thomason, James; Tintle, Nathan L; Wang, Daifeng; Xia, Fangfang; Yoo, Hyunseung; Yoo, Shinjae; Yu, Dantong
Briefings in Bioinformatics | 2017
Folker Meyer; Saurabh Bagchi; Somali Chaterji; Wolfgang Gerlach; Travis Harrison; Tobias Paczian; William L. Trimble; Andreas Wilke
As technologies change, MG-RAST is adapting. Newly available software is being included to improve accuracy and performance. As a computational service constantly running large-volume scientific workflows, MG-RAST is the right location to perform benchmarking and implement algorithmic or platform improvements, in many cases involving trade-offs between specificity, sensitivity and run-time cost. The work in [Glass EM, Dribinsky Y, Yilmaz P, et al. ISME J 2014;8:1-3] is an example; we use existing well-studied data sets as gold standards representing different environments and different technologies to evaluate any changes to the pipeline. Currently, we use well-understood data sets in MG-RAST as a platform for benchmarking. The use of artificial data sets for pipeline performance optimization has not added value, as these data sets do not present the same challenges as real-world data sets. In addition, the MG-RAST team welcomes suggestions for improvements of the workflow. We are currently working on versions 4.02 and 4.1, both of which contain significant input from the community and our partners that will enable double barcoding, stronger inferences supported by longer-read technologies, and will increase throughput while maintaining sensitivity by using Diamond and SortMeRNA. On the technical platform side, the MG-RAST team intends to support the Common Workflow Language as a standard to specify bioinformatics workflows, to facilitate both the development and the efficient high-performance implementation of the community's data analysis tasks.