The MetaSUB Microbiome Core Analysis Pipeline Enables Large Scale Metagenomic Analysis
David C. Danko and Christopher Mason∗
Tri-Institutional Computational Biology & Medicine Program, Cornell University, NY, USA
Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, NY, USA
The Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, NY, USA
∗ Corresponding author
Abstract

Motivation
Accurate data analysis and quality control are critical for metagenomic studies. Though many tools exist to analyze metagenomic data, there is no consistent framework to integrate and run these tools across projects. Currently, computational analysis of metagenomes is time consuming, often misses potentially interesting results, and is difficult to reproduce. Further, comparison between metagenomic studies is hampered by inconsistencies in tools and databases.
Results
We present the MetaSUB Core Analysis Pipeline (CAP), a comprehensive tool to analyze metagenomes and summarize the results of a project. The CAP is designed in a bottom-up fashion to perform QC, preprocessing, and analysis, and even to build relevant databases and install necessary tools.
Availability and Implementation
The CAP is available under an MIT License on GitHub at https://github.com/MetaSUB/CAP2 and on the Python Package Index. Documentation and examples are available on GitHub.
Advances in metagenomics have led to a sharp increase in the number and size of studies. A large number of tools have been built to analyze metagenomes (in particular to generate taxonomic profiles), but there is still wide variation in how these tools are applied. This has created unnecessary friction, as metagenomic studies often develop their pipelines in an ad-hoc fashion and as different choices for implementation (particularly choice of reference database) complicate cross-study comparisons.

The CAP addresses these issues by providing a standardized analysis pipeline and framework, giving researchers an effective starting point from which to launch more custom analyses. The CAP provides consistent naming and file-structure conventions, clear documentation for different pipeline stages, automatic installation of constituent programs, automatic construction of reference databases and indices, project-level summaries of results, a comprehensive versioning system for reproducibility, and an easy-to-use Python interface and API.

Critically, the CAP does not attempt to replace existing analysis tools. Rather, it is a set of standardized practices for metagenomic analysis based largely on existing peer-reviewed tools. The flexible nature of the CAP also means that tools and databases can be swapped or modified with minimal effort.
The CAP is written in Python and can be installed from the Python Package Index using tools like pip or easy_install. Once installed, the CAP can be run from the command line or called from other programs via a Python API. The CAP does not require users to install any constituent programs or databases. All necessary sub-programs are installed through PyPI or Conda, and databases can be downloaded or built from source based on configuration.

To run the CAP a user need only supply a CSV file with sample names and the locations of the forward and reverse FASTQ read files for each sample. Optionally, the user can supply configuration parameters (either through command-line arguments or environment variables) which direct the CAP to install programs and databases in specific locations. If installation of sub-programs through Conda is not available, users can supply custom executables via configuration.

The CAP consists of five stages (Fig 1): a quality control stage (Fig S1), a preprocessing stage (Fig S2), a short read stage (Fig S3), an assembly stage (Fig S4), and a database stage (Fig S5). The database stage consists of building relevant indices for the other stages. Since most users will not want to build databases, the CAP is configured to download versioned databases by default.

The quality control stage consists of three modules: a custom module to count reads and associated statistics (e.g. read length and GC content), FastQC (Andrews, 2010), and MultiQC (Ewels et al., 2016).

The preprocessing stage consists of three modules in series. First, adapters are removed from raw reads using AdapterRemoval (Schubert et al., 2016). Second, reads mapping to the human genome (hg38, with alternate contigs) are removed using Bowtie2 (Langmead and Salzberg, 2013). Third, reads are error corrected using BayesHammer (Nikolenko et al., 2013).

The short read stage follows the preprocessing stage.
It consists of several modules (or small groups of modules) that all start directly from pre-processed reads. Taxonomic profiling is performed by Kraken2 (Wood et al., 2019), using a database based on all of RefSeq microbial, followed by Bracken (Lu et al., 2017). Functional profiling is done with HUMAnN2 (Franzosa et al., 2018) after alignment of reads to UniRef90 using DIAMOND (Buchfink et al., 2014). MicrobeCensus (Nayfach and Pollard, 2015) is used to estimate the average genome size of a microbiome. MASH is used to generate small sketches of the data, which are then used in a custom module to find the similarity to several human microbiomes (Consortium et al., 2012). A custom module is used to generate read statistics.

The assembly stage also follows the preprocessing stage. It consists of assembly using MetaSPAdes (omitting error correction, since that is performed in the preprocessing stage) (Nurk et al., 2017), followed by contig clustering using MetaBAT2 (Kang et al., 2019). ORFs are predicted in contig clusters using Prodigal (Hyatt et al., 2010). Antimicrobial resistance genes are identified on contigs using RGI (Alcock et al., 2020).

After processing, the CAP includes tools to build summary reports for a project. Generally these reports consist of CSV files summarizing the outputs of tools across all samples in the project.
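As a sketch of what such a project-level summary entails, the snippet below merges per-sample taxonomic profiles into one wide table with a row per sample and a column per taxon. The sample names, taxa, and column layout are illustrative assumptions, not the CAP's actual output format.

```python
# Merge per-sample taxonomic profiles (taxon -> relative abundance) into a
# single wide table: one row per sample, one column per taxon observed in
# any sample. All names and values here are toy examples.
profiles = {
    "sample_01": {"Escherichia coli": 0.42, "Bacillus subtilis": 0.10},
    "sample_02": {"Escherichia coli": 0.05, "Cutibacterium acnes": 0.61},
}

def summarize(profiles):
    # Union of all taxa seen across samples, sorted for stable column order.
    taxa = sorted({taxon for profile in profiles.values() for taxon in profile})
    rows = [["sample"] + taxa]
    for name in sorted(profiles):
        # Taxa missing from a sample get an abundance of 0.0.
        rows.append([name] + [profiles[name].get(t, 0.0) for t in taxa])
    return rows

rows = summarize(profiles)
```

A table in this shape can be written out with the standard csv module and loaded directly into downstream analyses across all samples in a project.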
The CAP is composed of modules, each of which corresponds (roughly) to a single sub-program. These modules contain documentation for the specific sub-program and specify its dependencies, outputs, and how the sub-program will be run. Modules are Python classes and may be dynamically modified, e.g. to substitute different module versions or to insert stages for experimentation. This makes it easy to develop and add modules to the CAP.

Figure 1: Overall structure of the pipeline.

Figure 2: Relationship of taxonomic profiles and functional pathways. Position of each point shows the similarity of each sample's taxonomic profile to other samples (nearby samples are more similar, reduced to 2-d with UMAP). Color of each point indicates similarity of functional profiles of samples (reduced to 1-d with UMAP). A rough relationship between taxonomic and functional profiles can be seen.

Figure 3: Relationship between functional pathways and estimated average genome size. The estimated average genome size is positively correlated (rho 0.24, p < 2e-16) with the second principal component of the functional profile.

A major feature of the CAP is its strict multi-layered versioning system. Each module specifies a version number (typically a SemVer) and dependencies. To account for upstream changes, modules also specify a version tree that includes the version numbers of all upstream modules (and their upstream modules, etc). This tree can be accessed either as a human-readable Newick tree or as a version hash. When run, each module records its own version number, the hash of the version tree, and the version of the sub-program in the module (in case a novel version has been specified in the config) as a JSON file alongside the actual module results. This system enables comparison and reproducibility between pipeline versions.
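The version-tree idea can be sketched as follows. The module names, the exact Newick layout, and the hash truncation below are illustrative assumptions, not the CAP's actual implementation.

```python
# Sketch of a multi-layered versioning scheme: each module's version tree
# includes the versions of all upstream modules, rendered as a Newick-style
# string and hashed so any upstream version bump changes the hash.
import hashlib

def version_newick(module, versions, deps):
    """Render a module's upstream version tree as a Newick-like string."""
    children = deps.get(module, [])
    label = f"{module}:{versions[module]}"
    if not children:
        return label
    inner = ",".join(version_newick(c, versions, deps) for c in sorted(children))
    return f"({inner}){label}"

def version_hash(module, versions, deps):
    """Short, stable hash of the full upstream version tree."""
    tree = version_newick(module, versions, deps)
    return hashlib.sha256(tree.encode()).hexdigest()[:12]

# Hypothetical three-module chain: adapter removal -> host removal -> Kraken2.
versions = {"adapter_removal": "2.3.1", "remove_human": "1.0.0", "kraken2": "2.1.0"}
deps = {"kraken2": ["remove_human"], "remove_human": ["adapter_removal"]}
```

Because the hash covers the whole upstream tree, two runs with identical hashes were produced by identical pipeline configurations, which is what makes cross-version comparison cheap.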
To demonstrate some of the capabilities of the CAP we analyzed 1,521 environmental metagenomic samples from the PathoMap project (Afshinnekoo et al., 2015). These samples were collected from dry surfaces in the New York City subway system in 2015.

We present two results that show how multiple tools in the CAP can be combined to produce novel findings, drawing on three different tools that are part of the CAP: functional profiling using HUMAnN2, taxonomic profiling with Kraken2, and average genome size estimation using MicrobeCensus. Our goal with these results is to show how multiple tools can be used to uncover ecological relationships in metagenomic data.

We reduced the dimensionality of our taxonomic profiles to 2 using UMAP and analogously reduced functional profiles to 1 dimension. The functional profiles and taxonomic profiles show some relation (Figure 2), with samples that have similar taxonomic profiles also having similar functional profiles. Average genome size of bacteria in microbiomes has been postulated to have some connection to metabolic capacity, with larger genomes in more generalist bacteria (Nayfach and Pollard, 2015). We show a relationship between average genome size and functional capacity (Figure 3), with a weak correlation (Spearman's rho 0.24, p < 2e-16) between average genome size and the second principal component of the functional profile.
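For reference, Spearman's rho used above can be computed from rank differences. The minimal implementation below (assuming no tied values) uses toy numbers, not the study's data.

```python
# Spearman's rank correlation for tie-free data:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the difference
# between the ranks of x_i and y_i.
def rank(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(x, y):
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy stand-ins for per-sample average genome size and functional PC2.
avg_genome_size = [2.1e6, 3.4e6, 2.8e6, 4.0e6, 3.1e6]
functional_pc2 = [-1.2, 0.1, 0.8, 1.5, 0.4]
rho = spearman_rho(avg_genome_size, functional_pc2)
```

In practice a library routine such as scipy.stats.spearmanr, which also handles ties and reports a p-value, would be used instead.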
The CAP provides a consistent, good-practice baseline for metagenomic studies. It is based largely on existing peer-reviewed tools that have themselves been extensively benchmarked. Routine use of the CAP is meant to improve reproducibility in metagenomic studies, decrease the friction of performing such studies, and enable cross-project comparisons. The routine use of multiple analysis tools can yield novel insights and help uncover ecological relationships.
We would like to thank Robert Petit and Ben Chrobot for their help.
References
Afshinnekoo, E., Meydan, C., Levy, S., and Mason, C. E. (2015). Geospatial Resolution of Human and Bacterial Diversity with City-Scale Metagenomics. Cell Systems, 1:1–15.

Alcock, B. P., Raphenya, A. R., Lau, T. T., Tsang, K. K., Bouchard, M., Edalatmand, A., Huynh, W., Nguyen, A. L. V., Cheng, A. A., Liu, S., Min, S. Y., Miroshnichenko, A., Tran, H. K., Werfalli, R. E., Nasir, J. A., Oloni, M., Speicher, D. J., Florescu, A., Singh, B., Faltyn, M., Hernandez-Koutoucheva, A., Sharma, A. N., Bordeleau, E., Pawlowski, A. C., Zubyk, H. L., Dooley, D., Griffiths, E., Maguire, F., Winsor, G. L., Beiko, R. G., Brinkman, F. S., Hsiao, W. W., Domselaar, G. V., and McArthur, A. G. (2020). CARD 2020: Antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, 48(D1):D517–D525.

Andrews, S. (2010). FastQC. Babraham Bioinformatics.

Consortium, T. H. M. P. (2012). Structure, function and diversity of the healthy human microbiome. Nature, 486(7402):207–14.

Ewels, P., Magnusson, M., Lundin, S., and Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19):3047–3048.

Franzosa, E. A., McIver, L. J., Rahnavard, G., Thompson, L. R., Schirmer, M., Weingart, G., Lipson, K. S., Knight, R., Caporaso, J. G., Segata, N., and Huttenhower, C. (2018). Species-level functional profiling of metagenomes and metatranscriptomes. Nature Methods, 15(11):962–968.

Hyatt, D., Chen, G. L., LoCascio, P. F., Land, M. L., Larimer, F. W., and Hauser, L. J. (2010). Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11.

Kang, D. D., Li, F., Kirton, E., Thomas, A., Egan, R., An, H., and Wang, Z. (2019). MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ, 2019(7).

Langmead, B. and Salzberg, S. L. (2013). Bowtie2. Nature Methods, 9(4):357–359.

Lu, J., Breitwieser, F. P., Thielen, P., and Salzberg, S. L. (2017). Bracken: Estimating species abundance in metagenomics data. PeerJ Computer Science, 2017(1).

Nayfach, S. and Pollard, K. S. (2015). Average genome size estimation improves comparative metagenomics and sheds light on the functional ecology of the human microbiome. Genome Biology, 16(1).

Nikolenko, S. I., Korobeynikov, A. I., and Alekseyev, M. A. (2013). BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics, 14.

Nurk, S., Meleshko, D., Korobeynikov, A., and Pevzner, P. A. (2017). MetaSPAdes: A new versatile metagenomic assembler. Genome Research, 27(5):824–834.

Schubert, M., Lindgreen, S., and Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9(1):88.

Wood, D. E., Lu, J., and Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1).

Supplement

Figure S1:
Quality control stage, black cylinders represent raw data, orange ovals are results, and green boxes are tools.
Figure S2:
Preprocessing stage, black cylinders represent raw data, orange ovals are results, and green boxes are tools.

Figure S3:
Short Read stage, black cylinders represent raw data, orange ovals are results, and green boxes are tools.
Figure S4:
Assembly stage, black cylinders represent raw data, orange ovals are results, and green boxes are tools.

Figure S5: Database stage.