The MetaSUB Microbiome Core Analysis Pipeline Enables Large Scale Metagenomic Analysis
David C. Danko and Christopher Mason∗
Tri-Institutional Computational Biology & Medicine Program, Cornell University, NY, USA
Institute for Computational Biomedicine, Department of Physiology and Biophysics, Weill Cornell Medicine of Cornell University, NY, USA
The Feil Family Brain and Mind Research Institute, Weill Cornell Medicine, NY, USA
∗ Corresponding author
Abstract

Motivation
Accurate data analysis and quality control are critical for metagenomic studies. Though many tools exist to analyze metagenomic data, there is no consistent framework to integrate and run these tools across projects. Currently, computational analysis of metagenomes is time consuming, often misses potentially interesting results, and is difficult to reproduce. Further, comparison between metagenomic studies is hampered by inconsistencies in tools and databases.
Results
We present the MetaSUB Core Analysis Pipeline (CAP), a comprehensive tool to analyze metagenomes and summarize the results of a project. The CAP is designed in a bottom-up fashion to perform QC, preprocessing, and analysis, and even to build relevant databases and install necessary tools.
Availability and Implementation
The CAP is available under an MIT License on GitHub at https://github.com/MetaSUB/CAP2 and on the Python Package Index. Documentation and examples are available on GitHub.
Advances in metagenomics have led to a sharp increase in the number and size of studies. A large number of tools have been built to analyze metagenomes (in particular to generate taxonomic profiles), but there is still wide variation in how these tools are applied. This has created unnecessary friction, as metagenomic studies often develop their pipelines in an ad-hoc fashion and as different choices for implementation (particularly choice of reference database) complicate cross-study comparisons.

The CAP addresses these issues by providing a standardized analysis pipeline and framework, giving researchers an effective starting point from which to launch more custom analyses. The CAP provides consistent naming and file-structure conventions, clear documentation for different pipeline stages, automatic installation of constituent programs, automatic construction of reference databases and indices, project-level summaries of results, a comprehensive versioning system for reproducibility, and an easy-to-use Python interface and API.

Critically, the CAP does not attempt to replace existing analysis tools. Rather, it is a set of standardized practices for metagenomic analysis based largely on existing peer-reviewed tools. The flexible nature of the CAP also means that tools and databases can be swapped or modified with minimal effort.
The CAP is written in Python and can be installed from the Python Package Index using tools like pip or easy_install. Once installed, the CAP can be run from the command line or called from other programs via a Python API. The CAP does not require users to install any constituent programs or databases. All necessary sub-programs are installed through PyPI or Conda, and databases can be downloaded or built from source based on configuration.

To run the CAP a user need only supply a CSV file with sample names and the locations of the forward and reverse FASTQ read files for each sample. Optionally, the user can supply configuration parameters (either through command-line arguments or environment variables) which direct the CAP to install programs and databases in specific locations. If installation of sub-programs through Conda is not available, users can supply custom executables via configuration.

The CAP consists of five stages (Fig 1): a quality control stage (Fig S1), a preprocessing stage (Fig S2), a short read stage (Fig S3), an assembly stage (Fig S4), and a database stage (Fig S5). The database stage consists of building relevant indices for the other stages. Since most users will not want to build databases, the CAP is configured to download versioned databases by default.

The quality control stage consists of three modules: a custom module to count reads and associated statistics (e.g. read length and GC content), FastQC (Andrews, 2010), and MultiQC (Ewels et al., 2016).

The preprocessing stage consists of three modules in series. First, adapters are removed from raw reads using AdapterRemoval (Schubert et al., 2016). Second, reads mapping to the human genome (hg38, with alternate contigs) are removed using Bowtie2 (Langmead and Salzberg, 2013). Third, reads are error corrected using BayesHammer (Nikolenko et al., 2013).

The short read stage follows the preprocessing stage.
It consists of several modules (or small groups of modules) that all start directly from pre-processed reads. Taxonomic profiling is performed by Kraken2 (Wood et al., 2019), using a database based on all of RefSeq microbial, followed by Bracken (Lu et al., 2017). Functional profiling is done with HUMAnN2 (Franzosa et al., 2018) after alignment of reads to UniRef90 using DIAMOND (Buchfink et al., 2014). MicrobeCensus (Nayfach and Pollard, 2015) is used to estimate the average genome size of a microbiome. MASH is used to generate small sketches of the data, which are then used in a custom module to find the similarity to several human microbiomes (Consortium et al., 2012). A custom module is used to generate read statistics.

The assembly stage also follows the preprocessing stage. It consists of assembly using MetaSPAdes (omitting error correction, since that is performed in the preprocessing stage) (Nurk et al., 2017), followed by contig clustering using MetaBAT2 (Kang et al., 2019). ORFs are predicted in contig clusters using Prodigal (Hyatt et al., 2010). Antimicrobial resistance genes are identified on contigs using RGI (Alcock et al., 2020).

After processing, the CAP includes tools to build summary reports for a project. Generally these reports consist of CSV files summarizing the outputs of tools across all samples in the project.
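As a sketch of what such a project-level summary entails, the snippet below merges per-sample taxonomic profiles into one wide table with a row per sample and a column per taxon. The sample names, taxa, and column layout are illustrative assumptions, not the CAP's actual output format.

```python
# Merge per-sample taxonomic profiles (taxon -> relative abundance) into a
# single wide table: one row per sample, one column per taxon observed in
# any sample. All names and values here are toy examples.
profiles = {
    "sample_01": {"Escherichia coli": 0.42, "Bacillus subtilis": 0.10},
    "sample_02": {"Escherichia coli": 0.05, "Cutibacterium acnes": 0.61},
}

def summarize(profiles):
    # Union of all taxa seen across samples, sorted for stable column order.
    taxa = sorted({taxon for profile in profiles.values() for taxon in profile})
    rows = [["sample"] + taxa]
    for name in sorted(profiles):
        # Taxa missing from a sample get an abundance of 0.0.
        rows.append([name] + [profiles[name].get(t, 0.0) for t in taxa])
    return rows

rows = summarize(profiles)
```

A table in this shape can be written out with the standard csv module and loaded directly into downstream analyses across all samples in a project.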
The CAP is composed of modules, each of which corresponds (roughly) to a single sub-program. These modules contain documentation for the specific sub-program and specify its dependencies, outputs, and how the sub-program will be run. Modules are Python classes and may be dynamically modified, e.g. to substitute different module versions or to insert stages for experimentation. This makes it easy to develop and add modules to the CAP.

Figure 1: Overall structure of the pipeline.

Figure 2: Relationship of taxonomic profiles and functional pathways. Position of each point shows the similarity of each sample's taxonomic profile to other samples (nearby samples are more similar, reduced to 2-d with UMAP). Color of each point indicates similarity of functional profiles of samples (reduced to 1-d with UMAP). A rough relationship between taxonomic and functional profiles can be seen.

Figure 3: Relationship between functional pathways and estimated average genome size. The estimated average genome size is positively correlated (rho 0.24, p < 2e-16) with the second principal component of the functional profile.

A major feature of the CAP is its strict multi-layered versioning system. Each module specifies a version number (typically a SemVer) and dependencies. To account for upstream changes, modules also specify a version tree that includes the version numbers of all upstream modules (and their upstream modules, etc). This tree can be accessed either as a human-readable Newick tree or as a version hash. When run, each module records its own version number, the hash of the version tree, and the version of the sub-program in the module (in case a novel version has been specified in the config) as a JSON file alongside the actual module results. This system enables comparison and reproducibility between pipeline versions.
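The version-tree idea can be sketched as follows. The module names, the exact Newick layout, and the hash truncation below are illustrative assumptions, not the CAP's actual implementation.

```python
# Sketch of a multi-layered versioning scheme: each module's version tree
# includes the versions of all upstream modules, rendered as a Newick-style
# string and hashed so any upstream version bump changes the hash.
import hashlib

def version_newick(module, versions, deps):
    """Render a module's upstream version tree as a Newick-like string."""
    children = deps.get(module, [])
    label = f"{module}:{versions[module]}"
    if not children:
        return label
    inner = ",".join(version_newick(c, versions, deps) for c in sorted(children))
    return f"({inner}){label}"

def version_hash(module, versions, deps):
    """Short, stable hash of the full upstream version tree."""
    tree = version_newick(module, versions, deps)
    return hashlib.sha256(tree.encode()).hexdigest()[:12]

# Hypothetical three-module chain: adapter removal -> host removal -> Kraken2.
versions = {"adapter_removal": "2.3.1", "remove_human": "1.0.0", "kraken2": "2.1.0"}
deps = {"kraken2": ["remove_human"], "remove_human": ["adapter_removal"]}
```

Because the hash covers the whole upstream tree, two runs with identical hashes were produced by identical pipeline configurations, which is what makes cross-version comparison cheap.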
To demonstrate some of the capabilities of the CAP we analyzed 1,521 environmental metagenomic samples from the PathoMap project (Afshinnekoo et al., 2015). These samples were collected from dry surfaces in the New York City subway system in 2015.

We present two results that show how multiple tools in the CAP can be combined to produce novel findings, drawing on three different tools that are part of the CAP: functional profiling using HUMAnN2, taxonomic profiling with Kraken2, and average genome size estimation using MicrobeCensus. Our goal with these results is to show how multiple tools can be used to uncover ecological relationships in metagenomic data.

We reduced the dimensionality of our taxonomic profiles to 2 using UMAP and analogously reduced functional profiles to 1 dimension. The functional profiles and taxonomic profiles show some relation (Figure 2), with samples that have similar taxonomic profiles also having similar functional profiles. Average genome size of bacteria in microbiomes has been postulated to have some connection to metabolic capacity, with larger genomes in more generalist bacteria (Nayfach and Pollard, 2015). We show a relationship between average genome size and functional capacity (Figure 3), with a weak correlation (Spearman's rho 0.24, p < 2e-16) between average genome size and the second principal component of the functional profile.
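For reference, Spearman's rho used above can be computed from rank differences. The minimal implementation below (assuming no tied values) uses toy numbers, not the study's data.

```python
# Spearman's rank correlation for tie-free data:
# rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the difference
# between the ranks of x_i and y_i.
def rank(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman_rho(x, y):
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Toy stand-ins for per-sample average genome size and functional PC2.
avg_genome_size = [2.1e6, 3.4e6, 2.8e6, 4.0e6, 3.1e6]
functional_pc2 = [-1.2, 0.1, 0.8, 1.5, 0.4]
rho = spearman_rho(avg_genome_size, functional_pc2)
```

In practice a library routine such as scipy.stats.spearmanr, which also handles ties and reports a p-value, would be used instead.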
The CAP provides a consistent, good-practice baseline for metagenomic studies. It is based largely on existing peer-reviewed tools that have themselves been extensively benchmarked. Routine use of the CAP is meant to improve reproducibility in metagenomic studies, decrease the friction of performing such studies, and enable cross-project comparisons. The routine use of multiple analysis tools can yield novel insights and help uncover ecological relationships.
We would like to thank Robert Petit and Ben Chrobot for their help.
References
Afshinnekoo, E., Meydan, C., Levy, S., and Mason, C. E. (2015). Geospatial Resolution of Human and Bacterial Diversity with City-Scale Metagenomics. Cell Systems, 1:1–15.

Alcock, B. P., Raphenya, A. R., Lau, T. T., Tsang, K. K., Bouchard, M., Edalatmand, A., Huynh, W., Nguyen, A. L. V., Cheng, A. A., Liu, S., Min, S. Y., Miroshnichenko, A., Tran, H. K., Werfalli, R. E., Nasir, J. A., Oloni, M., Speicher, D. J., Florescu, A., Singh, B., Faltyn, M., Hernandez-Koutoucheva, A., Sharma, A. N., Bordeleau, E., Pawlowski, A. C., Zubyk, H. L., Dooley, D., Griffiths, E., Maguire, F., Winsor, G. L., Beiko, R. G., Brinkman, F. S., Hsiao, W. W., Domselaar, G. V., and McArthur, A. G. (2020). CARD 2020: Antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Research, 48(D1):D517–D525.

Andrews, S. (2010). FastQC. Babraham Bioinformatics.

Consortium, T. H. M. P. (2012). Structure, function and diversity of the healthy human microbiome. Nature, 486(7402):207–14.

Ewels, P., Magnusson, M., Lundin, S., and Käller, M. (2016). MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19):3047–3048.

Franzosa, E. A., McIver, L. J., Rahnavard, G., Thompson, L. R., Schirmer, M., Weingart, G., Lipson, K. S., Knight, R., Caporaso, J. G., Segata, N., and Huttenhower, C. (2018). Species-level functional profiling of metagenomes and metatranscriptomes. Nature Methods, 15(11):962–968.

Hyatt, D., Chen, G. L., LoCascio, P. F., Land, M. L., Larimer, F. W., and Hauser, L. J. (2010). Prodigal: Prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics, 11.

Kang, D. D., Li, F., Kirton, E., Thomas, A., Egan, R., An, H., and Wang, Z. (2019). MetaBAT 2: An adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ, 2019(7).

Langmead, B. and Salzberg, S. L. (2013). Bowtie2. Nature Methods, 9(4):357–359.

Lu, J., Breitwieser, F. P., Thielen, P., and Salzberg, S. L. (2017). Bracken: Estimating species abundance in metagenomics data. PeerJ Computer Science, 2017(1).

Nayfach, S. and Pollard, K. S. (2015). Average genome size estimation improves comparative metagenomics and sheds light on the functional ecology of the human microbiome. Genome Biology, 16(1).

Nikolenko, S. I., Korobeynikov, A. I., and Alekseyev, M. A. (2013). BayesHammer: Bayesian clustering for error correction in single-cell sequencing. BMC Genomics, 14.

Nurk, S., Meleshko, D., Korobeynikov, A., and Pevzner, P. A. (2017). MetaSPAdes: A new versatile metagenomic assembler. Genome Research, 27(5):824–834.

Schubert, M., Lindgreen, S., and Orlando, L. (2016). AdapterRemoval v2: rapid adapter trimming, identification, and read merging. BMC Research Notes, 9(1):88.

Wood, D. E., Lu, J., and Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1).

Supplement

Figure S1:
Quality control stage, black cylinders represent raw data, orange ovals are results, and green boxes are tools.
Figure S2:
Preprocessing stage, black cylinders represent raw data, orange ovals are results, and green boxes are tools.

Figure S3:
Short Read stage, black cylinders represent raw data, orange ovals are results, and green boxes are tools.
Figure S4:
Assembly stage, black cylinders represent raw data, orange ovals are results, and green boxes are tools.

Figure S5: Database stage.