MOSGA: Modular Open-Source Genome Annotator

Roman Martin, Thomas Hackl, George Hattab, Matthias G. Fischer, Dominik Heider

Abstract

The generation of high-quality assemblies, even for large eukaryotic genomes, has become a routine task for many biologists thanks to recent advances in sequencing technologies. However, annotating these assemblies, a crucial step towards unlocking the biology of the organism of interest, has remained a complex challenge that often requires advanced bioinformatics expertise. Here we present MOSGA, a genome annotation framework for eukaryotic genomes with a user-friendly web interface that generates and integrates annotations from various tools. The aggregated results can be analyzed with a fully integrated genome browser and are provided in a format ready for submission to NCBI. MOSGA is built on a portable, customizable, and easily extensible Snakemake backend and can thus be tailored to a wide range of users and projects. We provide MOSGA as a freely available public web service at https://mosga.mathematik.uni-marburg.de and as a Docker container at registry.gitlab.com/mosga/mosga:latest. Source code can be found at https://gitlab.com/mosga/mosga

Introduction

Over the last twenty years, whole-genome sequencing and analysis has emerged as an essential and widely used technique across the life sciences. In particular, the sequencing of new microbial genomes is now standard practice and has accelerated discoveries into microbial diversity and evolvability, providing new insight into microbiome function, human health, and ecology.

The technical advances accelerating the generation of genome assemblies have also increased the need for their efficient annotation in terms of genes and other features. A few genome annotation pipelines are available, e.g., PASA [3], MAKER [6], and Funannotate [12]. PASA and Funannotate were developed for plant and fungal genome annotations, respectively. In contrast, MAKER is universal and flexible in terms of modularity and extensibility. However, all of them use command-line interfaces and lack a graphical user interface (GUI), limiting their usability to trained bioinformaticians. Furthermore, these pipelines use strict workflows with predefined tools and parameters, which cannot easily be tailored to non-model organisms such as eukaryotic protists [18].

To overcome these limitations, we have developed the Modular Open-Source Genome Annotator (MOSGA), which has recently been used to annotate protist genomes [4]. MOSGA enables the easy creation of draft eukaryotic genome annotations by providing a GUI with several task-specific prediction tools and a set of Snakemake workflow rules. To our knowledge, MOSGA is the first modular, freely available genome annotation framework and pipeline with a modular graphical user interface.

Software description

The implementation of the MOSGA pipeline comprises three layers (see Figure S1): (A) the graphical web interface, (B) the Snakemake workflow engine [11], and (C) the data accumulator.

The web interface allows pipeline submission, execution, and job order management. The set of tools, parameters, filter options, and supporting information shown in the interface is created dynamically from a JSON rule file, so the interface can be extended by changing only the rules in that file.

The Snakemake pipeline applies the job-dependent subset of our 63 predefined rules (see Table S1). The Snakemake workflow engine ensures optimal use of computational resources and guarantees a successful pipeline execution. Before the actual run, Snakemake generates an exact representation of the task-specific pipeline as a graph; an example is shown in Figure S2. The MOSGA framework can be extended at this layer by defining additional rules for new tools, parameters, or even filters.

The data accumulator is responsible for reading the output of every selected tool and finally writing the corresponding output. Internally, the accumulator stores information in highly abstracted objects created by several classes that read in the different output formats. It unifies, sorts, and filters all retrieved information and additionally performs quality checks. Once every tool has been executed, the accumulator writes the final genome feature table and an SQN file that can be used for NCBI GenBank submissions. Moreover, several workflow rules enable the integration of the prediction outputs into JBrowse for visualizing the annotation results [2]. New input or output formats can easily be implemented by providing new reader or writer classes, and pre-implemented Python classes for reading standard formats like CSV or GFF facilitate the development of extensions.

MOSGA is freely available and hosted at mosga.mathematik.uni-marburg.de. In addition, we provide a Docker file to allow the local deployment of the whole framework.
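The reader-and-accumulator design described above can be illustrated with a minimal Python sketch. It is not MOSGA's actual code: the names `Feature`, `GFFReader`, and `Accumulator`, the `min_score` filter, and the simple tab-separated output are all assumptions made for illustration, and the output is not the NCBI feature table format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Feature:
    """Tool-independent record for one annotated feature (hypothetical)."""
    seqid: str
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    strand: str
    kind: str     # e.g. "gene", "tRNA"
    source: str   # tool that produced the feature
    score: float


class GFFReader:
    """Hypothetical reader: turns GFF3-style lines into Feature objects."""

    def read(self, lines: Iterable[str]) -> List[Feature]:
        features = []
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines, comments, and directives
            cols = line.split("\t")
            if len(cols) < 9:
                continue  # not a complete GFF3 record
            # GFF3 columns: seqid, source, type, start, end, score, strand, ...
            score = float(cols[5]) if cols[5] != "." else 0.0
            features.append(Feature(cols[0], int(cols[3]), int(cols[4]),
                                    cols[6], cols[2], cols[1], score))
        return features


class Accumulator:
    """Hypothetical accumulator: unifies, checks, filters, and sorts features."""

    def __init__(self, min_score: float = 0.0):
        self.min_score = min_score
        self.features: List[Feature] = []

    def add(self, features: Iterable[Feature]) -> None:
        self.features.extend(features)

    def finalize(self) -> List[Feature]:
        # Quality check (start <= end), score filter, and de-duplication.
        kept = {f for f in self.features
                if f.start <= f.end and f.score >= self.min_score}
        return sorted(kept, key=lambda f: (f.seqid, f.start, f.end, f.kind))

    def write_table(self) -> str:
        # Plain TSV for illustration only, not an NCBI submission format.
        rows = ["seqid\tstart\tend\tstrand\ttype\tsource"]
        for f in self.finalize():
            rows.append(f"{f.seqid}\t{f.start}\t{f.end}\t{f.strand}\t{f.kind}\t{f.source}")
        return "\n".join(rows)


gff_lines = [
    "##gff-version 3",
    "chr1\tAugustus\tgene\t100\t500\t0.9\t+\t.\tID=g1",
    "chr1\ttRNAscan-SE\ttRNA\t20\t90\t.\t-\t.\tID=t1",
    "chr1\tAugustus\tgene\t100\t500\t0.9\t+\t.\tID=g1",   # exact duplicate
    "chr1\tAugustus\tgene\t700\t600\t0.9\t+\t.\tID=bad",  # start > end
]
accumulator = Accumulator()
accumulator.add(GFFReader().read(gff_lines))
print(accumulator.write_table())
```

In a design like this, supporting a new tool output only requires another reader class that returns `Feature` objects; the accumulator and writer stay unchanged.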

Results

The MOSGA framework includes state-of-the-art prediction tools for genome annotation that were previously used for annotating four draft genomes in [4]. To extend the applicability of MOSGA to other projects, we included additional tools as described here.

MOSGA uses WindowMasker and RepeatMasker for genome soft-masking and repeat detection [15, 19]. Moreover, we integrated four popular protein-coding gene prediction tools, namely Augustus, BRAKER, GlimmerHMM, and SNAP [5, 10, 14, 20]. Furthermore, we provide a workflow-specific figure (Fig S3) to help users choose the best-fitting tool for ab initio prediction tasks, based on a benchmark recently reported in [17]. For the preparation of RNA-seq data, we also integrated TopHat2 and HISAT2 as alignment tools [8, 9]. An example of an RNA-seq based annotation is available online on the MOSGA site. Functional annotations can be carried out via EggNOG 5 [7] and Swiss-Prot [1], tRNAs are predicted by tRNAscan-SE 2 [13], and ribosomal RNA is identified via SILVA [16].
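For readers unfamiliar with the term, soft-masking marks repeat regions by writing them in lowercase rather than replacing them with N, so downstream gene predictors can still read the sequence. The sketch below only illustrates the resulting representation; the function `soft_mask` and its 0-based, half-open interval convention are assumptions for this example and say nothing about how WindowMasker or RepeatMasker detect repeats.

```python
from typing import Iterable, Tuple


def soft_mask(sequence: str, repeats: Iterable[Tuple[int, int]]) -> str:
    """Lowercase the bases inside each repeat interval.

    Intervals are 0-based and half-open (start inclusive, end exclusive);
    positions outside the sequence are clipped.
    """
    masked = list(sequence.upper())
    for start, end in repeats:
        start, end = max(0, start), min(len(masked), end)
        for i in range(start, end):
            masked[i] = masked[i].lower()
    return "".join(masked)


print(soft_mask("ACGTACGTACGT", [(2, 6), (9, 11)]))  # prints "ACgtacGTAcgT"
```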

Discussion

Unlike other genome annotation pipelines and frameworks, MOSGA's intuitive interface enables the user to choose suitable applications depending on the scientific question and task at hand. This permits users to build their own task-specific pipeline or workflow and serves a broad community. Moreover, the modularity of the MOSGA framework allows quick extensions with new tools and modifications. MOSGA can be used directly to prepare NCBI-compliant submissions, and the underlying Snakemake workflow engine guarantees full reproducibility.

Acknowledgments

This work was supported by the BMBF-funded de.NBI Cloud within the German Network for Bioinformatics Infrastructure (de.NBI) (031A537B, 031A533A, 031A538A, 031A533B, 031A535A, 031A537C, 031A534A, 031A532B). We thank Marius Welzel for his advice on best practices for Snakemake workflows.

Funding

This study was funded by the European Regional Development Fund, EFRE-Program, European Territorial Cooperation (ETZ).

References

1. A. Bairoch, B. Boeckmann, S. Ferro, and E. Gasteiger. Swiss-Prot: juggling between evolution and stability. Briefings in Bioinformatics, 5(1):39–55, 2004.
2. R. Buels, E. Yao, C. M. Diesh, R. D. Hayes, M. Munoz-Torres, G. Helt, D. M. Goodstein, C. G. Elsik, S. E. Lewis, L. Stein, and I. H. Holmes. JBrowse: A dynamic web platform for genome visualization and analysis. Genome Biology, 17(1):1–12, 2016.
3. B. J. Haas, A. L. Delcher, S. M. Mount, J. R. Wortman, R. K. Smith, L. I. Hannick, R. Maiti, C. M. Ronning, D. B. Rusch, C. D. Town, S. L. Salzberg, and O. White. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research, 31(19):5654–5666, 2003.
4. T. Hackl, R. Martin, K. Barenhoff, S. Duponchel, D. Heider, and M. G. Fischer. Four high-quality draft genome assemblies of the marine heterotrophic nanoflagellate Cafeteria roenbergensis. Scientific Data, 7(1):29, 2020.
5. K. J. Hoff, S. Lange, A. Lomsadze, M. Borodovsky, and M. Stanke. BRAKER1: Unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics, 32(5):767–769, 2016.
6. C. Holt and M. Yandell. MAKER2: An annotation pipeline and genome-database management tool for second-generation genome projects. BMC Bioinformatics, 12(1):491, 2011.
7. J. Huerta-Cepas, D. Szklarczyk, D. Heller, A. Hernández-Plaza, S. K. Forslund, H. Cook, D. R. Mende, I. Letunic, T. Rattei, L. J. Jensen, C. Von Mering, and P. Bork. EggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research, 47(D1):D309–D314, 2019.
8. D. Kim, B. Langmead, and S. L. Salzberg. HISAT: A fast spliced aligner with low memory requirements. Nature Methods, 12(4):357–360, 2015.
9. D. Kim, G. Pertea, C. Trapnell, H. Pimentel, R. Kelley, and S. L. Salzberg. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4):R36, 2013.
10. I. Korf. Gene finding in novel genomes. BMC Bioinformatics, 5:59, 2004.
11. J. Köster and S. Rahmann. Snakemake-a scalable bioinformatics workflow engine. Bioinformatics, 28(19):2520–2522, 2012.
12. J. Love, J. Palmer, J. Stajich, T. Esser, E. Kastman, D. Bogema, and D. Winter. funannotate. Zenodo, 2020.
13. T. M. Lowe and P. P. Chan. tRNAscan-SE On-line: integrating search and context for analysis of transfer RNA genes. Nucleic Acids Research, 44(W1):W54–W57, 2016.
14. W. H. Majoros, M. Pertea, and S. L. Salzberg. TigrScan and GlimmerHMM: Two open source ab initio eukaryotic gene-finders. Bioinformatics, 20(16):2878–2879, 2004.
15. A. Morgulis, E. M. Gertz, A. A. Schäffer, and R. Agarwala. WindowMasker: Window-based masker for sequenced genomes. Bioinformatics, 22(2):134–141, 2006.
16. C. Quast, E. Pruesse, P. Yilmaz, J. Gerken, T. Schweer, P. Yarza, J. Peplies, and F. O. Glöckner. The SILVA ribosomal RNA gene database project: Improved data processing and web-based tools. Nucleic Acids Research, 41(D1):590–596, 2013.
17. N. Scalzitti, A. Jeannin-Girardon, P. Collet, O. Poch, and J. D. Thompson. A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC Genomics, 21(1):1–20, 2020.
18. S. J. Sibbald and J. M. Archibald. More protist genomes needed. Nature Ecology and Evolution, 1(5):145, 2017.
19. A. Smit, R. Hubley, and P. Green. RepeatMasker Open-4.0. 2013–2015.
20. M. Stanke and B. Morgenstern. AUGUSTUS: a web server for gene prediction in eukaryotes that allows user-defined constraints. Nucleic Acids Research, 33(Web Server issue):W465–7, 2005.