EDGE COVID-19: A Web Platform to generate submission-ready genomes for SARS-CoV-2 sequencing efforts

Abstract

Genomics has become an essential technology for surveilling emerging infectious disease outbreaks. A wide range of technologies and strategies for pathogen genome enrichment and sequencing are being used by laboratories worldwide, together with different, and sometimes ad hoc, analytical procedures for generating genome sequences. As a result, public repositories now contain non-standard entries of varying quality. A standardized analytical process for consensus genome sequence determination, particularly for outbreaks such as the ongoing COVID-19 pandemic, is critical to provide a solid genomic basis for epidemiological analyses and well-informed decision making. To address this need, we have developed a bioinformatic workflow to standardize the analysis of SARS-CoV-2 sequencing data generated with either the Illumina or Oxford Nanopore platforms. Using an intuitive web-based interface, this workflow automates SARS-CoV-2 reference-based genome assembly, variant calling, and lineage determination, and provides the ability to submit the consensus sequence and necessary metadata to GenBank or GISAID. Given a raw Illumina or Oxford Nanopore FASTQ read file, this web-based platform enables non-bioinformatics experts to automatically produce a SARS-CoV-2 genome that is ready for submission to GISAID or GenBank.


EDGE COVID-19: A Web Platform to generate submission-ready genomes for SARS-CoV-2 sequencing efforts

Chien-Chi Lo *, Migun Shakya, Karen Davenport, Mark Flynn, Jason Gans, Adán Myers y Gutiérrez, Bin Hu, Po-E Li, Elais Player Jackson, Yan Xu, and Patrick S. G. Chain *

Bioscience Division, Los Alamos National Laboratory, Los Alamos, New Mexico

Contributed equally to this work.

*To whom correspondence should be addressed. Chien-Chi Lo ( [email protected] ) and Patrick Chain ( [email protected] )

Running title: EDGE COVID-19: App for SARS-CoV-2 genome analysis

Keywords: viral genome assembly, consensus and variant calling, bioinformatics workflow

ABSTRACT

Genomics has become a pivotal technology for understanding emerging outbreaks of infectious diseases. A wide range of technologies and strategies for pathogen genome enrichment and sequencing are being used by laboratories and institutes, resulting in different analytical procedures for generating genome sequences and in non-standard entries of varying quality in public repositories. A standardized analytical process for genome sequence generation, particularly for outbreaks such as the ongoing COVID-19 pandemic, is critical to provide scientists with a solid genomic basis for epidemiological analyses and for making well-informed decisions. To address this challenge, we have developed a web platform (EDGE COVID-19) to run standardized workflows for reference-based genome assembly and to perform further analysis of Illumina or Oxford Nanopore Technologies data for SARS-CoV-2 genome sequencing projects. Given a raw Illumina or Oxford Nanopore FASTQ read file, this web platform automates the production of a SARS-CoV-2 genome that is ready for submission to GISAID or GenBank. EDGE COVID-19 is open-source software that is available as a web application for public use at https://covid19.edgebioinformatics.org, and as a Docker container at https://hub.docker.com/r/bioedge/edge-covid19 for local installation.

INTRODUCTION

Genome sequencing is playing an increasingly critical role in understanding disease outbreaks. Highlighting this trend, public health laboratories and scientists around the world have been sequencing strains of SARS-CoV-2, a positive-stranded, ~29 kb RNA virus and the etiological agent responsible for the global COVID-19 pandemic. In under 6 months, more than 40,000 genomes have been sequenced and made publicly available via genome repositories such as GISAID (Shu and McCauley 2017) and GenBank (Clark et al. 2016). Open access to these genomes has enabled scientists to design and refine diagnostic assays for the virus (Jansen van Rensburg et al. 2016), monitor its evolution (Petit and Read 2018), and track its movement within communities and around the globe (Hadfield et al. 2018).

Multiple approaches have been applied to obtain SARS-CoV-2 genome sequences, including enriching for the virus prior to sequencing or performing deep, random, ‘shotgun’ sequencing of clinical samples. One effort helping to standardize these sequencing procedures comes from the international ARTIC network, which provides end-to-end instructions for the Oxford Nanopore Technologies (ONT) sequencing platforms. Currently in its third revision, this protocol outlines a multiplex PCR amplicon sequencing approach for SARS-CoV-2 (https://artic.network/ncov-2019). Detailed command-line instructions for generating the consensus sequence of the viral genome from “raw” sequencing data are available on their website as well. The wet-lab protocols have also been adapted for Illumina sequencing (Sevinsky et al. 2020), allowing a wider range of laboratories and institutes to enrich for and sequence SARS-CoV-2. Despite these efforts, there is no single, standardized protocol, and a number of sequencing strategies continue to be employed. In addition to having multiple sequencing strategies available, routine SARS-CoV-2 genome sequencing by laboratories is further challenged by the complexity of downstream bioinformatics data processing. There is a distinct need for a standardized, easy-to-use, GUI-based web application that works for all these sequencing protocols and covers the basic bioinformatics analyses required to generate genomes for submission to public repositories. Due to a lack of standardization, there is significant variability in the quality of publicly available genome sequences, which can potentially confound downstream analyses.
The importance of standardized procedures for generating the genomes that are critical for robust interpretation of SARS-CoV-2 evolution and epidemiology has, in part, motivated the creation of SPHERES (SARS-CoV-2 Sequencing for Public Health Emergency Response, Epidemiology and Surveillance), a new open genomics consortium to coordinate SARS-CoV-2 sequencing across the United States. A basic, standardized, bioinformatic genome assembly workflow for reads obtained from clinical samples should, at minimum, include: per-read data quality assessment, background read removal/filtering (e.g. human data), per-read trimming, and genome sequence assembly (either de novo or, more traditionally, reference-based). Submission of genome sequences to public repositories further requires sample and sequencing metadata (e.g. sample collection date, patient age, gender, geographic location, etc.) that also require standardization. In addition to standardization, it is essential to reduce the complexity of the process of genome assembly for scientists during an outbreak. While there are many open-source software options for each individual bioinformatic step, most of these can only be executed from the command line in a Linux environment. This software complexity presents a major challenge to most bench scientists, who may not have extensive training in bioinformatics, and can result in ad hoc procedures for deciding the final genome sequence of an individual sample. Finally, the lack of automated genome sequence submission to public repositories results in additional complexity and increases the chance of introducing human errors that can further confound downstream analyses. This barrier also delays timely deposition and slows down follow-on analyses.

RESULTS

Here we present EDGE COVID-19, a user-friendly web platform that enables rapid, automated and standardized processing of FASTQ reads, from either Illumina or ONT platforms, and the generation of a consensus SARS-CoV-2 genome sequence. EDGE COVID-19 records sample and sequencing metadata sufficient for submission to GenBank and GISAID, with an automated process available for GISAID submission. EDGE COVID-19 also includes single nucleotide polymorphism (sample to reference differences) and variant (within sample differences) analysis modules. Results are presented as tables and as an interactive genome browser within the GUI. This platform is freely available as a Docker container for local installation and as a public web service, where users can register for accounts, upload data, run analyses, and download results. This platform also enables direct submission of the genome and metadata to the GISAID genome repository. By default, a short description of the method used for calling the consensus is placed within the metadata form, while other metadata fields are entered manually by the user.

Validation of the workflow

Datasets from the NCBI Sequence Read Archive (SRA) were used to validate the EDGE COVID-19 default workflow. As of May 4, 2020, there were 45 BioProjects in NCBI representing over 4,500 COVID-19-related SRA datasets, from a variety of institutes, generated using a range of sample preparation methods, sequencing strategies, and sequencing platforms. Of these BioProjects, only 12 could be connected via metadata to corresponding SARS-CoV-2 genomes found in either GenBank or GISAID. We selected the SRA datasets from the FDA-ARGOS BioProject, as well as one example dataset for all other BioProjects with an associated SARS-CoV-2 genome sequence. The FDA-ARGOS database is ideal for validation, as it comprises a collection of quality-controlled and curated reference genomes intended to support diagnostic use and regulatory decisions (Sichtig et al. 2019). Supplemental Table S1 contains full descriptions of the datasets and our validation result summaries. All of the datasets and their analyses are hosted on the https://edge-covid19.edgebioinformatics.org website, and can be re-run at any time by any registered user by simply providing the same SRA accession number as input and using default parameters.

The SARS-CoV-2 Wuhan Hu-1 genome (NC_045512.2) is used as the default reference, and a consensus for each SRA dataset was calculated using the default parameters in the EDGE COVID-19 workflow (for details, see Methods). All differences (Single Nucleotide Polymorphisms or SNPs, insertions/deletions or indels, gaps, and ambiguous bases) between the consensus genomes from EDGE COVID-19 and the reference genome used to call the consensus were verified manually by examining the underlying read support. For validation purposes, the corresponding GenBank or GISAID genome for each SRA dataset was also aligned to the Wuhan Hu-1 genome using NUCmer (Kurtz et al. 2004) to compare differences, and was directly compared to the consensus genome generated by our workflow. All differences were analyzed by examining the underlying read support.

While some differences were observed, these comparisons demonstrate that the EDGE COVID-19 workflow automatically produces genomes that are of equal quality to those found in the public repositories. For example, in terms of SNPs, no differences between any of the EDGE COVID-19 consensus genomes and the GenBank/GISAID repository genomes were found, with the single exception of one ONT dataset (SRR11483109) generated using the ARTIC protocol (V1). Although our workflow and the genome in GISAID (GISAID: EPI_ISL_416419) shared the same four SNPs when compared to the Wuhan Hu-1 reference genome, our workflow detected one additional SNP at reference position 17747. Greater than 85% of the 1391 reads that map to this position support the single C to T non-synonymous (proline to leucine) change in the helicase ORF1ab gene. This SNP may have been missed in the originally deposited genome because it is located within one of the PCR primers employed in the enrichment protocol. Insufficient primer trimming can result in missing genomic differences in primer regions (Figure 1A). The only indel identified by the EDGE COVID-19 workflow in the examined datasets was a 3-nucleotide codon (Lysine) deletion in the ORF1ab polyprotein that was also captured in the repository genome (GenBank: MT039887.1).

Almost all differences between the EDGE COVID-19 consensus and the repository genomes were found with respect to gaps (i.e., missing sequences compared with the reference) and ambiguous bases (Note: entries in the repositories do not distinguish these two, and only report N’s, while EDGE COVID-19 specifies gaps as lower-case n’s and ambiguities as upper-case N’s).
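Because EDGE COVID-19 encodes gaps as lower-case n’s and ambiguities as upper-case N’s, the two can be tallied separately from a consensus sequence. The following is a minimal sketch of such a tally, not part of the EDGE COVID-19 code base; the function names and the example sequence are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, List, Tuple


def find_runs(consensus: str, symbol: str) -> List[Tuple[int, int]]:
    """Return 1-based, inclusive (start, end) coordinates of each run of `symbol`."""
    pattern = re.escape(symbol) + "+"
    return [(m.start() + 1, m.end()) for m in re.finditer(pattern, consensus)]


def summarize_masking(consensus: str) -> Dict[str, object]:
    """Separate gaps (lower-case n) from ambiguous calls (upper-case N)."""
    gaps = find_runs(consensus, "n")
    ambiguous = find_runs(consensus, "N")
    return {
        "gap_runs": gaps,
        "ambiguous_runs": ambiguous,
        "gap_bases": sum(end - start + 1 for start, end in gaps),
        "ambiguous_bases": sum(end - start + 1 for start, end in ambiguous),
    }


if __name__ == "__main__":
    # Hypothetical toy consensus: gaps at both ends, one ambiguous stretch inside.
    toy_consensus = "nnnACGTNNACGTnn"
    print(summarize_masking(toy_consensus))
```

A repository genome that reports only upper-case N’s would yield no gap runs under this tally, which is why the two categories cannot be separated after deposition.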
Only one dataset (SRR11140749), sequenced by random PCR, covered the full length of the genome, while all other datasets had at least one gap at one end of the genome, confirming that the ends of genomes are not targeted with amplicon approaches and are not as well represented as the middle of the genome with shotgun or capture data. The largest differences in gaps reported were observed within the viral genome. Because “gaps” in the EDGE COVID-19 workflow (reported as lower-case n’s in the consensus) do not have any underlying reads, we infer that sequence from a reference genome was used to fill in the gap prior to deposition in a genome repository. As an example, for the SARS-CoV-2 genome recovered from a tiger (SRR11587600) and sequenced using an ARTIC protocol with an Illumina platform, our workflow introduced a large gap (as a stretch of 251 n’s, whose coordinates matched the ARTIC amplicon primer set [...] to fill in this gap. Similar observations were made for several other datasets, including gaps found in the middle of the genome (e.g. with an Illumina ARTIC, SRR11563872, or Illumina shotgun protocol, SRR11092057), as well as at the ends of the genome (see Supplemental Table S1).

A similar trend was found with nucleotides labeled as “ambiguous” using the EDGE COVID-19 workflow (positions with either insufficient coverage or allelic variation with no dominant allele; output as upper-case N’s in the consensus). In most cases, the EDGE COVID-19 workflow, with a five-fold (5X) coverage cutoff, was more conservative at the ends of the genome, where there is generally less data to support a specific nucleotide call. Additional differences were also found within the genome, where our workflow was again more conservative. For example, using the SRR11397722 ONT dataset (V1 ARTIC amplicon protocol), the EDGE COVID-19 workflow reported 269 ambiguities in two regions within the genome.
One of the two locations, from positions 22878 to 23144 (267 nucleotides, corresponding to ARTIC amplicon [...] (Itokawa et al. 2020). While a majority-rules consensus calling algorithm almost always agrees with the reference Wuhan Hu-1 genome in these regions, this is not always the case in such low-coverage areas. Because protocols for calling consensus genomes are not part of standard metadata entries, the algorithm used to determine the consensus in existing deposited genomes remains unknown. It is therefore not possible to tell, without rigorous analysis of the raw supporting data, whether the genomes are indeed complete as reported or whether sections have been filled in with a reference sequence.

We found a notable exception in SRR11483109 (from the Yale School of Public Health group), which was processed with more stringent parameters and has more dropouts (genomic regions with N’s) in the corresponding genome (GISAID: EPI_ISL_416419) compared with the sequence produced by the default EDGE COVID-19 workflow. Six different regions containing 200-300 N’s were reported in the GISAID genome and consist of the unique portions of ARTIC amplicons 9, 18, 45, 64, 67, and 76. Only two of these regions (amplicons 18 and 64) were also labeled as gaps using the default EDGE COVID-19 workflow (Figure 1C). In an online preprint of their work, the Yale group describe using a 20-fold coverage cutoff value for calling their consensus sequences (Fauver et al. 2020), while the default for the EDGE COVID-19 workflow is five-fold. Indeed, further verification of the other four reported gap regions in the deposited genome showed that three of them (amplicons 9, 45 and 67) were covered at an average of 10-16 fold coverage, with only a single ambiguous nucleotide reported in amplicon 9 at position 2625. Amplicon 76 had only 1X coverage, and was reported by EDGE COVID-19 as nine short (<10 nt) gaps interspersed among 248 ambiguous nucleotides.
Ironically, while stringent fold-coverage parameters were used for calling a consensus genome, this particular genome was the only genome we observed to have missed a SNP (see above). Together, these examples provide a survey of genome consensus calling techniques and the variability among publicly available genomes with associated SRA read sets.

Figure 1: Position of an extraneous SNP and coverage of ARTIC amplicons across the genome. (A) An EDGE COVID-19 output, JBrowse view of reads from SRR11483109 mapped to the reference genome’s position ~17650-18850. The first section can show the nucleotide sequence in the reference genome (if sufficiently zoomed in); the next section displays gene annotations; the third outlines the ARTIC primer locations; the fourth is coverage information (grey) with alternate bases and their abundance indicated by colored histograms and ‘droplets’ (when there is sufficient evidence to call the alternate base as the new consensus); and finally the fifth section displays reads that are mapped to those regions (nucleotide differences in the reads compared with the reference are indicated by colored lines in the read, and are summarized in the fourth section). The red droplet symbol at position 17747 (red arrow) lies within the right primer 58 and indicates a newly discovered C to T non-synonymous SNP. The other two SNPs (droplets) in this figure were also discovered and reported in the deposited genome. A closer zoomed-in view of the newly discovered SNP under the “nCov-2019_58_right” primer can be found in Supplemental Figure S1. (B) A bar chart output by EDGE COVID-19 shows the fold coverage of each unique amplicon region generated using the ARTIC protocol for sample SRR11397722. Amplicon [...] (C) Another ARTIC amplicon coverage bar chart for sample SRR11483109. Blue colored bars represent amplicons that passed the EDGE COVID-19 coverage threshold but were still lower than 20X, while black bars did not pass the coverage threshold and were reported as ambiguous regions.
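The interplay between the five-fold EDGE COVID-19 default and the 20-fold cutoff described by Fauver et al. can be illustrated with a small classifier over per-amplicon mean depth. This is only a sketch under stated assumptions: the function name, the label strings, and the depth values in the usage example are hypothetical and are not taken from the EDGE COVID-19 output or from Supplemental Table S1.

```python
from typing import Dict


def classify_amplicons(
    mean_depth: Dict[str, float],
    default_cutoff: float = 5.0,   # five-fold default stated for EDGE COVID-19
    strict_cutoff: float = 20.0,   # 20-fold cutoff described by Fauver et al. 2020
) -> Dict[str, str]:
    """Label each amplicon by which coverage cutoffs its mean depth satisfies."""
    labels: Dict[str, str] = {}
    for name, depth in mean_depth.items():
        if depth < default_cutoff:
            labels[name] = "below_default_cutoff"
        elif depth < strict_cutoff:
            labels[name] = "between_cutoffs"
        else:
            labels[name] = "above_strict_cutoff"
    return labels


if __name__ == "__main__":
    # Hypothetical depths chosen only to exercise the three classes.
    example = {"amplicon_A": 1.0, "amplicon_B": 12.0, "amplicon_C": 250.0}
    for name, label in classify_amplicons(example).items():
        print(name, label)
```

Amplicons in the middle class are the ones that would be reported as sequence under a five-fold cutoff but masked as N’s under a 20-fold cutoff, which matches the pattern described for amplicons 9, 45 and 67.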

CONCLUSION/DISCUSSION

While genomes are being rapidly sequenced and provided to the public during infectious disease outbreak scenarios, such as with the COVID-19 pandemic, it is most often unclear how the genomes themselves have been generated. This is further complicated by the number of different sequencing protocols available, the number of different bioinformatics strategies being applied, the number of different groups, with diverse backgrounds, generating the data, and the dearth of publicly available raw data to help validate the accuracy of genomes. Indeed, the above examples illustrate a number of inconsistencies in publicly available genomes, highlighting the need for a consistent analytical workflow for generating consensus sequences, particularly during a pandemic, when genome analysis is routinely used for epidemiology and can impact response strategies. The EDGE COVID-19 web platform is a robust, standardized, user-friendly bioinformatics workflow designed for the global community in response to the ongoing COVID-19 pandemic. EDGE COVID-19 can help produce submission-ready, high-quality genomes from raw Illumina or ONT sequencing data generated from a range of protocols. The source code is being actively maintained and developed, with a commitment to adopting community-agreed-upon standards (such as those being discussed and implemented within the SPHERES consortium). The use of a software container to ease local installation and implementation, and the ability to execute EDGE COVID-19 via command line or via a web-based interface, are features designed to lower the barrier for widespread adoption. While this workflow is one of many possible examples of a standardized process for generating consensus genomes, it is clear that adoption of this type of standardized workflow by the scientific and public health communities during an outbreak would facilitate the downstream use and interpretation of these genomes.

METHODS

Implementation of EDGE COVID-19

EDGE COVID-19 is a tailored version of the more generic, open-source EDGE bioinformatics platform (Li et al. 2017), which was developed to democratize genomics analysis and make complex bioinformatic tools available to non-experts. EDGE COVID-19 is focused on generating high-quality SARS-CoV-2 genomes from raw sequencing data and is therefore stripped of non-essential bioinformatic tools and modules; instead, it includes strict default reference genome alignment and reference-guided consensus generation. This COVID-19-specific workflow drives a series of best-practice, open-source bioinformatics software to aid in the reconstruction of complete genomes of SARS-CoV-2 from raw shotgun, or SARS-CoV-2 enriched, Illumina or ONT sequencing data (including those generated using the popular ARTIC protocols). Users can input their own Illumina or ONT sequencing FASTQ files, or identify datasets to be automatically obtained directly from the NCBI Sequence Read Archive (SRA) or European Nucleotide Archive (ENA) for analyses. The default workflow (Figure 2) of EDGE COVID-19 consists of data quality control and filtering, read mapping to a reference genome sequence, generation of a consensus-based genome, variant analysis, and post-processing steps to provide web-based views of the results. Detailed parameters for each step can be accessed from the online documentation. Initially, the data is processed for quality control using FaQCs v2.09 (Lo and Chain 2014) for either Illumina or ONT reads; ONT reads are additionally processed using Porechop v0.2.3 (Wick 2017) for adapter trimming. Removal of human reads by reference mapping is provided as an optional feature. Data QC is followed by mapping the high-quality reads to a reference genome using either bowtie2 v2.4.1 (Langmead and Salzberg 2012), BWA v0.7.12 (Li and Durbin 2009) (default), or minimap2 v2.17 (Li 2018) for Illumina data, and BWA or minimap2 (default) for ONT data.
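The platform-specific aligner choices and defaults described above can be summarized as a small lookup table. The sketch below is illustrative only; the dictionary layout and the `select_aligner` function are hypothetical and do not reflect how EDGE COVID-19 stores or validates its configuration.

```python
from typing import Optional

# Aligner options and defaults per platform, as stated in the text above.
ALIGNERS = {
    "illumina": {"choices": ("bowtie2", "bwa", "minimap2"), "default": "bwa"},
    "ont": {"choices": ("bwa", "minimap2"), "default": "minimap2"},
}


def select_aligner(platform: str, requested: Optional[str] = None) -> str:
    """Return the aligner to use for a platform, falling back to its default."""
    key = platform.strip().lower()
    if key not in ALIGNERS:
        raise ValueError(f"unsupported platform: {platform!r}")
    entry = ALIGNERS[key]
    if requested is None:
        return entry["default"]
    choice = requested.strip().lower()
    if choice not in entry["choices"]:
        raise ValueError(f"{requested!r} is not offered for {platform!r} reads")
    return choice


if __name__ == "__main__":
    print(select_aligner("Illumina"))         # default for Illumina reads
    print(select_aligner("ONT"))              # default for ONT reads
    print(select_aligner("ONT", "bwa"))       # explicit, permitted override
```

Rejecting bowtie2 for ONT input in this sketch mirrors the text, which lists only BWA and minimap2 for ONT data.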
The SARS-CoV-2 Wuhan Hu-1 NCBI RefSeq genome, NC_045512.2, is used as the reference by default (Wu et al. 2020), although we have slightly modified the RefSeq sequence by removing the 33-nucleotide poly A tail from the 3’ end of the genome. Users can also specify any additional SARS-CoV-2 genome sequence available from GenBank, or upload a user-specific sequence as reference. To provide high-quality consensus genomes, the default behavior is to use stringent parameters (Figure 2). Most of these parameters can be altered by the users. SNP and variant calling are provided by SAMtools v1.10 and BCFtools v1.10.2 (Li et al. 2009), followed by vcfutils.pl within SAMtools to further reduce false positive calls and filter the raw variant calls (see online documentation for detailed parameters). Outputs of this workflow include statistics for each series of analyses (in both graphical and tabular forms), and the results page for each dataset provides the interactive genome browser JBrowse (Buels et al. 2016) to enable a deeper, visual inspection of the variant calls and underlying data. Finally, EDGE COVID-19 provides an optional automated submission (see below) of the consensus genome sequence (together with required metadata input by the users) to GISAID, which can be performed after viewing the results.

Calling a Consensus Genome

Additional filtering steps after read mapping have been implemented based on how the samples were sequenced. For sequences generated using a shotgun approach, we first remove duplicate PCR sequences ( samtools markdup -r ), filter out low-quality mapped reads based on MAPping Quality score (<60), base quality (<20 for Illumina and <7 for ONT), and Base Alignment Quality for Illumina (Li 2011), and, to alleviate memory constraints, further subsample to 8000X coverage ( samtools mpileup --max-depth 8000 ) before calling the consensus genome. For samples generated using amplicon-based protocols from ARTIC and CDC (Paden et al. 2020), we trim the primers using align_trim (Quick et al. 2017) and then process the data as for shotgun data, but without removing PCR duplicates. The workflow then calls gaps, ambiguous positions, and nucleotides in the consensus genome. When a BED file containing primer information is supplied for align_trim, an interactive graphical summary of the abundance of each amplicon is presented, allowing users to identify when and where amplicon dropouts have occurred. If no reads are mapped to a position, a gap is called (indicated by a lower-case n), while if a position is covered by fewer than five reads, an upper-case N is placed in that position. For any position that has five or more mapped reads, we further check the proportion of reads that either agree or disagree with the reference genome. Any nucleotide (or deletion) found in greater than 50% of all mapped reads is called as the consensus (a minimum of 5 reads must agree); otherwise the position is also deemed ambiguous and an upper-case N is called. For ONT datasets, we implemented a more stringent threshold for indels at 80% to account for a higher frequency of these errors, particularly within homopolymeric nucleotide stretches (Rang et al. 2018; Cretu Stancu et al. 2017).
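The per-position decision rules above can be sketched as a single function. This is a simplified illustration, not the EDGE COVID-19 implementation: it assumes the filtered read alleles at one position are already available as a list of strings, models a deletion as "-", ignores insertions, and treats the 80% ONT indel threshold as a strict "greater than" comparison, which the text does not specify. The name `call_position` and its parameters are hypothetical.

```python
from collections import Counter
from typing import Optional, Sequence


def call_position(
    alleles: Sequence[str],
    min_depth: int = 5,                      # five-fold coverage cutoff
    min_fraction: float = 0.5,               # allele must exceed 50% of reads
    indel_fraction: Optional[float] = None,  # e.g. 0.8 for ONT data
) -> str:
    """Call one consensus position from the alleles of the reads covering it.

    Returns "n" for a gap (no reads), "N" for an ambiguous position, or the
    supported allele ("A", "C", "G", "T", or "-" for a deletion).
    """
    depth = len(alleles)
    if depth == 0:
        return "n"   # no mapped reads: gap
    if depth < min_depth:
        return "N"   # insufficient coverage: ambiguous
    allele, count = Counter(alleles).most_common(1)[0]
    threshold = min_fraction
    if allele == "-" and indel_fraction is not None:
        threshold = indel_fraction   # stricter rule for indels (assumed for ONT)
    if count >= min_depth and count / depth > threshold:
        return allele
    return "N"       # no dominant allele: ambiguous


if __name__ == "__main__":
    print(call_position([]))                                            # gap
    print(call_position(["A"] * 4))                                     # too few reads
    print(call_position(["A"] * 6 + ["G"] * 4))                         # majority allele
    print(call_position(["-"] * 7 + ["A"] * 3, indel_fraction=0.8))     # ONT indel rule
```

Note that the minimum of five agreeing reads is checked separately from the fraction, so a position with six reads of which four agree is still reported as ambiguous in this sketch.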
All parameters can be adjusted within the web interface.

Figure 2: EDGE COVID-19 workflow overview. The workflow takes FASTQ files from Illumina or ONT as input, and processes these raw data to produce a genome ready for repository submission. Dashed lines indicate optional steps. Briefly, raw data are processed through several filters and quality trimming, and the reads are then mapped to a reference genome, which is followed by variant and statistical analyses and consensus genome reporting.

Optional data trimming and removal

Human host sequence screening is provided as an option. All input reads are mapped against the human reference genome GRCh38 using minimap2, and the reads that match at or above 80% identity (default settings) are removed prior to downstream processing. A human-filtered file is also made available if raw data without human reads are needed. For an amplicon sequencing experiment, such as the one employed in the ARTIC protocol, the option to remove primers before downstream processing is also available. We have provided two options to remove primers: one requires a file of primer sequences and searches the 5’ end of reads for trimming, while the second requires a BED file containing the amplicon positions and implements align_trim to soft clip the primers from each amplicon set.

Optional de novo assembly

EDGE COVID-19 also provides de novo assembly of sequencing reads and the annotation of genomes, but these features are disabled by default (since genome consensus generation from read alignment to a reference sequence has been the de facto approach with viral outbreaks). For exploratory investigations, the assembly module comes equipped with multiple de novo assemblers for both short (Illumina) and long (ONT) read datasets. As these assemblers have been designed for shotgun sequencing approaches, it is worth noting that amplicon protocols often do not result in single contigs spanning the entire length of the viral genome. Once invoked, the assembled contigs are also aligned to the selected reference genome using NUCmer, and SNPs, indels and gaps are also reported from these alignments. The assembled contigs can also be annotated by transferring the annotation from the NC_045512.2 SARS-CoV-2 reference genome sequence (Otto et al. 2011).

Optional data submission

Once the consensus and supporting read evidence have been viewed, owners of the project are able to submit the genome directly from the platform to GISAID (see documentation for more details). Several criteria must be met in order for owners to have the option for automatic genome submission: genomes must have a length greater than 25 kb, at least 95% of the genome must have assigned nucleotides (and not simply ambiguities or gap n’s), and the genome must have at least a ten-fold average depth of coverage. One step in the genome submission process involves filling in metadata that is required by GISAID. Because of the strict nature and limited fields of the GISAID metadata submission process, information on consensus generation, including the thresholds used to report nucleotides and SNPs, is included in the Assembly method field. We strongly encourage the inclusion of as much contextual data as possible and recommend following the guidelines outlined within the SPHERES consortium’s data standards and a related Public Health Alliance for Genomic Epidemiology (PHA4GE) effort. A similar, automated option for GenBank and SRA submission is also under development, and is envisioned as a future feature of the platform.

DATA ACCESS

All the data processed as part of the paper can be found in the SRA, GenBank and GISAID repositories. SRR ids and accession numbers from GISAID and GenBank are shown in Supplemental Table S1.

FUNDING

The development of EDGE COVID-19 and the EDGE platform is supported by the Los Alamos National Laboratory LDRD (20200732ER), DTRA (CB10152 and CB10623), and DOE (KP160101 and 4000150817).

ACKNOWLEDGEMENTS

The authors declare no conflict of interest. This research was supported by the DOE Office of Science through the National Virtual Biotechnology Laboratory, a consortium of DOE national laboratories focused on response to COVID-19, with funding provided by the Coronavirus CARES Act. Hosting of edgebioinformatics.org is provided by Cyverse, which is supported by the National Science Foundation under Award Numbers DBI-0735191, DBI-1265383, and DBI-1743442.

REFERENCES

Buels R, Yao E, Diesh CM, Hayes RD, Munoz-Torres M, Helt G, Goodstein DM, Elsik CG, Lewis SE, Stein L, et al. 2016. JBrowse: a dynamic web platform for genome visualization and analysis. Genome Biol: 66.

Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW. 2016. GenBank. Nucleic Acids Res: D67–72.

Cretu Stancu M, van Roosmalen MJ, Renkens I, Nieboer MM, Middelkamp S, de Ligt J, Pregno G, Giachino D, Mandrile G, Espejo Valle-Inclan J, et al. 2017. Mapping and phasing of structural variation in patient genomes using nanopore sequencing. Nat Commun: 1326.

Fauver JR, Petrone ME, Hodcroft EB, Shioda K, Ehrlich HY, Watts AG, Vogels CBF, Brito AF, Alpert T, Muyombwe A, et al. 2020. Coast-to-coast spread of SARS-CoV-2 in the United States revealed by genomic epidemiology. Public and Global Health.

Hadfield J, Megill C, Bell SM, Huddleston J, Potter B, Callender C, Sagulenko P, Bedford T, Neher RA. 2018. Nextstrain: real-time tracking of pathogen evolution. Bioinformatics: 4121–4123.

Itokawa K, Sekizuka T, Hashino M, Tanaka R. 2020. A proposal of alternative primers for the ARTIC Network’s multiplex PCR to improve coverage of SARS-CoV-2 genome sequencing. bioRxiv.

Jansen van Rensburg MJ, Swift C, Cody AJ, Jenkins C, Maiden MCJ. 2016. Exploiting Bacterial Whole-Genome Sequencing Data for Evaluation of Diagnostic Assays: Campylobacter Species Identification as a Case Study. J Clin Microbiol: 2882–2890.

Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL. 2004. Versatile and open software for comparing large genomes. Genome Biol: R12.

Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics: 1754–1760.

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, 1000 Genome Project Data Processing Subgroup. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics: 2078–2079.

Li P-E, Lo C-C, Anderson JJ, Davenport KW, Bishop-Lilly KA, Xu Y, Ahmed S, Feng S, Mokashi VP, Chain PSG. 2017. Enabling the democratization of the genomics revolution with a fully integrated web-based bioinformatics platform. Nucleic Acids Res: 67–80.

Lo C-C, Chain PSG. 2014. Rapid evaluation and quality control of next generation sequencing data with FaQCs. BMC Bioinformatics: 366.

Otto TD, Dillon GP, Degrave WS, Berriman M. 2011. RATT: Rapid Annotation Transfer Tool. Nucleic Acids Res: e57.

Paden CR, Tao Y, Queen K, Zhang J, Li Y, Uehara A, Tong S. 2020. Rapid, sensitive, full genome sequencing of Severe Acute Respiratory Syndrome Virus Coronavirus 2 (SARS-CoV-2). bioRxiv (Accessed June 7, 2020).

Petit RA 3rd, Read TD. 2018. Staphylococcus aureus viewed from the perspective of 40,000+ genomes. PeerJ: e5261.

Rang FJ, Kloosterman WP, de Ridder J. 2018. From squiggle to basepair: computational approaches for improving nanopore sequencing read accuracy. Genome Biol: 90.

Shu Y, McCauley J. 2017. GISAID: Global initiative on sharing all influenza data - from vision to reality. Euro Surveill. http://dx.doi.org/10.2807/1560-7917.ES.2017.22.13.30494.

Sichtig H, Minogue T, Yan Y, Stefan C, Hall A, Tallon L, Sadzewicz L, Nadendla S, Klimke W, Hatcher E, et al. 2019. FDA-ARGOS is a database with public quality-controlled reference genomes for diagnostic use and regulatory science. Nat Commun: 3313.

Wick RR. 2017. Porechop. GitHub. https://github.com/rrwick/Porechop.

Wu F, Zhao S, Yu B, Chen Y-M, Wang W, Song Z-G, Hu Y, Tao Z-W, Tian J-H, Pei Y-Y, et al. 2020. A new coronavirus associated with human respiratory disease in China. Nature: 265–269.