Genotyping coronavirus SARS-CoV-2: methods and implications

Abstract

Coronavirus disease 2019 (COVID-19), caused by the novel Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), has presented critical threats to global public health and the economy since it was identified in late December 2019 in China. The virus has gone through various pathways of evolution. Genotyping of virus isolates is of great importance for understanding the evolution and transmission of SARS-CoV-2. We present an accurate method for effectively genotyping SARS-CoV-2 viruses using complete genomes. The method employs multiple sequence alignments of the genome isolates with the SARS-CoV-2 reference genome. The SNP genotypes are then compared by Jaccard distances to track the relationships of virus isolates. The genotyping analysis of SARS-CoV-2 isolates from around the globe reveals that specific multiple mutations are the predominant mutation type during the current epidemic. Our method serves as a promising tool for monitoring and tracking the epidemic of pathogenic viruses through their gradual and local genetic variations. The genotyping analysis shows that the genes encoding the S protein, RNA polymerase, RNA primase, and nucleoprotein undergo frequent mutations. These mutations are critical for vaccine development in disease control.


Changchuan Yin ∗
Department of Mathematics, Statistics, and Computer Science
University of Illinois at Chicago
Chicago, IL 60607, USA

• We genotyped 558 SARS-CoV-2 isolates from the globe as of March 23, 2020.
• Frequent mutations in SARS-CoV-2 genomes are in the genes encoding the S protein, RNA polymerase, RNA primase, and nucleoprotein.
• We established a method for monitoring and tracing SARS-CoV-2 mutations.

∗ Corresponding author, [email protected]. Preprint, arXiv.org [q-bio.GN], March 25, 2020.

The novel coronavirus in humans, first discovered in Wuhan, China, in December 2019, was initially named 2019-nCoV and then designated SARS-CoV-2 due to its taxonomic and genomic relationships with the species Severe acute respiratory syndrome-related coronavirus (Gorbalenya et al., 2020). The present outbreak of the coronavirus-associated acute respiratory disease was named coronavirus disease 19 (COVID-19) by WHO. Since the start of the COVID-19 epidemic, more than 332,930 people from 147 countries and territories have been confirmed infected and more than 14,510 have died from the rapidly spreading SARS-CoV-2 virus as of March 23, 2020 (WHO, 2020).

Coronaviruses (CoVs) are a family of enveloped positive-strand RNA viruses infecting vertebrates, named for the crown-like spikes on their surface. Coronaviruses belong to the family Coronaviridae and the order Nidovirales. They are widespread in humans, other mammals, and birds, and can cause diseases of the respiratory, intestinal, liver, and nervous systems. Human coronaviruses (HCoVs) were first identified in the mid-1960s. The seven known HCoVs are CoV-229E (alpha coronavirus), CoV-NL63 (alpha coronavirus), CoV-OC43 (beta coronavirus), CoV-HKU1 (beta coronavirus), Severe acute respiratory syndrome coronavirus (SARS-CoV), Middle East respiratory syndrome coronavirus (MERS-CoV), and the current SARS-CoV-2. CoV-229E and CoV-OC43 were identified as causes of the common cold in adults in the mid-1960s. Disease manifestations associated with CoV-HKU1 and CoV-NL63 include the common cold and chronic pneumonia. CoV-HKU1 has been reported predominantly in children in the United States and is less common among adults. Three highly pathogenic coronaviruses, SARS-CoV, MERS-CoV, and SARS-CoV-2, which emerged in 2002, 2012, and 2019, respectively, have caused severe respiratory disease and thousands of deaths worldwide (Chen, 2020).

SARS-CoV-2 harbors a linear single-stranded positive-sense RNA genome. The genome consists of a leader sequence, ORF1ab encoding proteins for RNA replication, and genes for non-structural proteins (nsps) and structural proteins. The genomic leader sequence of about 265 bp is a unique characteristic of coronavirus replication and plays critical roles in coronavirus gene expression during discontinuous sub-genomic replication (Li et al., 2005). ORF1ab encodes replicase polyproteins required for viral RNA replication and transcription (Chen et al., 2020). Expression of the C-proximal portion of ORF1ab requires (–1) ribosomal frame-shifting. The first non-structural protein (nsp) encoded by ORF1ab is Papain-like proteinase (PL proteinase, nsp3). Nsp3 is an essential component, and the largest one, of the replication and transcription complex.
The PL proteinase in nsp3 cleaves nsps 1-3 and blocks the host innate immune response, promoting cytokine expression (Serrano et al., 2009; Lei et al., 2018). Nsp4, encoded in ORF1ab, is responsible for forming double-membrane vesicles (DMVs). The other nsps are the 3CLpro protease (3-chymotrypsin-like proteinase) and nsp6. 3CLpro is essential for RNA replication and is responsible for processing the C-terminus of nsp4 through nsp16 in all coronaviruses (Anand et al., 2003). Therefore, the conserved structure and catalytic sites of 3CLpro may serve as attractive targets for antiviral drugs (Kim et al., 2012). Together, nsp3, nsp4, and nsp6 can induce DMVs (Angelini et al., 2013).

SARS-coronavirus RNA replication is unique, involving two RNA-dependent RNA polymerases (RNA pol). The first RNA polymerase is the primer-dependent non-structural protein 12 (nsp12), and the second RNA polymerase is nsp8. In contrast to nsp12, nsp8 has primase capacity for de novo replication initiation without primers (Te Velthuis et al., 2012). Nsp7 and nsp8 are important in the replication and transcription of SARS-CoV-2. The SARS-coronavirus nsp7 and nsp8 complex is a multimeric RNA polymerase for both de novo initiation and primer extension (Prentice et al., 2004; Te Velthuis et al., 2012). Nsp8 also interacts with the ORF6 accessory protein. The nsp9 replicase protein of SARS-coronavirus binds RNA and interacts with nsp8 for its functions (Sutton et al., 2004).

Furthermore, the SARS-CoV-2 genome encodes four structural proteins. The structural proteins possess much higher immunogenicity for T cell responses than the non-structural proteins (Li et al., 2008). The structural proteins are involved in various viral processes, including virus particle formation.
The structural proteins include spike (S), envelope (E), membrane protein (M), and nucleoprotein (N), which are common to all coronaviruses (Marra et al., 2003; Ruan et al., 2003). The spike S protein is a glycoprotein with two domains, S1 and S2. Spike protein S1 attaches the virion to the cell membrane by interacting with the host receptor ACE2, initiating the infection (Wong et al., 2004). After internalization of the virus into the endosomes of the host cell, the S glycoprotein undergoes conformational changes. The S protein is then cleaved by cathepsin CTSL, which unmasks the fusion peptide of S2 and thereby activates membrane fusion within endosomes. Spike protein domain S2 mediates fusion of the virion and cellular membranes by acting as a class I viral fusion protein. Notably, the spike glycoprotein of SARS-CoV-2 contains a furin-like cleavage site (Coutard et al., 2020). The furin recognition site is important for proteolytic cleavage of the S protein and may therefore contribute to the zoonotic infection of the virus. The envelope (E) protein interacts with membrane protein M in the budding compartment of the host cell. The M protein holds dominant cellular immunogenicity (Liu et al., 2010). Nucleoprotein (ORF9a) packages the positive-strand viral RNA genome into a helical ribonucleocapsid (RNP) during virion assembly through its interactions with the viral genome and membrane protein M (He et al., 2004). Nucleoprotein plays an important role in enhancing the efficiency of subgenomic viral RNA transcription as well as viral replication.

Increasing epidemiological and clinical evidence suggests that SARS-CoV-2 has stronger transmission power than SARS-CoV and lower pathogenicity (Guan et al., 2020). However, the mechanism of the high transmission of SARS-CoV-2 is unclear.
DNA sequence comparisons using single nucleotide polymorphisms (SNPs) are often used for evolutionary studies and can be especially beneficial in recognizing mutated coronavirus genomes, where high mutation rates can occur due to an error-prone RNA-dependent RNA polymerase in genome replication.

To understand the evolution of SARS-CoV-2 in the context of genome mutations, we establish an SNP genotyping method and investigate the genotype changes during the transmission of SARS-CoV-2. Our results show that the genotypes of the virus are not uniformly distributed among the complete genomes of SARS-CoV-2. This genotyping study reveals a few highly frequent mutations in the SARS-CoV-2 genomes. These highly frequent SNP mutations might be associated with changes in the transmissibility and virulence of the virus. The mutations are located in the S protein, RNA polymerase, RNA primase, and nucleoprotein, which are fundamental proteins for vaccine efficacy. Therefore, the high-frequency SNP mutations are important factors when developing vaccines to prevent SARS-CoV-2 infection.

A total of 558 complete genome sequences of SARS-CoV-2 strains from infected individuals are retrieved from the GISAID database (Shu and McCauley, 2017) as of March 23, 2020. Only high-coverage complete genomes are included in the dataset. The countries and territories that are affected by SARS-CoV-2 and have shared complete genomes of SARS-CoV-2 are Australia (AU), Belgium (BE), Brazil (BR), Canada (CA), Chile (CL), China (CN), Czech Republic (CZ), Denmark (DK), England (UK), Finland (FI), France (FR), Georgia (GE), Germany (DE), Hong Kong (HK), Hungary (HU), India (IN), Ireland (IE), Italy (IT), Japan (JP), Korea (KR), Kuwait (KW), Mexico (MX), Netherlands (NL), New Zealand (NZ), Scotland (UK), Singapore (SG), Switzerland (CH), Sweden (SE), Taiwan (TW), Thailand (TH), United Kingdom (UK), United States (US), and Vietnam (VN). The complete genome sequences are aligned with the reference genome of SARS-CoV-2 by the MSA tool Clustal Omega using the default parameters (Sievers and Higgins, 2014). The aligned genomes are then re-positioned according to the reference SARS-CoV-2 genome (GenBank accession number: NC_045512.2).

The SNP mutations of a genome, including the nucleotide changes and their corresponding positions, are called an SNP profile. The SNP profiles of SARS-CoV-2 isolates are retrieved and parsed from the aligned genomes according to the SARS-CoV-2 reference genome. The SNP profile of the complete genome of a virus can be considered the genotype of the virus.
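The parsing step can be illustrated with a minimal sketch. It assumes that the reference and an isolate are available as two rows of the same alignment (equal-length strings with `-` for gaps) and that a SNP is written as position, reference base, `>`, isolate base, as in `241C>T`. The function name `snp_profile` and the choice to skip gaps and ambiguous bases are assumptions made for illustration, not a description of the programs used in this study.

```python
def snp_profile(ref_aligned, query_aligned):
    """Return the SNP profile of one isolate as a set of strings such as '241C>T'.

    Positions are 1-based coordinates on the ungapped reference genome.
    Gaps and ambiguous bases are skipped (an assumption of this sketch).
    """
    if len(ref_aligned) != len(query_aligned):
        raise ValueError("aligned sequences must have equal length")
    valid = set("ACGT")
    snps = set()
    ref_pos = 0  # position on the reference, ignoring alignment gaps
    for r, q in zip(ref_aligned.upper(), query_aligned.upper()):
        if r == "-":
            continue  # insertion relative to the reference: no reference coordinate
        ref_pos += 1
        if r in valid and q in valid and r != q:
            snps.add(f"{ref_pos}{r}>{q}")
    return snps


# Toy alignment rows, not real SARS-CoV-2 sequence.
print(snp_profile("AC-GTA", "ATTG-A"))  # {'2C>T'}
```

Counting positions only on non-gap reference columns is what keeps the coordinates comparable across isolates that were aligned in different batches.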

The Jaccard similarity coefficient $J(A, B)$ of two sets $A$ and $B$ is defined as the intersection size of the two sets divided by the union size of the two sets (Equation (1)) (Levandowsky and Winter, 1971).

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \tag{1}$$

The Jaccard distance is a metric on the collection of finite sets. The Jaccard distance $d_J(A, B)$ of two sets $A$ and $B$ is the difference between 100% and the Jaccard similarity coefficient (Equation (2)).

$$d_J(A, B) = 1 - J(A, B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|} \tag{2}$$

The Jaccard distance measure of SNP variants takes account of the ordering of SNP mutations. Therefore, the genetic distance of two genomes corresponds to the Jaccard distance of their SNP variants. The Jaccard distance of SNP variants was adopted in the phylogenetic analysis of human or bacterial genomes (Comas et al., 2009; Yu et al., 2017; Yin and Yau, 2019). In this study, we use the Jaccard distance of the SNP mutations of virus genomes to measure the dissimilarity of virus isolates.

Because a mutation is rarely reversed, a virus accumulates more SNPs over time. Let $A$ and $B$ represent two SNP sets of the virus. If $A$ is a proper subset of $B$, i.e., $A \subset B$, $A \neq B$, then $B$ can be considered one of the descendants of $A$, and $A$ can be considered an ancestor of $B$. To this end, we propose the directed Jaccard distance $D_J(A, B)$ of two SNP sets $A$ and $B$ as the measure of their mutual relationship (Equation (3)). If $B$ is a descendant of $A$, then $D_J(A, B)$ is positive; conversely, if $A$ is a descendant of $B$, $D_J(A, B)$ is negative. Among all the descendants of an SNP set $A$, the closest descendant is the one having the minimum $D_J(A, B)$.
$$D_J(A, B) = \operatorname{sgn}\,(1 - J(A, B)) = \begin{cases} \dfrac{|A \cup B| - |A \cap B|}{|A \cup B|}, & \text{if } A \cap B \cong A \\[1ex] \dfrac{|A \cap B| - |A \cup B|}{|A \cup B|}, & \text{if } A \cap B \cong B \end{cases} \tag{3}$$

For two SNP sets $A$ and $B$, if $A \cap B \neq \emptyset$, $A \not\subset B$, and $B \not\subset A$, then the two viruses are relatives, sharing common SNP mutations. If two SNP sets are neither descendant-ancestor nor relatives, the corresponding two viruses are isolated mutants. Hence, the relevance of virus isolates can be identified from the directed Jaccard measure on the SNP genotypes.

Although the sources of the SARS-CoV-2 samples vary, we assume that the virus samples were randomly collected for sequencing. If a virus strain among all sequenced viruses has many descendants in the genome set, we infer that this strain has high transmissibility. Therefore, the SNP mutations in this strain are critical for increased transmissibility.

We calculate the directed Jaccard distances of the SNP mutations to identify the relationships of virus strains and thereby determine the virus transmission pattern. The pipeline for SNP genotyping and analysis is described in Algorithm 1.

Input:

The complete genomes of SARS-CoV-2 strains

Output:

SNP genotypes of SARS-CoV-2 strains

Step:

1. Divide the complete genomes of SARS-CoV-2 strains into subsets based on the originating territories.
2. Add the reference genome of SARS-CoV-2 to each subset of the complete genomes.
3. Perform multiple sequence alignments for the genomes of each subset using Clustal Omega.
4. Convert the alignment files to SNP profiles using the reference genome of SARS-CoV-2.
5. Merge the SNP profiles of all virus genomes.
6. Calculate the pairwise directed Jaccard distances of all the SNP profiles.
7. Analyze the descendants, ancestors, and relative relationships of each SNP genotype from the Jaccard distances.

Algorithm 1: SNP genotyping analysis of SARS-CoV-2.
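Steps 6 and 7 rest on the directed Jaccard distance of Equation (3) and on the descendant, relative, and isolated classes defined above. The following is a minimal sketch of both, assuming SNP profiles are Python sets of strings. The function names, the `None` return value for pairs that are not nested, and the extra "identical" label are assumptions of this sketch, not part of the method as published.

```python
def jaccard(a, b):
    """Jaccard similarity coefficient of two SNP sets (Equation (1))."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty profiles are treated as identical
    return len(a & b) / len(union)


def directed_jaccard(a, b):
    """Directed Jaccard distance (Equation (3)).

    Positive if a is a proper subset of b (b descends from a),
    negative if b is a proper subset of a, None otherwise (assumption).
    """
    a, b = set(a), set(b)
    d = 1.0 - jaccard(a, b)
    if a < b:
        return d
    if b < a:
        return -d
    return None


def relationship(a, b):
    """Classify the relationship of SNP set b with respect to SNP set a."""
    a, b = set(a), set(b)
    if a == b:
        return "identical"  # label added for completeness in this sketch
    if a < b:
        return "descendant"  # b is a descendant of a
    if b < a:
        return "ancestor"  # b is an ancestor of a
    if a & b:
        return "relatives"  # shared SNPs, but neither set contains the other
    return "isolated"


A = {"241C>T", "3037C>T", "23403A>G"}
B = A | {"14408C>T"}
print(directed_jaccard(A, B), relationship(A, B))  # 0.25 descendant
```

In this example the two sets share 3 of 4 SNPs, so $J = 3/4$ and the directed distance is $+0.25$ from $A$ to $B$ and $-0.25$ in the opposite direction.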

The genomic analytics is performed using computer programs written in Python with the Biopython library (Cock et al., 2009). The computer programs and the updated SNP profiles of SARS-CoV-2 isolates are available upon request.

We retrieve the SNP genotypes of 442 SARS-CoV-2 strains from around the globe in the GISAID database. To investigate the SNP distributions among all the virus isolates, we plot the SNP profiles of all the virus isolates and compare the frequency of each SNP mutation in the virus sets. The results show large mutation diversity in these virus isolates.

The mutation frequency analysis indicates that, although mutations arise because the RNA-dependent RNA polymerase (RdRp) of RNA viruses lacks proofreading, the mutations are not equally distributed. The SNP mutations can be single mutations or multiple mutations at a few fixed positions. The impacts and roles that these SNP mutations have on the pathogenicity and transmission ability of SARS-CoV-2 remain to be determined by biochemical experiments. These diverse mutations might impact both the transmissibility and pathogenicity of SARS-CoV-2.

The first common SNP mutation in the SARS-CoV-2 genome is in the leader sequence (241C>T), an important genomic site for discontinuous sub-genomic replication. The leader sequence mutation 241C>T co-evolved with three important mutations, 3037C>T, 14408C>T, and 23403A>G, which result in amino acid mutations in nsp3 (synonymous mutation), RNA polymerase (nsp12, P323L; see Table 1), and spike glycoprotein (S protein, D614G), respectively. Three of these co-mutations (241C>T, 14408C>T, and 23403A>G) are in elements critical for RNA replication (241C>T, 14408C>T) and in the S protein (23403A>G), which binds to the ACE2 receptor. We observe that these four co-mutations are prevalent in the virus isolates from Europe, where COVID-19 infections by SARS-CoV-2 are generally more severe than in other geographical regions. Combined, these four co-mutations may confer increased transmissibility of the virus.

SARS-coronavirus RNA replication is unique, involving two RNA-dependent RNA polymerases (RdRp). The first RNA polymerase is the primer-dependent non-structural protein 12 (nsp12), whereas the second RNA polymerase is nsp8.
Nsp8 has the primase capacity for de novo initiation of RNA replication without primers (Te Velthuis et al., 2012). The most abundant SNP mutation in SARS-CoV-2 isolates is 28144T>C in the nsp8 protein, in which the amino acid leucine (L) is mutated to serine (S). Our result is consistent with a previous study on 103 SARS-CoV-2 genomes in which the SARS-CoV-2 virus is classified into S and L types by the two co-mutations 8782C>T and 28144T>C (Zhang et al., 2020).

The third abundant SNP mutation is 26144G>T in nonstructural protein 3 (nsp3: G251V). The protein nsp3 works with nsp4 and nsp6 to induce double-membrane vesicles (DMVs), a membrane complex that acts as a platform for RNA replication and assembly (Angelini et al., 2013).

The significant SNP mutation 23403A>G is located in the gene encoding the spike glycoprotein (S protein: D614G). The S protein of the SARS-CoV-2 virus is an important determinant of the host range and pathogenicity. The S protein attaches the virion to the cell membrane by binding to the host ACE2 receptor (Xiao et al., 2003). The mutation D614G is located in the putative S1–S2 junction region near the furin recognition site (R667) for the cleavage of the S protein when the virion enters or exits cells (Follis et al., 2006). However, the actual functional impact of this high-frequency SNP mutation (23403A>G) in the S protein (D614G) is unclear. The affinity of the mutated S protein (D614G) for the ACE2 receptor should be further determined by biochemical experiments.

Notably, the SNP analysis also shows that the primer-independent RNA primase (nsp8) contains more mutations than any other protein (28144T>C, 28881G>A, 28882G>A, and 28883G>C). RNA polymerase and primase mutations may confer resistance to mutagenic nucleotide analogs via increased fidelity. A previous study indicated that a single mutation in RNA polymerase can improve the replication fidelity of an RNA virus (Pfeiffer and Kirkegaard, 2003).
If a mutation is lethal or reduces the transmission ability, the mutation may not be carried on and may die out. The SNP profiles demonstrate that the mutations in the envelope glycoprotein and RNA polymerases predominate. Only mutations in the S protein that bind strongly to cell ACE2 receptors while escaping the immune system response have a chance to survive. Therefore, these critical mutations are the results of natural selection in virus evolution.

In the SARS-CoV-2 strains found in the US, the nucleocapsid (N) protein gene has three mutations (28881G>A, 28882G>A, and 28883G>C). The N protein of SARS-CoV is responsible for the formation of the helical nucleocapsid during virion assembly. The N protein may cause an immune response and has potential value in vaccine development (Zhao et al., 2005). These mutations should be considered when developing a vaccine using the N protein.

Figure 1: Distribution of SNP mutations of SARS-CoV-2 isolates from the globe. (a) The SNP profiles of mutations in 442 SARS-CoV-2 isolates. (b) Frequencies of the single SNP mutations on the genome. The nucleotide positions are on the reference genome of SARS-CoV-2.

Table 1: High-frequency single SNP genotypes in SARS-CoV-2.

| SNP mutation | protein mutation | frequency |
|---|---|---|
| 241C>T | leader sequence | 178 |
| 3037C>T | synonymous mutation (nsp3, F105F) | 182 |
| 8782C>T | synonymous mutation (nsp4, S75S) | 138 |
| 11083G>T | nsp6, L37F | 115 |
| 14408C>T | RNA pol (nsp12, P323L) | 182 |
| 17747C>T | helicase, P504L | 55 |
| 17858A>G | helicase, Y541C | 55 |
| 18060C>T | synonymous mutation (3'-to-5' exonuclease, L6L) | 62 |
| 23403A>G | spike glycoprotein (S protein), D614G | 183 |
| 26144G>T | ORF3a, G251V | 49 |
| 27046C>T | membrane glycoprotein, T175M | 33 |
| 28144T>C | RNA primase (nsp8, L84S) | 140 |
| 28881G>A | nucleocapsid phosphoprotein (R203K) | 74 |
| 28882G>A | nucleocapsid phosphoprotein (R202R) | 74 |
| 28883G>C | nucleocapsid phosphoprotein (G204R) | 74 |

Note: The SNP mutation positions are on the reference genome. Nucleotide T represents nucleotide U in the SARS-CoV-2 RNA virus genome. The frequencies of mutations are computed from a total of 558 SARS-CoV-2 strains.
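A frequency column of this kind can be obtained by counting, for each SNP, the number of isolates whose profile contains it. The following is a minimal sketch under that assumption; the function name `snp_frequencies` and the isolate labels are hypothetical, and the toy profiles are not data from this study.

```python
from collections import Counter


def snp_frequencies(profiles):
    """Count how many isolates carry each SNP.

    profiles maps an isolate label to its SNP profile (an iterable of strings).
    Each SNP is counted at most once per isolate.
    """
    counts = Counter()
    for snps in profiles.values():
        counts.update(set(snps))
    return counts


# Hypothetical isolates for illustration only.
toy_profiles = {
    "iso1": {"241C>T", "23403A>G"},
    "iso2": {"241C>T"},
    "iso3": {"8782C>T"},
}
print(snp_frequencies(toy_profiles).most_common(1))  # [('241C>T', 2)]
```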

Table 2: Co-mutations with high descendants in SARS-CoV-2.

| SNP co-mutations | proteins | descendants |
|---|---|---|
| 8782C>T, 28144T>C, 18060C>T | RNA pol (nsp8) | 54 |
| 241C>T, 3037C>T, 23403A>G, 28144T>C | S protein, RNA pol (nsp8) | 82 |
| 241C>T, 3037C>T, 14408C>T, 23403A>G | RNA primase (nsp12), S protein | 81 |

Note: The SNP mutation positions are on the reference genome. Nucleotide T represents nucleotide U in the SARS-CoV-2 RNA genome. The frequencies of mutations are computed from a total of 558 SARS-CoV-2 strains.
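Under the definition given earlier, an isolate is a descendant of a genotype when the genotype's SNP set is a proper subset of the isolate's SNP set. A descendant count per genotype can then be sketched as below; `count_descendants` and the toy profiles are hypothetical names and data for illustration, and the quadratic pairwise comparison is a simplification rather than the implementation used in this study.

```python
def count_descendants(profiles):
    """Map each distinct genotype to the number of isolates descending from it.

    profiles maps an isolate label to its SNP profile (an iterable of strings).
    An isolate descends from a genotype when the genotype is a proper subset
    of the isolate's SNP set.
    """
    isolate_sets = [frozenset(p) for p in profiles.values()]
    genotypes = set(isolate_sets)
    return {g: sum(1 for s in isolate_sets if g < s) for g in genotypes}


# Hypothetical isolates for illustration only.
toy_profiles = {
    "iso1": {"8782C>T", "28144T>C"},
    "iso2": {"8782C>T", "28144T>C", "18060C>T"},
    "iso3": {"8782C>T", "28144T>C", "11083G>T"},
    "iso4": {"26144G>T"},
}
descendants = count_descendants(toy_profiles)
print(descendants[frozenset({"8782C>T", "28144T>C"})])  # 2
```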

To spread, a pathogenic virus must multiply within the host to ensure transmission, while simultaneously avoiding host morbidity or death. Therefore, during the evolution of a virus, its transmissibility usually increases, whereas its pathogenicity decreases (Alizon et al., 2009). In the SNP profiles of SARS-CoV-2 strains, high-frequency mutations predominate in the virus isolates; therefore, these high-frequency mutations probably contribute to increased transmissibility. In addition, these high-frequency mutations are associated with different critical proteins. We analyze and trace the SNP profiles from 442 SARS-CoV-2 strains that have at least 10 descendants. The results show that the SNP distribution is not random: mutations are concentrated at certain positions, and the strains carrying them have more descendants. These high-frequency mutations may confer high transmissibility on the virus (Table 2). Excluding the leader sequence mutation and the synonymous mutations (3037C>T, 8782C>T, 18060C>T), we classify the SNP mutations into four major groups based on the impacted proteins (Fig. 2): (1) single mutation in nsp6 (11083G>T) (Fig. 2(a)), (2) single mutation in ORF3a (26144G>T) (Fig. 2(b)), (3) single mutation in RNA polymerase (nsp8) (8782C>T, 28144T>C) (Fig. 2(c)), and (4) double mutations in the S protein and RNA polymerase (241C>T, 3037C>T, 14408C>T, 23403A>G) (Fig. 2(d)). The strains in one group are derived from the same ancestor strain in that group according to their SNP profiles.

The result shows that most SNP mutations in SARS-CoV-2 isolates from China, and some from Europe and the USA, are located at two positions (8782C>T, 28144T>C) (Fig. 2(c)). Later, this strain mutated at a new position (8782C>T, 28144T>C, 18060C>T). These mutations are from the early phase of the strain.

Figure 2: The SNP profiles of four major genotypes. (a) Genotype I (11083G>T), (b) Genotype II (26144G>T), (c) Genotype III (8782C>T, 28144T>C), (d) Genotype IV (241C>T, 3037C>T, 14408C>T, 23403A>G). The strains in a genotype group originate from the same ancestor. The strains from the same region are marked in the same color.

The important and prevalent co-mutations (241C>T, 3037C>T, 23403A>G) occurred mostly in SARS-CoV-2 isolates from European countries. This strain then acquired an additional mutation, giving (241C>T, 3037C>T, 14408C>T, 23403A>G) (Fig. 2(d)). The impacted critical proteins are RNA pol (nsp8), RNA primase (nsp12), and the S protein. Most of these strains are found in European countries (Fig. 2(d)). Italy has been heavily affected by SARS-CoV-2, with 59,138 confirmed cases and 5,476 deaths as of March 23, 2020 (WHO, 2020). These critical mutations may be correlated with the severe infections in Europe.

From the SNP profiles of viruses collected across the globe at different times, we may estimate that one mutation can occur in one generation. For example, in two consecutive infection cases in the USA (IL) (US|IL1|EPI_ISL_404253|2020-01-21, US|IL2|EPI_ISL_410045|2020-01-28), the virus gained one mutation (28854C>Y) between two members of the same community. Over the length of its 30 kb genome, SARS-CoV-2 may accumulate from a single mutation to 14 mutations (NL|EPI_ISL_413591|2020-03-02), as seen from December 2019 to March 23, 2020. Therefore, we may estimate that the transmission of SARS-CoV-2 has reached 14 generations since its first infection of humans in December 2019.

Besides the SNP mutations, we also observed a few deletion or insertion mutations in SARS-CoV-2 isolates. These deletion and insertion mutations do not happen often, and whether they can spread is unknown from the limited genome data.

Our study has a few notable limitations due to the nature of the genome data. Because the sample collection dates may not reflect the actual infection dates, the transmission path analysis is only approximate. Caution should be exercised in interpreting the genotyping analytics: some countries and regions have not sequenced enough virus samples, so the frequencies of the genotype groups may be unbalanced. Whether any of these common SNP mutations will result in biological and clinical differences remains to be determined.

In this study, we use the complete genomes of SARS-CoV-2 for SNP genotype calling. However, in an emergency, the complete genomes may not be available for SNP genotyping. In this case, the SNP variant calling process may directly use the raw NGS reads (Yin and Yau, 2019). The SNP variants can then be obtained by mapping the NGS reads to the reference genome with BWA alignments (Li, 2013), followed by GATK variant calling (McKenna et al., 2010).

The SARS-CoV-2 epidemic has caused a substantial health emergency and economic stress in the world. Therefore, understanding the nature of this virus and deriving methods to monitor the spread of the virus in the epidemic are critical in disease control. Our results show several molecular facets of SARS-CoV-2 pertinent to this epidemic. The discovery of genotypes linked to geographic and temporal clusters of infections suggests that genome SNP signatures can be used to track and monitor the epidemic.

Rapid detection of different genotypes of SARS-CoV-2 is important for an efficient response to the COVID-19 outbreak. Discriminating and relating viral isolates can be useful in genetic epidemiology. Determining the origin and monitoring the transmission pattern of the pathogenic agents are critical to controlling the outbreak. In this work, the SNP genotyping of SARS-CoV-2 was developed by adapting fast MSA of the complete genomes of SARS-CoV-2 and SNP analytics using the directed Jaccard distance of the SNP profiles. The genotyping analysis provides insights into the frequent mutations that may confer fast transmissibility of the virus. The major mutations are in the critical proteins, including the S protein, RNA polymerase, RNA primase, and nucleoprotein. Therefore, these high-frequency SNP mutation sites must be considered when designing a vaccine to prevent SARS-CoV-2 infection.

• COVID-19: coronavirus disease 2019
• DMV: double-membrane vesicle
• GATK: the genome analysis toolkit
• MSA: multiple sequence alignment
• NGS: next generation sequencing
• SARS: severe acute respiratory syndrome
• SARS-CoV-2: severe acute respiratory syndrome coronavirus 2
• SNP: single nucleotide polymorphism
• WHO: the World Health Organization