Host immune response driving SARS-CoV-2 evolution
Rui Wang, Yuta Hozumi, Yong-Hui Zheng, Changchuan Yin, Guo-Wei Wei
Rui Wang, Yuta Hozumi, Yong-Hui Zheng, Changchuan Yin*, and Guo-Wei Wei†
Department of Mathematics, Michigan State University, MI 48824, USA
Department of Microbiology and Molecular Genetics, Michigan State University, MI 48824, USA
Department of Mathematics, Statistics, and Computer Science, University of Illinois at Chicago, Chicago, IL 60607, USA
Department of Biochemistry and Molecular Biology, Michigan State University, MI 48824, USA
Department of Electrical and Computer Engineering, Michigan State University, MI 48824, USA
Abstract
The transmission and evolution of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) are of paramount importance to controlling and combating the coronavirus disease 2019 (COVID-19) pandemic. Currently, nearly 15,000 SARS-CoV-2 single mutations have been recorded, with great ramifications for the development of diagnostics, vaccines, antibody therapies, and drugs. However, little is known about SARS-CoV-2 evolutionary characteristics and general trends. In this work, we present a comprehensive genotyping analysis of existing SARS-CoV-2 mutations. We reveal that host immune response via APOBEC and ADAR gene editing gives rise to nearly 65% of recorded mutations. Additionally, we show that children under age five and the elderly may be at high risk from COVID-19 because of their overreaction to the viral infection. Moreover, we uncover that populations of Oceania and Africa react significantly more intensively to SARS-CoV-2 infection than those of Europe and Asia, which may explain why African Americans were shown to be at increased risk of dying from COVID-19, in addition to their high risk of getting sick from COVID-19 caused by systemic health and social inequities. Finally, our study indicates that for two viral genome sequences of the same origin, their evolution order may be determined from the ratio of mutation type C > T over T > C.
* Address correspondence to Changchuan Yin. E-mail: [email protected]
† Address correspondence to Guo-Wei Wei. E-mail: [email protected]
Introduction
The ongoing outbreak of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has led to tremendous human mortality and economic hardship. As of July 31, 2020, over 17,106,007 confirmed COVID-19 cases and 668,910 deaths from the disease had been reported worldwide [1]. To mitigate this devastating pandemic, we have to control its spread by sufficient testing, social distancing, contact tracing, and developing effective diagnosis tools, efficacious antiviral drugs, antibody therapies, and preventive vaccines.
SARS-CoV-2 is a positive-sense single-strand RNA virus that belongs to the beta coronavirus genus [2]. It has a genome size of 29.82 kb, which encodes multiple non-structural and structural proteins. The leader sequence and ORF1ab encode non-structural proteins for RNA replication and transcription. The downstream regions of the genome encode structural proteins, including the spike (S) protein, the nucleocapsid (N) protein, the envelope (E) protein, and the membrane (M) protein. All four major structural proteins are required to produce a structurally complete viral particle. The S protein mediates viral attachment to the host angiotensin-converting enzyme 2 (ACE2) receptor and the subsequent fusion between the viral and host cell membranes, aided by transmembrane serine protease 2 (TMPRSS2), to allow the entry of viruses into the host cell [3–5]. The N protein, one of the most abundant viral proteins, binds to the RNA genome and is involved in replication processes, assembly, and host cellular response during viral infection [6].
Mutagenesis is a basic biological process that changes the genetic information of organisms. As a primary source of many kinds of cancer and heritable diseases, mutagenesis may be feared, but it is a driving force of natural evolution [7,8]. Although viruses are not organisms per se, they are at the edge of life.
Our SARS-CoV-2 Mutation Tracker (https://users.math.msu.edu/users/weig/SARS-CoV-2 Mutation Tracker.html) shows that nearly 15,000 mutations have occurred on SARS-CoV-2 [9]. More than 1000 mutations on the S protein gene have a significant impact on SARS-CoV-2 infectivity [10–12]. These mutations should be put into the perspective that COVID-19 has spread globally. The geographical and demographic diversity of the viral transmission and the exposure to exogenous and endogenous genotoxins have stimulated SARS-CoV-2 mutations. If we consider the average number of mutations per genome, SARS-CoV-2 is mutating more slowly than other viruses, such as the flu and common cold viruses. This is because SARS-CoV-2 belongs to the Coronaviridae family and the Nidovirales order, which has a genetic proofreading mechanism in its replication achieved by an enzyme called non-structural protein 14 (NSP14) in synergy with NSP12, i.e., RNA-dependent RNA polymerase (RdRp) [13, 14]. As a result, SARS-CoV-2 has a relatively high fidelity in its transcription and replication process. In general, coronavirus mutations arise from three major sources, namely, random errors in replication, such as genetic drift and spontaneous genotoxins; viral replication proofreading and defective repair mechanisms; and host immune responses, such as destructive gene editing [11, 15].
Genotyping tracks mutations over population, space, and time, while also providing a method to understand the molecular mechanism of SARS-CoV-2 proteins, protein-protein interactions, and their synergy with host cell proteins, enzymes, and signaling pathways.
The studies of SARS-CoV-2 genomes have so far predominantly focused on understanding genome mutation variants, implications in virus transmissions [16, 17], and ramifications on the development of diagnostics [9, 18], vaccines [19], antibodies [20], and drugs [19].
Although it is difficult to determine the detailed mechanism of every specific mutation, early work on a few initial SARS-CoV-2 strains in Wuhan, China, revealed that hypermutations C > T most likely result from APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like) deamination in RNA editing [21]. In the standard genetic code, all three stop codons, TAA, TAG, and TGA, involve T but not C. Therefore, the gene-editing-imposed C > T mutations have a high possibility of terminating the translation of viral proteins, which undermines viral functions and survivability. Both spontaneous C > T transitions and APOBEC deamination are regarded as genotoxins and can lead to cancers in humans.
There are two well-known deaminase RNA editing mechanisms in human cells: the APOBEC [22] and the ADAR (adenosine deaminases acting on RNA) [23]. The APOBEC enzymes deaminate cytosines into uracils (C > U) on single-stranded nucleic acids (ssDNA or ssRNA). It is well established that the human genome encodes activation-induced cytidine deaminases (AIDs) and several homologous APOBEC cytidine deaminases that function in innate immunity as well as in RNA editing [24, 25]. In both innate and adaptive immunity, AID and APOBEC cytidine deaminases modulate immune responses by mutating specific nucleic acid sequences of hosts and pathogens. The ADAR enzymes deaminate adenines into inosines (A-to-I) and result in A > G mutations.
The significance of A-to-I editing is appreciated for its abundance in both host and viral RNAs. ADAR enzymes play important roles during viral infections. They can have either a proviral or an antiviral consequence, depending on the virus-host combination [26, 27].
The APOBEC family proteins play critical functional roles within the adaptive and innate immune system and act at early times after the infection [28]. Therefore, a higher ratio of C > T mutations may indicate a strong capacity of the host immune system. However, a strong immune response is a double-edged sword. On the one hand, it may help host cells to defeat the virus more efficiently. On the other hand, it can result in a “cytokine storm”, which is a key cause of death in COVID-19 patients through the exponential growth of inflammation and organ damage [29].
In this work, we analyze a large volume of single nucleotide polymorphisms (SNPs) found in 33693 complete SARS-CoV-2 genome isolates globally. By analyzing the distribution of 12 SNP types, we notice that the ratio of C > T mutations is markedly higher than that of the other types of mutations, indicating that hypermutation C > T may result from extensive host RNA editing, i.e., the APOBEC deamination. Additionally, we investigate the distribution of 12 SNP types in different age groups, gender groups, and geographic locations to understand whether these hypermutations have an age/gender/demographic preference. Moreover, we provide deep insights into the mutation motif and hot-spot patterns from 13833 single mutations decoded from 33693 complete SARS-CoV-2 genome sequences, revealing mutational signatures and preferred genetic environments. Finally, we hypothesize that virus genomes evolve through host innate immune response imposed gene editing, i.e., C > T, and virus protective mechanism-installed defective revisionary mutations, T > C.
As a result, both C > T and T > C mutation ratios are usually high. We show that the ratio of C > T to T > C mutations is higher than unity in the forward viral evolution, which suggests a master and slave relationship between host gene editing and the virus protective mechanism. Therefore, we propose the use of the C > T to T > C ratio being higher than unity ( > 1) as the indication of the forward viral evolution direction.
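The introduction observes that all three stop codons contain T but not C, so an imposed C > T change can terminate translation. As a minimal sketch of that argument (considering only the coding strand and the standard genetic code; the function name is illustrative and not part of the study's code), the following enumerates every codon that a single C > T substitution turns into a stop codon:

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}


def c_to_t_stop_gains():
    """Return (codon, position, stop) triples where one C > T change creates a stop codon."""
    hits = []
    for codon in map("".join, product("ACGT", repeat=3)):
        if codon in STOP_CODONS:
            continue  # already a stop codon
        for pos, base in enumerate(codon):
            if base != "C":
                continue
            mutated = codon[:pos] + "T" + codon[pos + 1:]
            if mutated in STOP_CODONS:
                hits.append((codon, pos, mutated))
    return hits


if __name__ == "__main__":
    for codon, pos, stop in c_to_t_stop_gains():
        print(f"{codon} -> {stop} (C > T at codon position {pos + 1})")
```

Under these assumptions only three sense codons (CAA, CAG, and CGA) can be converted to a stop codon by a single C > T change, each at the first codon position.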
To reveal that C > T and A > G mutations are driven by RNA-APOBEC and RNA-ADAR editing, we first analyze 33693 complete SARS-CoV-2 genome sequences, in which a total of 13833 single mutations are found as of July 31, 2020. Note that these 13833 single mutations are unique mutations, i.e., the same mutation that appears in different SARS-CoV-2 isolates is only counted once. If we count the same mutation in different SARS-CoV-2 isolates repeatedly according to its frequency, then all of the mutations that are detected in the 33693 complete SARS-CoV-2 genome sequences are called non-unique mutations. With the reference sequence of the SARS-CoV-2 genome collected on January 5, 2020 [2], we calculate the proportion of 12 SNP types (i.e., A > T, A > C, A > G, T > A, T > C, T > G, C > T, C > A, C > G, G > T, G > C, G > A) worldwide. The unusually high ratios of C > T and A > G mutations indicate that RNA-APOBEC editing and RNA-ADAR editing are involved in the host immune response to SARS-CoV-2 infection. Additionally, to understand gene-editing preference, we investigate the distribution of 12 SNP types of mutations in different countries/regions, age groups, and gender groups. Furthermore, we decode mutation motifs from the 2-mer and 3-mer sequence contexts to survey the hot-spot patterns and mutational signatures driven by gene editing. Moreover, we analyze the proportion of 12 SNP types among SARS-CoV, Bat-SL-BM48-31, Bat-SL-CoVZC45, Bat-SL-RaTG13, and SARS-CoV-2. We discover that the viral evolution order can be determined by the ratios of C > T/T > C. These results are presented in the following subsections.
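The distinction between unique and non-unique mutations can be made concrete with a small sketch. It assumes that each isolate has already been aligned to the reference so that both strings have equal length; the function names and the toy sequences are illustrative only and do not come from the study's code.

```python
from collections import Counter

BASES = set("ACGT")


def find_snps(reference, isolate):
    """Return (position, ref_base, alt_base) tuples for two aligned sequences.

    Positions are 0-based alignment columns; gaps and ambiguous bases are skipped.
    """
    if len(reference) != len(isolate):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(reference.upper(), isolate.upper()))
        if ref != alt and ref in BASES and alt in BASES
    ]


def snp_type_ratios(reference, isolates, unique=True):
    """Return {SNP type: share of all counted SNPs}.

    unique=True counts each (position, ref, alt) once across all isolates;
    unique=False counts it once per isolate that carries it.
    """
    counts = Counter()
    seen = set()
    for isolate in isolates:
        for snp in find_snps(reference, isolate):
            if unique and snp in seen:
                continue
            seen.add(snp)
            _, ref, alt = snp
            counts[f"{ref}>{alt}"] += 1
    total = sum(counts.values())
    return {snp_type: n / total for snp_type, n in counts.items()} if total else {}


if __name__ == "__main__":
    reference = "ACGTA"                      # toy reference, not real data
    isolates = ["ATGTA", "ATGTA", "ATGTG"]   # toy aligned isolates
    print(snp_type_ratios(reference, isolates, unique=True))
    print(snp_type_ratios(reference, isolates, unique=False))
```

In this toy example the C > T change at one site is carried by all three isolates, so its share rises from 50% among unique mutations to 75% among non-unique mutations, while the A > G share falls from 50% to 25%.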
Table 1 lists the proportion of 12 SNP types of SARS-CoV-2 (i.e., A > T, A > C, A > G, T > A, T > C, T > G, C > T, C > A, C > G, G > T, G > C, G > A) worldwide. Here we only consider the unique SNPs.
Table 1: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates worldwide. Only the unique SNP mutations are considered in the calculation, i.e., the same mutation in different genome isolates is only counted once.
Type   A > T   A > C   A > G   T > A   T > C   T > G   C > T   C > A   C > G   G > T   G > C   G > A
Ratio  4.44%   3.75%   14.87%  3.43%   14.53%  2.80%   24.06%  4.00%   1.25%   13.33%  2.36%   11.17%
First, it is noticed that not all SARS-CoV-2 mutations are created equal. Mutation C > G only accounts for 1.25%. A few other mutation types, G > C, T > G, T > A, and A > C, are not frequent either. If mutations were random, each of the 12 mutation types would have a ratio of about 8.3% (1/12) on average. It can be seen that C > T has the largest proportion (24.06%), which is much higher than the average ratio. Therefore, the hypermutation C > T must be driven by additional mechanisms. It is well known that host RNA-APOBEC editing leads to excessive C > T transitions.
Moreover, the second most frequent mutation type is the A > G transition. Its ratio of 14.87% is much higher than the average ratio of 8.3%, indicating that RNA-ADAR editing is also involved in the host immune response. Although the high ratios of C > T and A > G reveal that the immune system is combating SARS-CoV-2 by two deaminase RNA editing mechanisms, the relatively high ratios of the reversed mutations T > C and G > A also indicate that SARS-CoV-2 fights back against the destructive gene editing using its defective proofreading and repairing mechanisms.
Finally, it is well known that mutations can be classified into four transition types (i.e., A > G, G > A, C > T, and T > C) and eight transversion types. Table 1 shows that all transition types have relatively high ratios, whereas all transversion types, except for G > T, have relatively low ratios. This is due to the fact that it is easier to substitute a single-ring nucleotide structure for another single-ring nucleotide structure than to substitute a double-ring nucleotide for a single-ring nucleotide. Additionally, transitions are more likely to result in silent mutations. Therefore, transversions can be more destructive to viral genomes.
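The comparison against a uniform baseline and the split into transitions and transversions can be checked directly from the Table 1 values. The sketch below is only an illustration; the variable names are ours, and the purine test is the standard definition of a transition (purine to purine or pyrimidine to pyrimidine).

```python
# Ratios (in percent) copied from Table 1.
TABLE1 = {
    "A>T": 4.44, "A>C": 3.75, "A>G": 14.87, "T>A": 3.43,
    "T>C": 14.53, "T>G": 2.80, "C>T": 24.06, "C>A": 4.00,
    "C>G": 1.25, "G>T": 13.33, "G>C": 2.36, "G>A": 11.17,
}
PURINES = {"A", "G"}


def is_transition(snp_type):
    """A transition keeps the base class: purine <-> purine or pyrimidine <-> pyrimidine."""
    ref, alt = snp_type.split(">")
    return (ref in PURINES) == (alt in PURINES)


# Expected share of each type if all 12 types were equally likely.
UNIFORM = 100 / len(TABLE1)

# Types whose observed ratio exceeds the uniform expectation.
ENRICHED = [t for t, ratio in TABLE1.items() if ratio > UNIFORM]

# Combined share of the four transition types.
TRANSITION_SHARE = sum(ratio for t, ratio in TABLE1.items() if is_transition(t))

if __name__ == "__main__":
    print(f"uniform expectation: {UNIFORM:.1f}%")
    print("above expectation:", ", ".join(ENRICHED))
    print(f"transitions: {TRANSITION_SHARE:.2f}%")
```

With the Table 1 values, the four transitions and G > T are the only types above the uniform expectation, and the four transitions together account for 64.63% of the unique mutations.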
Figure 1 illustrates the distribution of 12 SNP types among unique SNPs in SARS-CoV-2 genome isolates from different age groups. In general, the ratio of C > T gradually increases with age. Here, 42.1% C > T mutations are detected in patients who are older than 90 years old, indicating that the immune systems in elderly patients may fight against SARS-CoV-2 harder than the immune systems in young patients. However, the severe COVID-19 cases may be due to the over-response of the immune system. When SARS-CoV-2 infects a host cell, a set of proteins called cytokines will be released from a broad range of cells (mainly immune cells). Cytokines are involved in the immune response to produce more immune cells and recruit them to the sites of inflammation in order to fight against the viral infection. In turn, more cytokines can be released from the immune cells. This positive feedback loop will result in a “cytokine storm”, which can beget the exponential growth of inflammation, trigger apoptosis, and lead to organ damage [29].
Figure 1: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates from different age groups. The text inside each circle represents the total number of records that have the age information in different age groups.
Figure 2: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates from two gender groups. The text inside each circle represents the total number of records that have the gender information in different gender groups.
Figure 3: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates from different age groups among female patients. The text inside each circle represents the total number of records that have the age information in different age groups.
Therefore, we hypothesize that if the immune system overreacts to the invading pathogens, it is more likely to cause the cytokine storm and aggravate the condition of the COVID-19 patients. As can be seen in Figure 1, patients who are older than 80 years old have more C > T mutations compared to other age groups. This result suggests that the APOBEC3 activity in the immune system is more active and the immune response is stronger in older people. Consequently, the cytokine storm may happen more frequently in older people than it does in younger people. This might be one of the main causes of the high COVID-19 fatality for the elderly. Age-related mutagenesis, i.e., C > T transition, is known to cause more cancer diagnoses in the elderly [30].
Notably, the SARS-CoV-2 samples from children under five years old have a relatively high ratio of C > T mutations (39.6%), indicating that they also have a relatively active immune response when fighting against SARS-CoV-2. Moreover, the reversed mutation type T > C for samples from children under five years old and adults older than 90 years old has the second-largest ratio. In other age groups, T > C has the fourth-largest ratio. As discussed before, the reversed mutation T > C may reveal that SARS-CoV-2 is capable of fighting back against the host immune system. Therefore, we deduce that SARS-CoV-2 fiercely counter-attacks the immune system in children under five and adults older than 90.
Figure 4: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates from different age groups among male patients. The text inside each circle represents the total number of records that have the age information in different age groups.
Our result suggests that the immune systems of children under five years old are less well-developed and weaker than those of adults, so they have to fight more intensively when SARS-CoV-2 infects. This result suggests that children under five are at risk of COVID-19. However, the long-term health consequence of young children's unusual response to SARS-CoV-2 infection remains to be further studied.
Figure 2 shows the distribution of 12 SNP types in SARS-CoV-2 genome isolates globally from two gender groups. The ratio of C > T mutations in females is slightly higher than in males, which matches the finding that women have a stronger immune response than men [31, 32]. Moreover, Figure 3 and Figure 4 depict the distribution of 12 SNP types in different age groups among female and male patients. Overall, the proportion of C > T mutations in the SARS-CoV-2 genomes from females is higher than the C > T proportion in the SARS-CoV-2 genomes from males, except for the ages between 6 and 19 and older than 90. Therefore, we can deduce that the RNA editing has age and gender preferences: it is more likely to happen or become stronger for the females who are older than 90 years old or under 5 years old.
Table 2: The number of complete SARS-CoV-2 genomes with age/gender information in the United Kingdom, United States, Australia, India, and the world.
Country         Total counts  Age counts  Gender counts
United Kingdom  10740         2159        2134
United States   8729          1888        2095
Australia       1329          776         750
India           1088          1068        1071
World           33693         12513       12181
Figure 5: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates from the United Kingdom. The text inside each circle represents the total number of records in different age groups.
In this section, we analyze the distribution of SARS-CoV-2 mutations in different countries and regions. Limited by the number of complete genome sequences submitted to GISAID that have appropriate labels, we only analyze the countries with more than 1000 labeled sequences to maintain statistical significance. Table 2 lists the total number of SARS-CoV-2 sequences in the United Kingdom, United States, Australia, and India, together with the number of sequences that have age and gender information.
Figure 5, Figure 6, Figure 7, and Figure 8 illustrate the distribution of 12 SNP types in the SARS-CoV-2 genome isolates from different age groups in the United Kingdom, United States, Australia, and India, respectively. We can see that the SARS-CoV-2 genome isolates from the United Kingdom patients have the highest ratio of C > T compared to those from the other three countries. It is interesting to note that the SARS-CoV-2 genome isolates from the patients older than 80 years old from the United Kingdom and Australia have fewer C > T mutations, which is not consistent with the global pattern.
Figure 6: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates from the United States. The text inside each circle represents the total number of records in different age groups.
Figure 9 illustrates the distribution of 12 SNP types in six continents. The SARS-CoV-2 genome isolates from patients in Europe, Asia, and North America have a relatively low C > T mutation ratio (less than 35%), while the reversed T > C mutation ratio is relatively high (greater than 10%). On the contrary, South America, Oceania, and Africa have higher C > T ratios but lower T > C ratios. It is worth noting that the C > > G mutation ratios of genome isolates from Oceania and Africa are very low ( < > G mutation ratios of genome isolates from other regions are significantly higher ( >
Figure 7: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates from Australia. The text inside each circle represents the total number of records in different age groups.
The mutation preferences in sequence contexts may be used to predict the mutational signatures from genome sequences. Despite numerous studies of the mutation contexts in APOBEC editing in human cells, little is known about the mutation contexts in the SARS-CoV-2 genome. As we have a large number of SNP mutations from SARS-CoV-2 genomes, here we discuss the mutation frequencies from 2-mer and 3-mer sequence contexts. We present 4-mer sequence contexts in the Supporting Information.
In general, the patterns discussed in this section are consistent with those presented in Section 2.1.1. However, this section offers more detailed information about mutational signatures.
For mutation motifs of SNPs at the first position of 2-mers (Figure 10(a)), we observe that the 2-mer motif CW (where W is either A or T) for C > T mutation is the predominant context. Similarly, for mutation motifs of the SNPs at the second position of 2-mers (Figure 10(b)), the 2-mer motif WC for C > T mutation is the predominant context.
These results are consistent with the previous finding that mutations in TCW contexts (where W = A or T) are predominantly caused by APOBEC-catalyzed deamination of cytosine (C) to thymine (T) or uracil (U) in human cancer cells [33].
For the SNPs at the first position of 3-mers (ANN or TNN) (Figure 11(a)), we observe the following mutation patterns.
(1) ANN (except for AAC and ACC) has a high frequency in A > G mutations. AAC and ACC contexts have a high frequency in A > T mutations.
(2) TNN has a high frequency in T > C mutations.
For the SNPs at the first position of 3-mers (CNN or GNN) shown in Figure 11(b), we observe the following mutation patterns.
(1) CNN has a high frequency in C > T mutations.
(2) GGA has a high frequency in G > C mutations.
(3) GCA has a relatively high frequency in G > A mutations.
(4) GGN (N ≠ A) has a relatively high frequency in G > A mutations.
Figure 8: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates from India. The text inside each circle represents the total number of records in different age groups.
For the SNPs at the second position of the 3-mers (NAN or NTN) as shown in Figure 12(a), we observe the following mutation patterns.
(1) NAN has a high frequency in A > G mutations.
(2) NTN has a high frequency in T > C mutations.
(3) The T > C mutation also has a larger proportion in AN.
For the SNPs at the second position of the 3-mers (NCN or NGN) as shown in Figure 12(b), we observe the following mutation patterns.
(1) WGN (where W is A or T) has C > T dominated mutations except for AGG.
(2) SGN (where S is G or C) has G > A dominated mutations.
(3) AGG has a high frequency in G > A mutations.
(4) Characteristic combinations SCG (where S is G or C) are stable and only a few G > T mutations are detected.
(5) Characteristic combinations GGS (where S is G or C) are stable, having few G > T mutations.
For the SNPs at the third position of 3-mers (NNA or NNT) as shown in Figure 13(a), we observe the following mutation patterns.
(1) A > G mutation has a high frequency in NNA.
(2) T > C mutation has a high frequency in NNT.
(3) T > C mutation dominates in NGT and only a few T > A and T > G mutations are found in the sequence context of NGT.
For the SNPs at the third position of 3-mers (NNC or NNG) as shown in Figure 13(b), we observe the following mutation patterns.
(1) NNC has a high frequency in C > T mutations.
(2) G > T mutation has a high frequency in NNG.
(3) G > A also has a high frequency in the sequence context of NCG.
(4) Characteristic combinations CGC are stable and the mutations on these patterns are most likely to be C > T transitions.
Figure 9: The distribution of 12 SNP types among unique mutations in the SARS-CoV-2 genome isolates in six continents. The text inside each circle represents the total number of records in each continent.
Figure 10: SNP frequencies on 2-mer motifs. (a) SNP frequencies on the first position of 2-mer motifs. (b) SNP frequencies on the second position of 2-mer motifs.
Figure 11: SNP frequency at the first position of the 3-mer motifs. (a) A or T is at the first position of 3-mer motifs. (b) C or G is at the first position of 3-mer motifs.
Figure 12: SNP frequency on the second position of 3-mer motifs. (a) A or T is on the second position of 3-mer motifs. (b) C or G is on the second position of 3-mer motifs.
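The motif statistics above rest on reading the k-mer of the reference genome that surrounds each SNP. A minimal sketch of that bookkeeping follows, assuming SNPs are given as 0-based (position, reference base, mutant base) tuples; the function name, the `offset` parameter, and the toy reference are our own illustrative choices, not the study's implementation.

```python
from collections import Counter, defaultdict


def motif_counts(reference, snps, k=3, offset=0):
    """Count SNP types per k-mer context in the reference.

    offset is the index of the mutated base inside the k-mer
    (0 = first position, k - 1 = last position).
    """
    table = defaultdict(Counter)
    for pos, ref, alt in snps:
        start = pos - offset
        if start < 0 or start + k > len(reference):
            continue  # context would run off the end of the genome
        kmer = reference[start:start + k]
        if kmer[offset] != ref:
            raise ValueError(f"reference base mismatch at position {pos}")
        table[kmer][f"{ref}>{alt}"] += 1
    return table


if __name__ == "__main__":
    reference = "ATCAGGT"  # toy sequence, not real data
    snps = [(2, "C", "T"), (2, "C", "A"), (4, "G", "A")]
    for offset in range(3):
        counts = motif_counts(reference, snps, k=3, offset=offset)
        print(offset, {kmer: dict(c) for kmer, c in counts.items()})
```

Calling the function with `k=2` and `offset` 0 or 1 gives the 2-mer contexts in the same way; normalizing each inner counter by its total would turn the counts into the per-motif frequencies discussed above.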
It is reasonable to assume that five coronaviruses SARS-CoV (2003) [34], Bat-SL-BM48-31 (2008) [35], Bat-SL-CoVZC45 (2017) [36], Bat-SL-RaTG13 (2013) [37], and SARS-CoV-2 (2019) [2] are of the same origin but differ from each other by their evolutionary stages. Among them, the data collection date of Bat-SL-RaTG13 (2013) was denoted as July 24, 2013, while the data was not uploaded to the GISAID database until January 27, 2020. Figure 14 shows the mutation ratio among these five genomes. First, similar to the SARS-CoV-2 mutations listed in Table 1, the four transition types (i.e., A > G, G > A, C > T, and T > C) still have high mutation ratios. Particularly, the C > T type has the highest ratio, indicating that host immune response still plays the major role. However, transversion type G > T is not as important as that in the SARS-CoV-2 mutations discussed earlier. Nonetheless, transversion types A > T and T > A appear among the top six mutation types.
Figure 13: SNP frequency on the third position of the 3-mer motifs. (a) A or T is on the third position of 3-mer motifs. (b) C or G is on the third position of 3-mer motifs.
Figure 14: The distribution of 12 SNP types among SARS-CoV, Bat-SL-BM48-31, Bat-SL-CoVZC45, Bat-SL-RaTG13, and SARS-CoV-2. Here, the text on the top represents the reference genome and the text at the bottom represents the mutant sequence.
We hypothesize that gene editing via APOBEC (C > T) and ADAR (A > G) is a driving force for RNA viral evolution, as shown in Table 1. Viruses may fight back against the host immune response with either defective repair or reversed mutations (T > C) within surviving isolates. Therefore, the T > C mutation rate would decrease during evolution. We are interested in not only the C > T transition ratio, but also the ratio of C > T over T > C, the reversed transitions. From Table 1 and Figure 14, we can deduce the following:
1. From SARS-CoV-2 reference genome to 33693 genomes: C > T: 24.06%, T > C: 14.53% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.66)
2. From SARS-CoV to Bat-SL-BM48-31: C > T: 17.40%, T > C: 14.50% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.20)
3. From SARS-CoV to Bat-SL-CoVZC45: C > T: 18.20%, T > C: 13.20% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.37)
4. From SARS-CoV to Bat-SL-RaTG13: C > T: 18.00%, T > C: 12.50% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.50)
5. From SARS-CoV to SARS-CoV-2: C > T: 18.20%, T > C: 12.40% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.47)
6. From Bat-SL-BM48-31 to Bat-SL-CoVZC45: C > T: 15.10%, T > C: 13.40% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.13)
7. From Bat-SL-BM48-31 to Bat-SL-RaTG13: C > T: 15.60%, T > C: 13.10% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.19)
8. From Bat-SL-BM48-31 to SARS-CoV-2: C > T: 15.70%, T > C: 13.00% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.21)
9. From Bat-SL-CoVZC45 to Bat-SL-RaTG13: C > T: 20.10%, T > C: 18.70% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.07)
10. From Bat-SL-CoVZC45 to SARS-CoV-2: C > T: 20.20%, T > C: 18.20% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.11)
11. From Bat-SL-RaTG13 to SARS-CoV-2: C > T: 30.80%, T > C: 29.00% (Higher C > T ratio, relatively lower T > C ratio, and C > T to T > C ratio: 1.06)
It is seen that viral evolution order may be determined by the C > T over T > C ratio. By this analysis, we have the following evolution order for the aforementioned coronaviruses: SARS-CoV (2003) → Bat-SL-BM48-31 (2008) → Bat-SL-CoVZC45 (2017) → Bat-SL-RaTG13 (2013) → SARS-CoV-2 (2019) → → Bat-SL-RaTG13 (2013). This may happen for a few reasons.
First, these coronaviruses may not be of the same origin. Second, the data collection date may not be accurate. The sequence of Bat-SL-RaTG13 (2013) was not uploaded until 2020. Finally, our method may admit a few counterexamples.
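The direction criterion used in the list above, a C > T to T > C ratio greater than unity, can be restated as a short calculation. The sketch below copies the C > T and T > C percentages from the eleven pairs; the function names are illustrative. Because those percentages are rounded, the recomputed ratios need not match the reported values exactly.

```python
# (reference genome, mutant genome, C>T percent, T>C percent), copied from the list above.
PAIRS = [
    ("SARS-CoV-2 reference", "33693 genomes", 24.06, 14.53),
    ("SARS-CoV", "Bat-SL-BM48-31", 17.40, 14.50),
    ("SARS-CoV", "Bat-SL-CoVZC45", 18.20, 13.20),
    ("SARS-CoV", "Bat-SL-RaTG13", 18.00, 12.50),
    ("SARS-CoV", "SARS-CoV-2", 18.20, 12.40),
    ("Bat-SL-BM48-31", "Bat-SL-CoVZC45", 15.10, 13.40),
    ("Bat-SL-BM48-31", "Bat-SL-RaTG13", 15.60, 13.10),
    ("Bat-SL-BM48-31", "SARS-CoV-2", 15.70, 13.00),
    ("Bat-SL-CoVZC45", "Bat-SL-RaTG13", 20.10, 18.70),
    ("Bat-SL-CoVZC45", "SARS-CoV-2", 20.20, 18.20),
    ("Bat-SL-RaTG13", "SARS-CoV-2", 30.80, 29.00),
]


def ct_over_tc(ct_percent, tc_percent):
    """Ratio of the C > T share to the T > C share."""
    return ct_percent / tc_percent


def is_forward(ct_percent, tc_percent):
    """Apply the proposed criterion: a ratio above unity indicates forward evolution."""
    return ct_over_tc(ct_percent, tc_percent) > 1.0


if __name__ == "__main__":
    for ref, mutant, ct, tc in PAIRS:
        direction = "forward" if is_forward(ct, tc) else "not forward"
        print(f"{ref} -> {mutant}: {ct_over_tc(ct, tc):.2f} ({direction})")
```

All eleven listed pairs satisfy the criterion when evaluated in the stated direction. Testing the reverse direction of a pair would require the SNP distribution computed with the two genomes swapped, which is not listed here.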
Discussions
The SNP type distribution of 33693 SARS-CoV-2 isolates is listed in Table 1. The ratio of the C > T SNP mutation is remarkably higher than that of other mutation types. From the distribution of the 12 SNP types, we may infer that the excessive C > T transitions cannot be explained by random mutations. Instead, hypermutation C > T is due to the cytosine-to-uridine deamination gene editing in the human host response.
Figure 15: Comparison of the ratios of 12 SNP types among unique mutations (red) and non-unique mutations (blue) in SARS-CoV-2 genomes globally. Here, if we count the same mutation that appears in different SARS-CoV-2 isolates only once, then we call those mutations unique mutations. If we count the same mutation in different SARS-CoV-2 isolates repeatedly according to its frequency, then all of the mutations that are detected in the complete SARS-CoV-2 genome sequences are called non-unique mutations.
Figure 15 presents a comparison of the ratios of 12 SNP types among unique and non-unique mutations over all of the SARS-CoV-2 genome isolates. The most striking feature is that the C > T ratio is more than doubled in the non-unique mutations, which indicates the overwhelming host immune response to viral infection. Another interesting feature is that the inverse transition T > C has a dramatic reduction of 68% from the unique mutation ratio to the non-unique mutation ratio. These changes reflect the fact that many C > T mutations are high-frequency ones, whereas the T > C mutations reversed by the virus are of low frequency in nature. The same explanation applies to many mutation types in Figure 15 that have significantly reduced ratios in the non-unique mutations. However, we observe that the ratios of mutation types A > G, G > T, and G > A do not change much in the non-unique mutations, reflecting the fact that these mutation types maintain a near-average frequency.
Figure 15 shows that the second most frequent mutation type is the A > G transition, standing at 13.6%. The combined C > T and A > G transition types account for nearly 65% of all mutations. Therefore, host gene editing via APOBEC and ADAR is the major driving force of SARS-CoV-2 evolution.
Neutralizing antibodies play a significant role in the clearance of viruses and have been considered a crucial immune artifact for the defense or treatment of viral diseases. However, a clinical study shows that five percent of people who recovered from COVID-19 had no detectable antibodies [38]. Another observation is that there is a large number of asymptomatic carrier transmissions of COVID-19 [39]. The reason for the no-antibody COVID-19 recoveries and asymptomatic carriers is unknown. From the mutation analysis in this study, the APOBEC3 RNA editing is implicated as a strong secondary defense system for mutating the virus and, consequently, mitigating infection.
We postulate that COVID-19 recoveries or convalescents withoutantibody and some asymptomatic carriers are probably owing to the increased APOBEC3 activity in hostimmune systems. 17
Methods and material
Here, 33,693 complete genomes of SARS-CoV-2 strains from around the globe were retrieved from the GISAID database [40] as of July 31, 2020. Only complete, high-coverage genomes that have no stretches of 'NNNNN' are included in the dataset. The complete genome sequences are aligned with the reference genome of SARS-CoV-2 by the multiple sequence alignment (MSA) tool Clustal Omega using the default parameters [41]. The SNP mutations are retrieved from the aligned genomes according to the reference SARS-CoV-2 genome (GenBank accession number: NC_045512.2) [2]. The SNP profile, including nucleotide changes and the corresponding positions in a genome, can be considered as the genotype of the virus.
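The filtering and SNP-retrieval steps can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the programs used in this study: the function names `is_high_coverage` and `snp_profile` and the toy sequences are hypothetical, the alignment is assumed to be already computed (for example by Clustal Omega), and only substitutions between unambiguous bases are reported.

```python
def is_high_coverage(genome, gap="NNNNN"):
    """Keep a genome only if it contains no stretch of 'NNNNN'."""
    return gap not in genome.upper()


def snp_profile(aligned_ref, aligned_query):
    """List (reference_position, ref_base, query_base) for each substitution.

    Both inputs are rows of the same alignment, so they must have equal
    length. Positions are 1-based and counted on the ungapped reference.
    Gaps and ambiguous bases (such as N) are skipped.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have equal length")
    bases = set("ACGT")
    profile = []
    ref_pos = 0
    for r, q in zip(aligned_ref.upper(), aligned_query.upper()):
        if r != "-":
            ref_pos += 1  # advance only on real reference bases
        if r in bases and q in bases and r != q:
            profile.append((ref_pos, r, q))
    return profile


# Hypothetical toy alignment; '-' marks an insertion in the query.
ref = "ACGT-ACCTA"
query = "ATGTGACCCA"
print(snp_profile(ref, query))
```

Counting positions on the ungapped reference keeps the coordinates of every isolate comparable, which is what lets the resulting list of (position, nucleotide change) pairs serve as a genotype.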
Clustal Omega is employed to carry out the multiple sequence alignment. The genomic analysis is performed using computer programs written in Python with the Biopython libraries [42].
We use genotyping to analyze the mutation types and their distributions in SARS-CoV-2 genome isolates. We show that host gene editing, namely by APOBEC (apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like) and ADAR (adenosine deaminases acting on RNA), is the main driving force of SARS-CoV-2 evolution, accounting for nearly 65% of recorded mutations. We reveal that the immune systems of children under age five and the elderly appear to overreact to SARS-CoV-2 infection, so these groups may be at high risk from COVID-19. Some minor gender dependence in immune response was also detected. We uncover that the populations of Oceania and Africa react significantly more intensively to SARS-CoV-2 infection than those of Europe and Asia. Our study indicates that while systemic health and social inequities have put African Americans at increased risk of getting sick from COVID-19, the overreaction of their immune systems to viral infection may have put them at increased risk of dying from COVID-19. The mutational signatures have been analyzed to explore the preferred gene editing environments. Finally, we show that the ratio of mutation type C > T over T > C may be used to indicate the evolution direction and to distinguish the evolution order between two genome sequences of the same origin.
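One possible reading of the C > T over T > C criterion is sketched below in Python. The decision rule is an assumption made for illustration: this section does not state a threshold, so the sketch simply compares the two counts, which is equivalent to asking whether the ratio exceeds one while avoiding division by zero. The function names and the toy sequences are hypothetical, and the two sequences are assumed to be aligned already.

```python
def ct_and_tc_counts(aligned_a, aligned_b):
    """Count C>T and T>C changes when reading from sequence a to sequence b."""
    pairs = list(zip(aligned_a.upper(), aligned_b.upper()))
    c_to_t = sum(1 for pair in pairs if pair == ("C", "T"))
    t_to_c = sum(1 for pair in pairs if pair == ("T", "C"))
    return c_to_t, t_to_c


def likely_earlier(aligned_a, aligned_b):
    """Return 'a' or 'b' for the sequence suggested to be earlier, or None on a tie.

    Assumption: because C>T changes dominate T>C changes, the direction in
    which C>T outnumbers T>C is taken as the direction of evolution.
    """
    c_to_t, t_to_c = ct_and_tc_counts(aligned_a, aligned_b)
    if c_to_t > t_to_c:
        return "a"
    if t_to_c > c_to_t:
        return "b"
    return None


# Hypothetical toy pair: two C>T changes and one T>C change from a to b.
seq_a = "CCCTACGT"
seq_b = "TTCCACGT"
print(ct_and_tc_counts(seq_a, seq_b), likely_earlier(seq_a, seq_b))
```

With only a handful of differing sites the counts are too small to be informative, so a practical use of this idea would need enough C > T and T > C sites for the comparison to be meaningful.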
Supporting Information
Supporting information is available for supplementary figures, including the distribution of 12 SNP types among non-unique mutations, the distribution of 12 SNP types between each pair of 10 coronaviruses, and the 18-mer analysis of mutational signatures. Supplementary tables are available for GISAID IDs and the GISAID acknowledgment.
Acknowledgment
This work was supported in part by NIH grants GM126189 and AI145504, NSF grants DMS-1721024, DMS-1761320, and IIS1900473, Michigan Economic Development Corporation, George Mason University award PD45722, Bristol-Myers Squibb, and Pfizer. The authors thank the IBM TJ Watson Research Center, the COVID-19 High Performance Computing Consortium, and NVIDIA for computational assistance.
References
[1] WHO. Coronavirus disease 2019 (COVID-19) situation report 193. Coronavirus Disease (COVID-2019) Situation Reports, 00(00):00–00, 2020.
[2] Fan Wu, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, Zhao-Wu Tao, Jun-Hua Tian, Yuan-Yuan Pei, et al. A new coronavirus associated with human respiratory disease in China. Nature, 579(7798):265–269, 2020.
[3] Xiaodong Xiao, Samitabh Chakraborti, Anthony S Dimitrov, Kosi Gramatikoff, and Dimiter S Dimitrov. The SARS-CoV S glycoprotein: expression and functional characterization. Biochemical and biophysical research communications, 312(4):1159–1164, 2003.
[4] Ilona Glowacka, Stephanie Bertram, Marcel A Müller, Paul Allen, Elizabeth Soilleux, Susanne Pfefferle, Imke Steffen, Theodros Solomon Tsegaye, Yuxian He, Kerstin Gnirss, et al. Evidence that TMPRSS2 activates the severe acute respiratory syndrome coronavirus spike protein for membrane fusion and reduces viral control by the humoral immune response. Journal of virology, 85(9):4122–4134, 2011.
[5] Markus Hoffmann, Hannah Kleine-Weber, Simon Schroeder, Nadine Krüger, Tanja Herrler, Sandra Erichsen, Tobias S Schiergens, Georg Herrler, Nai-Huei Wu, Andreas Nitsche, et al. SARS-CoV-2 cell entry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor. Cell, 2020.
[6] Ruth McBride, Marjorie Van Zyl, and Burtram C Fielding. The coronavirus nucleocapsid is a multifunctional protein. Viruses, 6(8):2991–3018, 2014.
[7] Peng Yue, Zhaolong Li, and John Moult. Loss of protein structure stability as a major causative factor in monogenic disease. Journal of molecular biology, 353(2):459–473, 2005.
[8] Shannon Stefl, Hafumi Nishi, Marharyta Petukh, Anna R Panchenko, and Emil Alexov. Molecular mechanisms of disease-causing missense mutations. Journal of molecular biology, 425(21):3919–3936, 2013.
[9] Rui Wang, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Mutations on COVID-19 diagnostic targets. arXiv preprint arXiv:2005.02188, 2020.
[10] Bette Korber, Will M Fischer, Sandrasegaram Gnanakaran, Hyejin Yoon, James Theiler, Werner Abfalterer, Nick Hengartner, Elena E Giorgi, Tanmoy Bhattacharya, Brian Foley, et al. Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infectivity of the COVID-19 virus. Cell, 2020.
[11] Nathan D Grubaugh, William P Hanage, and Angela L Rasmussen. Making sense of mutation: what D614G means for the COVID-19 pandemic remains unclear. Cell, 2020.
[12] Jiahui Chen, Rui Wang, Menglun Wang, and Guo-Wei Wei. Mutations strengthened SARS-CoV-2 infectivity. arXiv preprint arXiv:2005.14669, 2020.
[13] Marion Sevajol, Lorenzo Subissi, Etienne Decroly, Bruno Canard, and Isabelle Imbert. Insights into RNA synthesis, capping, and proofreading mechanisms of SARS-coronavirus. Virus research, 194:90–99, 2014.
[14] François Ferron, Lorenzo Subissi, Ana Theresa Silveira De Morais, Nhung Thi Tuyet Le, Marion Sevajol, Laure Gluais, Etienne Decroly, Clemens Vonrhein, Gérard Bricogne, Bruno Canard, et al. Structural and molecular basis of mismatch correction and ribavirin excision from coronavirus RNA. Proceedings of the National Academy of Sciences, 115(2):E162–E171, 2018.
[15] Rafael Sanjuán and Pilar Domingo-Calap. Mechanisms of viral mutation. Cellular and molecular life sciences, 73(23):4433–4448, 2016.
[16] Changchuan Yin. Genotyping coronavirus SARS-CoV-2: methods and implications. Genomics, 2020.
[17] Tung Phan. Genetic diversity and evolution of SARS-CoV-2. Infection, genetics and evolution, 81:104260, 2020.
[18] Kashif Aziz Khan and Peter Cheung. Presence of mismatches between diagnostic PCR assays and coronavirus SARS-CoV-2 genome. Royal Society Open Science, 7(6):200636, 2020.
[19] Rui Wang, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Decoding SARS-CoV-2 transmission, evolution, and ramification on COVID-19 diagnosis, vaccine, and medicine. Journal of Chemical Information and Modeling, https://doi.org/10.1021/acs.jcim.0c00501, 2020.
[20] Alina Baum, Benjamin O Fulton, Elzbieta Wloga, Richard Copin, Kristen E Pascal, Vincenzo Russo, Stephanie Giordano, Kathryn Lanza, Nicole Negron, Min Ni, et al. Antibody cocktail to SARS-CoV-2 spike protein prevents rapid mutational escape seen with individual antibodies. Science, 2020.
[21] Salvatore Di Giorgio, Filippo Martignano, Maria Gabriella Torcia, Giorgio Mattiuz, and Silvestro G Conticello. Evidence for RNA editing in the transcriptome of 2019 novel coronavirus. Science Advances, 6(25), 2020.
[22] Yong-Hui Zheng, Dan Irwin, Takeshi Kurosu, Kenzo Tokunaga, Tetsutaro Sata, and B Matija Peterlin. Human APOBEC3F is another host factor that blocks human immunodeficiency virus type 1 replication. Journal of virology, 78(11):6073–6076, 2004.
[23] Kazuko Nishikura. A-to-I editing of coding and non-coding RNAs by ADARs. Nature reviews Molecular cell biology, 17(2):83–96, 2016.
[24] Harold C Smith, Ryan P Bennett, Ayse Kizilyer, William M McDougall, and Kimberly M Prohaska. Functions and regulation of the APOBEC family of proteins. In Seminars in cell & developmental biology, volume 23, pages 258–268. Elsevier, 2012.
[25] Mei-Chen Liu, Wen-Yun Liao, Katherine M Buckley, Shu Yuan Yang, Jonathan P Rast, and Sebastian D Fugmann. AID/APOBEC-like cytidine deaminases are ancient innate immune mediators in invertebrates. Nature communications, 9(1):1–11, 2018.
[26] Charles E Samuel. Adenosine deaminases acting on RNA (ADARs) are both antiviral and proviral. Virology, 411(2):180–193, 2011.
[27] Sarah R Gonzales-van Horn and Peter Sarnow. Making the mark: the role of adenosine modifications in the life cycle of RNA viruses. Cell host & microbe, 21(6):661–669, 2017.
[28] Reuben S Harris and Jaquelin P Dudley. APOBECs and virus restriction. Virology, 479:131–145, 2015.
[29] Peipei Song, Wei Li, Jianqin Xie, Yanlong Hou, and Chongge You. Cytokine storm induced by SARS-CoV-2. Clinica Chimica Acta, 2020.
[30] Ludmil B Alexandrov, Serena Nik-Zainal, David C Wedge, Samuel AJR Aparicio, Sam Behjati, Andrew V Biankin, Graham R Bignell, Niccolo Bolli, Ake Borg, Anne-Lise Børresen-Dale, et al. Signatures of mutational processes in human cancer. Nature, 500(7463):415–421, 2013.
[31] Anura Hewagama, Dipak Patel, Sushma Yarlagadda, Faith M Strickland, and Bruce C Richardson. Stronger inflammatory/cytotoxic T-cell response in women identified by microarray analysis. Genes & Immunity, 10(5):509–516, 2009.
[32] Sabra L Klein. Sex influences immune responses to viruses, and efficacy of prophylaxis and treatments for viral diseases. Bioessays, 34(12):1050–1059, 2012.
[33] Steven A Roberts, Michael S Lawrence, Leszek J Klimczak, Sara A Grimm, David Fargo, Petar Stojanov, Adam Kiezun, Gregory V Kryukov, Scott L Carter, Gordon Saksena, et al. An APOBEC cytidine deaminase mutagenesis pattern is widespread in human cancers. Nature genetics, 45(9):970–976, 2013.
[34] Nelson Lee, David Hui, Alan Wu, Paul Chan, Peter Cameron, Gavin M Joynt, Anil Ahuja, Man Yee Yung, CB Leung, KF To, et al. A major outbreak of severe acute respiratory syndrome in Hong Kong. New England Journal of Medicine, 348(20):1986–1994, 2003.
[35] Jan Felix Drexler, Florian Gloza-Rausch, Jörg Glende, Victor Max Corman, Doreen Muth, Matthias Goettsche, Antje Seebens, Matthias Niedrig, Susanne Pfefferle, Stoian Yordanov, et al. Genomic characterization of severe acute respiratory syndrome-related coronavirus in European bats and classification of coronaviruses based on partial RNA-dependent RNA polymerase gene sequences. Journal of virology, 84(21):11336–11349, 2010.
[36] Dan Hu, Changqiang Zhu, Lele Ai, Ting He, Yi Wang, Fuqiang Ye, Lu Yang, Chenxi Ding, Xuhui Zhu, Ruicheng Lv, et al. Genomic characterization and infectivity of a novel SARS-like coronavirus in Chinese bats. Emerging Microbes & Infections, 7(1):1–10, 2018.
[37] Peng Zhou, Xing-Lou Yang, Xian-Guang Wang, Ben Hu, Lei Zhang, Wei Zhang, Hao-Rui Si, Yan Zhu, Bei Li, Chao-Lin Huang, et al. A pneumonia outbreak associated with a new coronavirus of probable bat origin. Nature, 579(7798):270–273, 2020.
[38] Fan Wu, Aojie Wang, Mei Liu, Qimin Wang, Jun Chen, Shuai Xia, Yun Ling, Yuling Zhang, Jingna Xun, Lu Lu, et al. Neutralizing antibody responses to SARS-CoV-2 in a COVID-19 recovered patient cohort and their implications. 2020.
[39] Yan Bai, Lingsheng Yao, Tao Wei, Fei Tian, Dong-Yan Jin, Lijuan Chen, and Meiyun Wang. Presumed asymptomatic carrier transmission of COVID-19. JAMA, 323(14):1406–1407, 2020.
[40] Yuelong Shu and John McCauley. GISAID: Global initiative on sharing all influenza data–from vision to reality. Eurosurveillance, 22(13), 2017.
[41] Fabian Sievers and Desmond G Higgins. Clustal Omega. Current protocols in bioinformatics, 48(1):3–13, 2014.
[42] Peter JA Cock, Tiago Antao, Jeffrey T Chang, Brad A Chapman, Cymon J Cox, Andrew Dalke, Iddo Friedberg, Thomas Hamelryck, Frank Kauff, Bartek Wilczynski, et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics.