Characterizing SARS-CoV-2 mutations in the United States
Rui Wang, Jiahui Chen, Kaifu Gao, Yuta Hozumi, Changchuan Yin, Guo-Wei Wei
CCharacterizing SARS-CoV-2 mutations in the United States
Rui Wang , Jiahui Chen , Kaifu Gao , Yuta Hozumi , Changchuan Yin , and Guo-Wei Wei , , * Department of Mathematics,Michigan State University, MI 48824, USA. Department of Mathematics, Statistics, and Computer Science,University of Illinois at Chicago, Chicago, IL 60607, USA Department of Electrical and Computer Engineering,Michigan State University, MI 48824, USA. Department of Biochemistry and Molecular Biology,Michigan State University, MI 48824, USA.July 28, 2020
Abstract
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been mutating since it wasfirst sequenced in early January 2020. The genetic variants have developed into a few distinct clusterswith different properties. Since the United States (US) has the highest number of viral infected patientsglobally, it is essential to understand the US SARS-CoV-2. Using genotyping, sequence-alignment, time-evolution, k -means clustering, protein-folding stability, algebraic topology, and network theory, we revealthat the US SARS-CoV-2 has four substrains and five top US SARS-CoV-2 mutations were first detectedin China (2 cases), Singapore (2 cases), and the United Kingdom (1 case). The next three top US SARS-CoV-2 mutations were first detected in the US. These eight top mutations belong to two disconnectedgroups. The first group consisting of 5 concurrent mutations is prevailing, while the other group withthree concurrent mutations gradually fades out. Our analysis suggests that female immune systems aremore active than those of males in responding to SARS-CoV-2 infections. We identify that one of the topmutations, 27964C > T-(S24L) on ORF8, has an unusually strong gender dependence. Based on the analysisof all mutations on the spike protein, we further uncover that three of four US SASR-CoV-2 substrainsbecome more infectious. Our study calls for effective viral control and containing strategies in the US.
Key words: COVID-19, SARS-CoV-2, spike protein, ORF8a, genotyping, persistent homology, networktheory, machine learning
Contents * Corresponding author. E-mail: [email protected] i a r X i v : . [ q - b i o . GN ] J u l .1.2 Top mutations in the United States . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32.1.3 Co-mutation analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42.1.4 Evolutionary analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.1.5 Gender analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2 Protein-specific analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2.1 Mutation on the NSP12 protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62.2.2 Mutation on the Spike protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.2.3 Mutation on the ORF3a protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92.2.4 Mutation on the NSP2 protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122.2.5 Mutations on the NSP13 protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.2.6 Mutations on the ORF8 protein . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.3 Infectivity analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.3.1 Cluster A infectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.3.2 Cluster B infectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192.3.3 Cluster C infectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.3.4 Cluster D infectivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 K -means clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.5 Topology-based machine learning prediction of protein-protein binding affinity changes fol-lowing mutations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.6 Topology-based machine learning prediction of protein folding stability changes followingmutation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.7 Graph network models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 ii Introduction
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a strain of β -coronavirus that causesthe respiratory illness, is responsible for the ongoing global pandemic of coronavirus disease 2019 (COVID-19). At this stage, more than 200 countries, regions, and territories have reported positive COVID-19 infec-tions. Among them, the United States (US) has over 4 million confirmed cases and 145 thousand deceasedcases as of July 22, 2020 [1]. The rapid spread of COVID-19 gives rise to a question of whether SARS-CoV-2 has become more transmissible or infectious in the US. Benefiting from the vast collection of completegenome sequencing data of SARS-CoV-2 deposited in GISAID, studies on the mutation dynamics provideus a way to investigate the characteristics of US SARS-CoV-2 strains and understand their ramifications inthe US population health and economy.SARS-CoV-2 belongs to the Coronaviridae family and the Nidovirales order, which has been shownto have a genetic proofreading mechanism in its replication achieved by an enzyme called non-structureprotein 14 (NSP14) in synergy with NSP12, i.e., RNA-dependent RNA polymerase (RdRp) [2, 3]. There-fore, SARS-CoV-2 has a higher fidelity in its transcription and replication process than that of other single-stranded RNA viruses, such as flu virus and HIV. Nonetheless, 4968 single mutations have been gradu-ally detected in the 7823 US strains in the past few months compared with the first SARS-CoV-2 genomecollected on December 24, 2019 [4, 5]. The frequency of virus mutations is accumulated by the natural se-lection, cellular environment, polymerase fidelity, random genetic drift, features of recent epidemiology,host immune responses, gene editing, replication mechanism, etc [6, 7]. However, it is difficult to deter-mine the detailed mechanism of a specific mutation. Additionally, the current mechanistic understandingof SARS-CoV-2 proteins, protein-protein interactions, and their synergy with host cell proteins, enzymes,and signaling pathways is very limited. There is an urgent need to further explore the SARS-CoV-2 struc-ture, function, and activity. Analyzing SARS-CoV-2 genome mutations and evolution provides an efficientmeans to understand viral genotyping and phenotyping. Although experimental verification is requiredto ultimately determine how single mutations with high frequency have changed SARS-CoV-2 properties,such as viral infectivity and virulence, the relationship between these single high-frequency mutations andSARS-CoV-2 properties can still be deciphered from genotyping, protein analysis, and transmission track-ing.In this work, we extract 7823 single nucleotide polymorphism (SNP) profiles from genome isolates col-lected in the United States. We found that the US SARS-CoV-2 has developed into four substrains. Tofurther understand the characteristic of the US SARS-CoV-2 evolution and transmission, we investigate itsmost prevalent high-frequency or top mutations. Based on genotyping results, top eight missense muta-tions (i.e., 14408C > T-(P323L), 23403A > G-(D614G), 25563G > T-(Q57H), 1059C > T-(T85I), 28144T > C-(L84S),17858A > G-(Y541C), 17747C > T-(P504L), and 27964C > T-(S24L)) are identified, in addition to three synony-mous mutations (i.e., 3037C > T-(F106F), 8782C > T-(S76S), and 18060C > T-(L7L)) that do not change SARS-CoV-2 proteins. Among them, mutations 17858A > G-(Y541C), 17747C > T-(P504L), and 27964C > T-(S24L)were originated and prevalent mainly in the US, attributing to the unique characteristics of COVID-19 inthe US. We focus on the top eight missense mutations in the present work. Based on co-mutation andtime evolution analysis, we hypothesize that three concurrent mutations 17747C > T-(P504L), 17858A > G-(Y541C), and 28144T > C tend to fade out, while the other five concurrent mutations can enhance the infec-tivity of SARS-CoV-2. Moreover, by analyzing the gender disparity, we find that a US-unique mutation,27964C > T-(S24L), shows an interesting female-dominated pattern.NSP2, NSP12 (RdRp), NSP13 (helicase), spike (S) protein, ORF3a, and ORF8 are six SARS-CoV-2 pro-teins that associate with eight different missense mutations mentioned above. It is important to analyzethe impact of mutations on the structures, stabilities, and functions of these proteins. Employing the Jac-card distance-based genotyping [8, 9], topological data analysis [10], artificial intelligence [11], flexibility-1igidity index (FRI) [12], and network models [13], we show that 23403A > G-(D614G) and 27964C > T-(S24L)strengthen the folding stability of the spike protein and ORF8 protein. Conversely, the other high-frequencymissense mutations disrupt the folding stability of its corresponding proteins. Furthermore, because SARS-CoV-2 enters the host cells by the interaction of the S protein and the host angiotensin-converting enzyme2 (ACE2) receptor [14], we quantitatively evaluate the mutation-induced protein-protein binding free en-ergy changes ( ∆∆ G ) of S protein and ACE2. The results reveal that overall, the frequency of infectivity-strengthening mutations on the S protein is higher than the infectivity-weakening mutations, explainingthe fast person-to-person transmission in the United States.In a nutshell, we analyze the characteristics of SARS-CoV-2 substrains and prevalent mutations in theUnited States. These mutations, together with additional mutations on the S protein, suggest that SARS-CoV-2 has become more infectious in the United States. Figure 1: Pie chart plot of four clusters in the United States as of July 14, 2020. The blue, red, yellow, and green colors representclusters A, B, C, and D, respectively. The base color of each state is decided by its dominant cluster. Some of the states do not submitthe complete genome sequences to GISAID. Therefore, we will not set the base color of these states. k -means clustering method. Based on the mutations, the 7832genome isolates in the United States are clustered into four distinct clusters, as shown in Figure 1. The blue,red, yellow, and green represent Cluster A, B, C, and D, respectively. The base color of each state is deter-mined by its dominated cluster. We show that most of the states are dominated by Cluster B and Cluster D.Table 1 lists the distribution of samples and the total number of single mutations in 20 US states that havesubmitted significantly many SARS-CoV-2 genome isolates. They are Alaska (AK), Arizona (AZ), California(CA), Connecticut (CT), Washington, D.C. (DC), Florida (FL), Idaho (ID), Illinois (IL), Louisiana (LA), Mary-land (MD), Massachusetts (MA), Michigan (MI), Minnesota (MN), New Mexico (NM), New York (NY),Oregon (OR), Utah (UT), Virginia (VA), Washington (WA), and Wisconsin (WI). In Cluster A, B, C, and D,the co-mutations with the highest number of descendants are [8782C > T, 18060C > T, 28144T > C], [241C > T,3037C > T, 14408C > T, 23403A > G], [11083G > T], and [3037C > T, 14408C > T], respectively. It is noted that allof the 20 states have the mutations from Clusters B and D. Alaska (AK) and Louisiana (LA) do not havemutations from Cluster A, and New Mexico (NM) does not have mutations in Clusters A and C. The com-plete list of the table can be found in the supporting materials. More analysis related to the infectivity ofSARS-CoV-2 based on our four distinct clusters is given in Section 2.3.
Table 1: The cluster distributions of samples ( N NS ) and total mutation counts ( N TF ) in 20 states of the U.S. The 20states are Alaska (AK), Arizona (AZ), California (CA), Connecticut (CT), Washington, D.C. (DC), Florida (FL), Idaho(ID), Illinois (IL), Louisiana (LA), Maryland (MD), Massachusetts (MA), Michigan (MI), Minnesota (MN), New Mexico(NM), New York (NY), Oregon (OR), Utah (UT), Virginia (VA), Washington (WA), and Wisconsin (WI). Cluster A Cluster B Cluster C Cluster DState N NS N TF N NS N TF N NS N TF N NS N TF AK 0 0 12 99 4 27 14 105AZ 8 51 46 364 4 44 18 156CA 99 752 414 3308 152 830 561 5999CT 1 11 44 365 1 12 62 701DC 1 6 15 108 4 51 2 34FL 2 12 85 672 21 275 73 810ID 1 10 16 147 1 9 33 460IL 8 57 28 202 19 184 30 265LA 0 0 264 2302 3 80 77 961MD 1 6 23 178 16 221 18 257MA 1 8 21 138 3 21 9 64MI 2 13 192 1486 11 111 61 572MN 43 311 177 1398 31 236 212 2537NM 0 0 15 98 0 0 6 52NY 17 108 977 7338 70 560 204 1918OR 13 84 46 395 50 269 87 762UT 12 76 32 243 6 50 21 161VA 15 100 163 1291 13 108 136 1430WA 982 7274 598 4799 49 558 465 5316WI 11 74 137 1019 178 973 334 3539
To investigate the implications of mutations on the transmission, infection, and virulence of SARS-CoV-2 inthe United States, we focus on the high-frequency or top mutations that represent the most common char-acteristics of SARS-CoV-2 in the United States. A total of 11 mutations in the United States has a frequencygreater than 700. Among them, 3 mutations are synonymous ones (i.e., 3037C > T-(F106F), 8782C > T-(S76S),3 able 2: Top 8 missense mutations that are prevalent in the United States. The ranking of these 8 mutations in the US and world areincluded in the table. NC
U.S. and NC
World represent for the total number of sequences with a specific mutation in the United Statesand in the world, respectively. The last column records the date that these eight missense mutations were detected for the first timein the world and in the United States. The second-last column records their corresponding countries, i.e. the country lists at the topshows where the mutations first detected, and the country lists at the bottom will always be the United States. Here, UK, US, CN, SGrepresent the United Kingdom, the United States, China, and Singapore, respectively. We use ISO 8601 format YYYY-MM-DD as thedate format.
Rank: US/World Mutation Protein NC
U.S. NC World
Country DateTop 1/Top 2 14408C > T-(P323L) NSP12(RdRp) 5918 22081 UK 2020-02-03US 2020-02-28Top 2/Top 1 23403A > G-(D614G) Spike 5912 22162 CN 2020-01-24US 2020-02-28Top 3/Top 3 25563G > T-(Q57H) ORF3a 4827 8326 SG 2020-02-16US 2020-02-29Top 4/Top 7 1059C > T-(T85I) NSP2 4237 6595 SG 2020-02-16US 2020-02-29Top 5/Top 8 28144T > C-(L84S) ORF8 1434 2487 CN 2020-01-05US 2020-01-19Top 6/Top 10 17858A > G-(Y541C) NSP13(Helicase) 1245 1428 US 2020-02-20US 2020-02-20Top 7/Top 11 17747C > T-(P504L) NSP13(Helicase) 1224 1396 US 2020-02-20US 2020-02-20Top 8/Top 14 27964C > T-(S24L) ORF8 705 749 US 2020-02-20US 2020-02-20and 18060C > T-(L7L)) and 8 mutations are the missense mutations (i.e., 14408C > T-(P323L), 23403A > G-(D614G), 25563G > T-(Q57H), 1059C > T-(T85I), 28144T > C-(L84S), 17858A > G-(Y541C), 17747C > T-(P504L),and 27964C > T-(S24L)). Since synonymous mutations do not change SARS-CoV-2 proteins, we only focuson the other eight missense mutations. Table 2 lists the frequencies of the top 8 missense mutations in theUnited States. The dates that these mutations were first detected in the world and in the United Statesare also included in the Table 2. The first missense mutation, 23403A > G-(D614G), occurred in China onJanuary 24, 2020. The missense mutation with the highest frequency, 14408C > T-(P323L), occurred in theUnited Kingdom on February 3, 2020. Both mutations were first detected in the US on February 28, 2020.The first top mutation recorded in the US was 28144T > C-(L84S), on January 19. This mutation was origi-nally detected in China on January 5, 2020.Note that three of the top 8 missense mutations, i.e., 17858A > G-(Y541C), 17747C > T-(P504L), and 27964C > T-(S24L), first appeared in the United States. In fact, over 87% of these mutations are kept in the United States.Note that the top 5 mutations in the United States are also the top 5 mutations in the world. However,the next 3 mutations in the US is not ranked within the top 8 globally.
The statistical values of pairwise co-mutations from the top 8 high-frequency mutations in Table 3. Theupper triangular reveals the total number of co-mutations for each pair of mutations, the diagonal presentsthe frequency of every single mutation, and the lower triangular shows the ratios of pairwise co-mutationsover single mutations. It is easy to see that the top 8 mutations can be grouped into two essentially dis-connected groups. The first group involves 5 mutations 1059C > T-(T85I), 14408C > T-(P323L), 23403A > G-(D614G), 25563G > T-(Q57H), and 27964C > T-(S24L) that are strongly correlated, though have a wide rangeof frequencies. The other three mutations, 17747C > T-(P504L), 17858A > G-(Y541C), and 28144T > C-(L84S),4 able 3: The statistical values of pairwise co-mutations from the top 8 high-frequency mutations. The values in thediagonal reveal the total number of a specific single mutation in the United States, the values in the upper triangularrepresent the total number of the co-mutations, and the values in the lower triangular present the ratios of pairwiseco-mutations over single mutations. > T 23403A > G 1059C > T 25563G > T 27964C > T 17747C > T 17858A > G 28144T > C14408C > T 5918 5904 4235 4818 705 0 0 423403A > G 1.00/1.00 5912 4230 4818 705 0 0 31059C > T 1.00/0.72 1.00/0.72 4237 4235 703 0 0 425563G > T 1.00/0.82 1.00/0.82 1.00/0.88 4827 703 1 1 427964C > T 1.00/0.12 1.00/0.12 1.00/0.17 1.00/0.15 705 0 0 017747C > T 0/0 0/0 0/0 0/0 0/0 1224 1224 122217858A > G 0/0 0/0 0/0 0/0 0/0 1.00/1.00 1245 124328144T > C 0/0 0/0 0/0 0/0 0/0 0.85/1.00 0.87/1.00 1434occur mostly together and have similar numbers of frequencies.
Figure 2: The blue line plots illustrate the evolution of the top 8 missense mutation ratios computed as the frequency of genomesequences having mutations over the counts of genome sequences at each 10-days period. The red lines represent the evolution of thetotal counts of genome sequences. Bar plot of the gender distributions of the ratio of the number of samples having top 8 missensemutations over the total number of samples having age and/or gender labels. Red bars represent the female ratio and the blue barsrepresent the male ratio in the United States.
Figure 2 plots time evolution trajectories of top 8 missense mutations. The red curves are the totalnumbers of genome samples over time, which become very insufficient after middle May 2020. First, asshown in Table 3, mutations 14408C > T-(P323L) and 23403A > G-(D614G) appear concurrently and thushave an identical trajectory as shown in Figure 2. Note that this pair of mutations exist in essentiallyall of the US infections. Additionally, mutation 1059C > T-(T85I) always occurs together with mutation25563G > T-(Q57H). Therefore, its time evolution trajectory is extremely similar to that of 25563G > T-(Q57H).Both mutations were first detected in Singapore on February 16, 2020. This pair of mutations occur in5bout 70% of US COVID-19 cases. The third pair of mutations, 17747C > T-(P504L) and 17858A > G-(Y541C),first detected and occurred mostly in the US, have an identical evolution trajectory. Suggested by genomesamples, this pair of US-based mutations on the helicase protein appears to have essentially died out inthe US. Unfortunately, because there are very insufficient sequencing in the US now as shown by the redcurve in Figure 2, one can not rule out the existence of these mutations. Mutation 28144T > C-(L84S), thefirst known mutation globally, has had a very unsteady trajectory. However, its trajectory became identicalto those of its co-mutations 17747C > T-(P504L) and 17858A > G-(Y541C) after February 20, 2020. Finally,mutation 27964C > T-(S24L) has an usually behavior. Its peak ratios occurred in early June when there wereinsufficient sequence samples in the US.
The last chart in Figure 2 displays the gender disparity of the top 8 mutations in the US. The overall patternmay correlate with the disparity in male and female immune response and gene editing strengths. In theUS, the total number of mutations in all 1044 female genome samples is 8438 compared with 7779 mutationsrecorded in 982 male samples. After the total number correction, female genome isolates still have about2% more mutations than those of male genome isolates. There is an apparent gender difference in mutation27964C > T-(S24L) on the ORF8 protein. Among 705 samples having this mutation, 204 isolates have genderlabels. The ratio of female samples (167) over male samples (67) is about 2 to 1. We currently do not have agood explanation for such a large difference in the ratio.
In this section, we discuss the properties of the top 8 missense mutations associated with 6 proteins (i.e.,NSP2, NSP12, NSP13, spike protein, ORF3a, and ORF8). Based on the analysis presented above, we re-veal the potential influence of these high-frequency mutations on the infective capacity of SARS-CoV-2in the United States using protein folding stability analysis. Furthermore, to understand the impact onthe protein’s structures induced by mutations, we employ artificial intelligence [11], flexibility-rigidityindex (FRI) [12], and subgraph centrality models [13]. Results of our theoretical analysis are summa-rized in Table 4, which lists the folding stability changes, the FRI rigidity changes, and the average sub-graph centrality changes following the mutations. Here, the folding stability change following mutation ∆∆ G = ∆ G w − ∆ G m measures the difference between the folding free energies of the wild type ∆ G w and the mutant type ∆ G w . More specifically, a positive folding stability change ∆∆ G indicates that themutation will stabilize the structure of the protein and vice versa. The molecular FRI rigidity R η measuresthe topological connectivity and the geometric compactness of the network consisting of C α at each residueand the heavy atoms involved in the mutant. In this work, η determines the range of pairwise interactionsand is set to 8 ˚A. Furthermore, the average subgraph centrality (cid:104) C s (cid:105) is employed to describe the connectionbetween any pair of atoms within a cutoff distance of 8.0 ˚A. The decrease of the average participating rate ofeach edge in the network will cause the increment of the average subgraph centrality (cid:104) C s (cid:105) . The incrementof the (cid:104) C s (cid:105) induced by a mutation is due to the decrease of the number of neighbor edges participatingrate of each edge [13]. In addition, we also list the statistical values of pairwise co-mutations for the top 8high-frequency mutations in Table 3. Mutation 14408C > T-(P323L) on the NSP12 (also called RNA-dependent RNA polymerase, abbreviationRdRp) was first found in the United Kingdom on February 03, 2020. The United States found its first case6 able 4: The protein folding stability changes of 8 missense mutations. The folding stability change ∆∆ G = ∆ G w − ∆ G m , where ∆ G w and ∆ G m are the folding free energies of the wild type and the mutant type, respectively. R w and R m are FRI rigidities forthe wild type and mutant type of the protein with η = 8 ˚A. Here, (cid:104) C ws (cid:105) and (cid:104) C ms (cid:105) are the average subgraph centralities of the wildtype and the mutant type, respectively. ∆ ¯ R and ∆ (cid:104) ¯ C s (cid:105) are the molecular FRI rigidity changes and the average subgraph centralitychange. Rank Mutation Protein ∆∆ G (kcal/mol) R w R m ∆ ¯ R % (cid:104) C ws (cid:105) (cid:104) C ms (cid:105) ∆ (cid:104) ¯ C s (cid:105) %Top 1 14408C > T-(P323L) NSP12(RdRp) -0.11 9.77 9.87 -1.0 1105 1959 -77Top 2 23403A > G-(D614G) S protein 0.34 10.27 10.10 1.7 2376 1386 42Top 3 25563G > T-(Q57H) ORF3a -0.24 11.33 11.66 -1.5 25061 58592 -134Top 4 1059C > T-(T85I) NSP2 -0.05 12.37 12.51 -1.1 89764 166399 -85Top 5 28144T > C-(L84S) ORF8 -0.99 12.28 12.05 1.9 12810 6504 49Top 6 17858A > G-(Y541C) NSP13(Helicase) -0.81 11.52 10.40 9.7 506640 7271 99Top 7 17747C > T-(P504L) NSP13(Helicase) -0.59 7.52 7.54 0.3 4668 6094 -31Top 8 27964C > T-(S24L) ORF8 0.20 11.72 11.66 0.5 11685 29777 -155related to 14408C > T-(P323L) on February 28, 2020. Since then, this mutation has become one of the singledominant mutations in the United States. Among 7823 complete genome sequences, 5918 are connected toP323L. Figure 3 shows the sequence alignment for the NSP12 of SARS-CoV-2, SARS-CoV [15], bat coron-avirus RaTG13 [16], bat coronavirus CoVZC45 [17], and bat coronavirus BM48-31 [18]. The red rectanglemarks the mutant residue with its neighborhoods. One can see that SARS-CoV-2 NSP12 is highly conserva-tive among the other four species. Although P323L mutates the residue of proline (P) to leucine (L), thesetwo residues are both non-polar and aliphatic, indicating P323L may not affect the functionality of NSP12too much. Figure 4 (a) shows the three-dimensional (3D) structure of SARS-CoV-2 NSP12, which is createdby PyMol [19].NSP12 is of paramount importance to the SARS-CoV-2 replication and transcription machinery. It isone of the primary targets for antiviral drugs. Researchers have engaged in developing new antiviral ther-apeutics that target NSP12 for combating COVID-19 in the past few months. NSP12 and its cofactors NSP7and NSP8 can form a hollow cylinder-like supercomplex that interacts with NSP14, an exonuclease that hasthe proofreading capability [20]. The structure of NSP12 contains a nidovirus-specific N-terminal extensiondomain (residues D60 to R249) and a right-hand RdRp domain (S367 to F920), which are connected by aninterface domain (residues A250 to R365) [21]. Mutation P323L locates on this interface domain, which,however, is still poorly characterized.The increasing ratio of P323L in Figure 2 indicates that this type of mutations may favor SARS-CoV-2and enhance the transmission capacity of SARS-CoV-2. However, the negative folding stability changes inTable 4 suggest that P323L destabilizes the NSP12, which may make SARS-CoV-2 less contagious. Figure 4(b) and (c) show the differences of FRI rigidity index and subgraph centrality between the network withwild type and the network with mutant type. The atoms on the mutant residue is mark with big color balls.We deduce that the slight increase in the rigidity means the mutation makes the protein less flexible or lesscooperative in synergistic interactions.Based on the statistical values of co-mutations in Table 3, we can see that 14408C > T-(P323L) alwaysshows up with 1059C > T-(T85I), 23403A > G-(D614G), 25563G > T-(Q57H), and 27964C > T-(S24L) simultane-ously. Therefore, we can deduce that the increasing tendency of P323L ratios per 10-days is due to itsco-mutation with other infectivity-strengthening mutations, such as 23403A > G-(D614G).
Mutation 23403A > G-(D614G) located on the spike protein has the second-highest frequency in the UnitedStates, which has been recently considered as the key mutation that makes SARS-CoV-2 more infectious7 igure 3: Sequence alignments for the NSP12 of SARS-CoV-2, SARS-CoV, bat coronavirus RaTG13, bat coronavirus CoVZC45, batcoronavirus BM48-31. Detailed numbering is given according to SARS-CoV-2. One high-frequency mutation 14408C > T-(P323L) isdetected on the NSP12 protein. Here, the red rectangle marks the P323L mutations with its neighborhoods. worldwide [22]. From Table 2, one can see that mutation D614G was initially detected in China on January24, 2020. However, D614G was not widely spread in China, which is probably due to the strict socialdistancing rules and the Wuhan lockdown implemented by the Chinese government. In the middle ofFebruary, the hot spot with the most COVID-19 cases shifted from China to Europe and the D614G variantrapidly pervades in Europe [7]. The first case with the D614G mutation in the United States was reportedon February 28, 2020. Since then, the D614G mutation has become a majority variant, and 68.7% of patientscarry D614G in the United States as of July 14, 2020.The SARS-CoV-2 spike protein is a multi-functional molecule that interacts with the ACE2, which me-diates the virus entering into the host cells [23]. Figure 6 (a) depicts the 3D structure of the SARS-CoV-2spike protein that interacts with the host ACE2. The D614G mutation is one of the most popular mutationsof SARS-CoV-2, which changes the amino acid aspartate (D) with the polar negative charged side changesto the amino acid glycine (G) with a non-polar side chain. Figure 5 depicts the sequence alignment for theS protein of SARS-CoV-2, SARS-CoV, bat coronavirus RaTG13, bat coronavirus CoVZC45, and bat coron-avirus BM48-31. We use the red rectangle to mark the position of D614G and its neighborhoods. The Sprotein of bat coronavirus RaTG13 has the highest similarity of 97.47% with the S protein of SARS-CoV-2.One can see that amino acids near position 614 are very conservative, indicating that D614G mutation willplay an important role in the functions of the S protein of SARS-CoV-2. Moreover, the D614G mutation ratio8 igure 4: (a) The 3D structure of SARS-CoV-2 NSP12 protein. The mutated residue is marked with color balls. (b) The differencesof FRI rigidity index between the network with wild type and the network with mutant type. (c) The difference of the subgraphcentrality between the network with wild type and the network with mutant type. in Figure 2 keeps climbing, and the ratio is approaching the unity after June 16, 2020, which also provesthat SARS-CoV-2 becomes more contagious as time goes on.Table 4 shows a positive folding free energy change, indicating the stabilization effect of the mutation.Figure 6 (b) and (c) illustrate the difference of FRI rigidity index and the subgraph centrality between thenetwork with wild type and the network with mutant type. The FRI rigidity decreases following the mu-tation, endowing the S protein higher flexibility to interact with ACE2. The same is confirmed by averagesubgraph centrality.
Mutation 25563G > T-(Q57H) is on the ORF3a protein. It was first found in Singapore on February 16, 2020.The case in the United States was detected on February 28, 2020. As of July 14, 2020, 3865 complete genomesequences have the Q57H mutation in the United States, while 6765 isolates carry this type of single muta-tion in the world. The Q57H mutation changes the amino acid glutamine (Q) with a non-charged polar sidechain to the positively charged polar side chain of amino acid histidine (H). Figure 7 shows the sequencealignment results in different species. The ORF3a of SARS-CoV-2 and SARS-CoV share a 71.53% sequencesimilarity. The red rectangle marks the Q57H mutations with its two neighborhoods. Similarly, one can seethat the amino acids nearby position 57 are all conservative in all of 5 species. Moreover, Figure 8 (b) visu-alizes the SARS-CoV-2 ORF3a proteoform, where we use red to mark the wild type amino acid glutamine(Q) and yellow to address the mutant amino acid histidine (H). Spatiotemporally, mutation Q57H locatesat the intramolecular interface and in touch with the membrane, which indicates the special functionalitychanges that Q57H can induce., and Figure 8 (b) is the visualization of ORF3a, which is generated by an9 igure 5: Sequence alignments for the S proteins of SARS-CoV-2, SARS-CoV, bat coronavirus RaTG13, bat coronavirus CoVZC45, batcoronavirus BM48-31. Detailed numbering is given according to SARS-CoV-2 S protein. One high-frequency mutation 23403A > G-(D614G) is detected on the S protein. Here, the red rectangle marks the D614G mutations with its neighborhoods online server Protter [24].The ORF3a gene of SARS-CoV-2 encodes 275 amino acids with TNF receptor-associated factors (TRAFs),ion channel, and caveolin binding domain. Previous studies have shown that SARS-CoV ORF3a contains acysteine-rich motif, a tyrosine-based sorting motif, and a diacidic EXD motif, which are the critical domainsthat participate in the apoptosis and activate the innate immune signaling receptor NLRP3 (NOD-, LRR-,and pyrin domain-containing 3) inflammasome [25, 26]. ORF3a of SARS-CoV-2 shares 71.53% similarity10 igure 6: Illustration of S-protein and ACE2 interaction. The RBD is displayed in green, the ACE2 is given in red, and mutation D614Gis highlighted in red. (b) The difference of FRI rigidity index between the network with wild type and the network with mutant type.(c) The difference of the subgraph centrality between the network with wild type and the network with mutant type. with the ORF3a of SARS-CoV. Similarly, SARS-CoV-2 ORF3a protein is widely expressed in intracellularand plasma membranes, which induces apoptosis and inflammatory responses in the infected cells andtransfected cells [26]. Figure 8 (c) and (d) depict the differences of the FRI rigidity index and the subgraphcentrality between the network with wild type and the network with mutant type. We can see that theORF3a becomes more rigidity after the mutation, which may result in a less flexible mutant for ORF3a toinvolve in the apoptosis and inflammatory response.As illustrated in Figure 2, the ratio of the 25563G > T-(Q57H) mutation on ORF3a in each 10-day periodkept increasing once it was introduced to the United States. This tendency indicates that mutation Q57H be-comes popular in the viral patients of the United States, which may make the SARS-CoV-2 more infectious.From Table 3, we can see that 4827 SNP variants have 25563G > T-(Q57H) mutation on ORF3a. Amongthem, 4818 variants have the [25563G > T-(Q57H), 23403A > G-(D614G)] co-mutations. As we mentionedabove, the S protein plays an important role in viral transmission, indicating that mutation 25563G > T-(Q57H) may also be of great importance for viral transmission. The negative folding stability changes ofmutation Q57H in Table 4 reveals that ORF3a becomes unstable following the Q57H mutation, which mayharm the function of ORF3a in apoptosis and increase the viral load in the host cell. However, the Q57Hmutation locates near TRAF, ion channel, and caveolin binding domain [26], which may affect the NLRP3inflammasome activation. ORF3a activates the innate immune signaling receptor NLRP3 inflammasomevia mechanisms such as ion-redistribution and lysosomal disruption, which causes tissue inflammationduring respiratory illness and the production of inflammatory cytokines [27].11 igure 7: Sequence alignments for the ORF3a protein of SARS-CoV-2, SARS-CoV, bat coronavirus RaTG13, bat coronavirus CoVZC45,bat coronavirus BM48-31. Detailed numbering is given according to SARS-CoV-2. One high-frequency mutation 25563G > T-(Q57H)locates on the ORF3a protein. Here, the red rectangle marks the Q57H position with its neighborhoods.Figure 8: (a) The 3D structure of SARS-CoV-2 ORF3a protein. (b) The visualization of SARS-CoV-2 ORF3a proteoform. The high-frequency mutation 25563G > T-(Q57H) on ORF3a is marked in color. The red color represents the wild type and the yellow representsthe wild type. (c) The difference of FRI rigidity index between the network with wild type and the network with mutant type. (d) Thedifference of the subgraph centrality between the network with wild type and the network with mutant type.
Mutation 1059C > T-(T85I) was first detected in Singapore on February 16, 2020. On February 29, 2020, thefirst SNP variants with mutation 1059C > T-(T85I) appeared in the United States. As of July 14, 2020, morethan half of mutation1059C > T-(T85I) counts found worldwide are from the United States. The residue1285 on the NSP2 is polar and non-charged, and it changes to a non-polar residue I85 after the mutation.Figure 10 (a) shows the 3D structure of SARS-CoV-2 NSP2. From Figure 9, we can see that coronaviralNSP2 is relatively conservative for the first 91 residues. Moreover, the T85 residue with its neighbors is allconservative in the other four SARS-like sequences, indicating this type of mutations may be substantial tocoronaviral structures and properties.NSP2 is also a viral protein that does not attract much research attention. In the SARS-CoV genome,the deletion of NSP2 will only result in a modest reduction in viral titers, which is considered to be adispensable protein [28]. Table 4 shows that the folding stability change of T85I is -0.05 kcal/mol. Althoughthe negative value reveals that T85I may destabilize the structure of NSP2, this small change is negligible.The FRI rigidity change is also minor, as is shown in Figure 4 (c), indicating the mutation of T85I on theNSP2 does not change the flexibility of NSP2 too much. However, the growing trend in Figure 2 stillindicates that 1059C > T-(T85I) is an infectivity-strengthening mutation, which mainly benefits from theco-mutation with other infectivity-strengthening mutations, such as 23403A > G-(DD614) and 25563G > T-(Q57H).
Figure 9: Sequence alignments for the NSP2 of SARS-CoV-2, SARS-CoV, bat coronavirus RaTG13, bat coronavirus CoVZC45, batcoronavirus BM48-31. Detailed numbering is given according to SARS-CoV-2. One high-frequency mutation 1059C > T-(T85I) locateson the NSP2 protein. Here, the red rectangle marks the T85I position with its neighborhoods.
NSP13 of SARS-CoV-2 is a superfamily 1 helicase, which can unwind a double-stranded RNA (dsRNA) orDNA (dsDNA) in the 5’ to 3’ direction into two single-stranded nucleic acids. Moreover, NSP13 can hy-drolyze the deoxyribonucleotide and ribonucleotide triphosphates, which are involved in the ATP-couplingprocess [29, 30]. Furthermore, NPS13 is recruited to the double-membranes vesicles (DMVs) which aremainly found in the cell’s perinuclear area during viral infection, indicating that NSP13 is crucial to theviral infection and replication [31]. As illustrated in Figure 11, NSP13 of SARS-CoV-2 shares the mosthomology with the other 4 species and is one of the most conservative proteins in SARS-CoV-2 genome.13 igure 10: (a) The 3D structure of SARS-CoV-2 NSP2 protein. The mutant residue is marked with color balls. (b) The difference of FRIrigidity index between the network with wild type and the network with mutant type. (c) The difference of the subgraph centralitybetween the network with wild type and the network with mutant type.
Therefore, the existence of two high-frequency mutations on the NSP13 is very unusual.Similar to 27964C > T-(S24L), although 17858A > G-(Y541C), 17747C > T-(P504L) are in the final list in Ta-ble 2, more than 87% of them were detected in the United States. Mutation Y541C changes the amino acidtyrosine (Y) to cysteine (C). Figure 12 shows the 3D structure of SARS-CoV-2 NSP13, where the mutantresidue is marked with color balls. It is worthwhile to note that tyrosine has an aromatic side chain, whilecysteine only has a polar non-charged side chain, indicating that the 3D structure of NSP13 will be incred-ibly affected. Moreover, the other single mutation at the residue position 504 will not change the structureof NSP13 very much since both the mutant and wild-type residues are non-polar aliphatic residues.From Figure 2, one can note that both mutations on the NSP13 have the same trajectory of the mutationratios over time. Once mutations 17858A > G-(Y541C) and 17747C > T-(P504L) were first found in the UnitedStates, they had a rapid increase in the first two weeks. However, these two mutations do not benefitSARS-CoV-2. In early March, the ratio of both mutations start to decrease and approach zero after May 19,2020, suggesting that mutations 17858A > G-(Y541C) and 17747C > T-(P504L) may hinder the transmissionof SARS-CoV-2. It is interesting to note that in Table 3, 17858A > G-(Y541C) and 17747C > T-(P504L) nevershow up with 23403A > G-(D614G) in more than a thousand SNP variant, which provides a side evidencethat these two mutations may inhibit the contagiousness of SARS-CoV-2.Table 4 shows that both high-frequency mutations Y541C and P504L have negative folding stabilitychanges, which will destabilize the structure of NSP13. As stated earlier, mutations 17858A > G-(Y541C) and17747C > T-(P504L) happen simultaneously after analyzing 24715 genome sequences, which means the fold-ing stability changes on the NSP13 are superimposed by two simultaneously occurred mutations. This alsoexplains the same decreasing tendency in Figure 2 after early March. Based on the protein-specific analysismentioned above, we can deduce that mutations Y541C and P504L prevent SARS-CoV-2 from efficientlyinteracting with host interferon signaling molecules and impede the NSP13 from efficacious participation14n the replication/transcription process. Figure 4 (b) shows the difference of the FRI rigidity index betweenthe network with wild type and the network with the mutant type. One mutation (17747C > T-(P504L)) doesnot affect the rigidity much whereas the other mutation (17858A > G-(Y541C)) leads to a significant decreasein the NSP12 rigidity, which may make NSP13 not as robust as before to involve in the viral infection andreplication process.
Figure 11: Sequence alignments for the NSP13 protein of SARS-CoV-2, SARS-CoV, bat coronavirus RaTG13, bat coronavirus CoVZC45,bat coronavirus BM48-31. Detailed numbering is given according to SARS-CoV-2. Two high-frequency mutations 17858A > G-(Y541C)and 17747C > T-(P504L) locate on NSP13. Here, the red rectangles mark the Y541C and P504: mutations with their neighborhoods.
The ORF8 protein also has two high-frequency mutations, 28144T > C-(L84S) and 27964C > T-(S24L). No-tably, although 27964C > T-(S24L) has the lowest frequency in the top 8 missense mutations, more than94.1% mutation 27964C > T-(S24L) worldwide were found in the United States. Moreover, the first con-firmed case with 27964C > T-(S24L) was discovered on February 20, 2020, in the United States, suggestingthat S24L initially happened in the US. Another mutation on ORF8 is 28144T > C-(L84S), the number of se-quences with the L84S mutation in the United States is 1437, which accounts for more than 50% proportionof the isolates in the world. It is interesting to address that these two high-frequency mutations S24L andL84S mutate reversibly. The amino acid serine (S) has a non-charged polar side chain, while the leucine(L) has a non-polar aliphatic residue. Figure 13 illustrates the sequence alignment of SARS-CoV-2 ORF8with the other 4 species. The SARS-CoV-2 ORF8 shares a really low similarity among all the other fourSARS-like species. SARS-CoV, Bat coronavirus RaTG13, Bat coronavirus CoVZC45, and Bat coronavirusBM48-31 have the same residues at positions 24 and 84. Nonetheless, SARS-CoV-2 ORF8 owns differenttypes of residues. Here, we would like to address that although the ORF8 of SARS-CoV-2 at position 84 hasa different residue compared to the other 4 species, it mutates back to S in 1434 isolates in the United States.15 igure 12: (a) The 3D structure of SARS-CoV-2 NSP13 protein. The mutant residue is marked with color balls. (b) The difference of FRIrigidity index between the network with wild type and the network with mutant type. (c) The difference of the subgraph centralitybetween the network with wild type and the network with mutant type.
Figure 14 (a) shows the 3D structure of SARS-CoV-2 ORF8. ORF8 protein of SARS-CoV-2 shares theleast homology with SARS-CoV among all viral proteins, which mediates the immune evasion by inter-acting with major histocompatibility complex molecules class I (MCH-I) and down-regulating the surfaceexpression of MHC-I on various cells [32]. Once the “outer” invaders or antigens (i.e., viruses) enter thehost cell, the MCH-I will bind to them and bring them to the surface of the infected cell. Then, the T-cellreceptors (TCRs) that are expressed by cytotoxic T cells (TCLs) can recognize this unique signal presentedby MHC-I-peptide complex and directly eradicate the virus-infected cells, which is an effective and effi-cient anti-viral immune response [33]. However, in ORF8-expressing cells, the MHC-I protein becomes themajor target for the lysosomal degradation mediated by bafilomycin A1 (Baf-A1), resulting in the limitedexpression of MHC-I. Therefore, ORF8 protein has an effect on downregulating MHC-I, which will favorSARS-CoV-2 to evade immune surveillance by hindering the presentation of antigen regulated by MHC-I [32].In Figure 2, the ratio of S24L per 10 days is slightly going up before June 2, 2020. In the next ten days’period, the sharp increase and sharp drop in the ratio are due to the limited complete genome sequencessubmitted to GISAID in June. However, the overall upward trend of the S24L ratio over time reveals thatS24L may enhance SARS-CoV-2’s ability to spread. In contrast, the time evolution plot shows that the ratioof mutation 28144T > C-(L84S) goes up before the beginning of March, and then the ratio goes down andapproach zero after May 23, 2020. Due to the small number of sequence data, we can say that the ratio ofL84S has a decreasing tendency.As discussed earlier, the female patients with S24L mutation on ORF8 account for a large proportion,which indicates that the S24L is most likely to happen in the female population in the United States.Table 4 shows that the folding stability change of 28144T > C-(L84S) is -0.99 kcal/mol, indicating thatORF8 becomes unstable. Figure 14 (b) shows that the ORF8 becomes slightly less rigidity after both L84Sand S24L mutations. To be noted, the rigidity changes induced by S24L is less than the L84S. Based on thefunction of ORF8 that involved in the immune response, we deduce that L84S may disfavor SARS-CoV-216nd favor the host immune surveillance to decrease the viral load in the human cells, which provides anexplanation that the ratio of L84S in Figure 2 keeps decreasing. Meanwhile, the positive folding stabilitychange of 27964C > T-(S24L) lists in Table 4 reveals that this type of mutation may enhance the functionof ORF8. Therefore, the MHC-I will be inhibited more, and the eradication of SARS-CoV-2 in vivo willbe hindered. This explains the increasing trend in the ratio of S24L. Notably, after analyzing 28726 com-plete genome sequences, none of them have mutations 28144T > C-(L84S) and 27964C > T-(S24L) happenedsimultaneously.
Figure 13: Sequence alignments for the ORF8 protein of SARS-CoV-2, SARS-CoV, bat coronavirus RaTG13, bat coronavirus CoVZC45,bat coronavirus BM48-31. Detailed numbering is given according to SARS-CoV-2. Two high-frequency mutations 28144T > C-(L84S)and 27964C > T-(S24L) locate on the ORF8. Here, the red rectangles mark the S24L and L84S mutations with their neighborhoods.Figure 14: (a) The 3D structure of SARS-CoV-2 ORF8 protein. The mutant residue is marked with color balls. (b) The difference of FRIrigidity index between the network with wild type and the network with mutant type. (c) The difference of the subgraph centralitybetween the network with wild type and the network with mutant type.
As mentioned above, SARS-CoV-2 enters the host cell by the interaction of the S protein and ACE2. Theviral S protein is primed by TMPRSS2 to entail its cleavage at two potential sites, Arg685/Ser686 andArg815/Ser816 [14]. Based on the 7823 SNP variants’ information in the United States, we found 264non-degenerate single mutations on the spike protein. Among them, 32 single mutations are detected onthe receptor-binding domain (RBD). Moreover, 7 single mutations occurred on the receptor-binding motif17RBM), the region that directly connects with the ACE2. In this section, we separate 7823 SNP variants inthe United States into four clusters and calculate the mutation-induced binding affinity changes of S pro-tein RBD and ACE2 in each cluster, which will help us understand the potential transmission tendencyinduced by the mutations on the S protein RBD. The binding affinity change induced by single mutation ∆∆ G = ∆ G W − ∆ G M is defined as the subtraction of the binding affinity of the mutant type ( ∆ G M ) fromthe binding affinity of the wild type ( ∆ G W ). Furthermore, the positive binding affinity change of a singlemutation means that the mutation can enhance the binding affinity of the S protein RBD and ACE2 andmake SARS-CoV-2 more infectious.Figure 15 illustrates overall binding affinity changes ∆∆ G (kcal/mol) induced by 32 single mutationson SARS-CoV-2 S protein RBD. The color bar on the left-hand side of the figure represents the mutationfrequency. We can see that 50% single mutations have positive binding changes (16 out of 32). Moreover,the frequency of mutations with positive binding affinity changes is higher than those with negative bind-ing affinity changes, suggesting that SARS-CoV-2 is more likely to be infectious. Notably, the mutation23010T > C-(V483A) has the highest frequency (30) localized on the RBM has the positive binding affinitychange, which indicates that V483A is prevalent in COVID-19 patients’ in the United States has a potentialcapacity to enhance the infectivity of SARS-CoV-2. However, mutations that locate away from the RBMwill also have a crucial impact on the infectivity [34]. Although away from the RBM, the relatively high fre-quency and positive binding affinity changes of N354K, R403K, and G467S indicate more attention shouldbe paid to them in the future. Additionally, an interesting finding is that the mutations occurred at the sameresidue position such as A348S and A348T, P384L and P384S, and Q414E and Q414P always have similarbinding affinity changes. A S T S R K A S A T N K S N V F V L P L P S T AV I R K D V Q P Q E I V L I R KK Q P S A V G S T A P L V A F L Q L A S P R A S B i n d i n g a ff i n i t y c h a n g e s ( k c a l / m o l ) F r e q u e n c y RBM
Figure 15: Overall binding affinity changes ∆∆ G (kcal/mol) on the receptor-binding domain (RBD). The blue color region marksthe binding affinity changes on the receptor-binding motif (RBM). The height of each bar indicates the predicted ∆∆ G . The colorindicates the occurrence frequency in the GISAID genome dataset. Figure 16 illustrates the time evolution trajectories of 274 single mutations on SARS-CoV-2 S proteinRBD. The red line shows the mutations that have positive binding affinity changes and the blue lines repre-sent the mutations that have negative binding affinity changes. Here, we hypothesize that single mutationson S protein RBD with positive binding affinity changes will enhance the viral transmission since naturalselection favors them. One can see that the red lines gradually outpace the blue lines as time progresses,suggesting that our hypothesis is correct.Additionally, green lines indicate the evolutions of mutations that locate away from the RBD. The muta-tion that has the highest frequency is D614G, which was reported to enhance SARS-CoV-2 infectivity [35,36].18he trajectories of the other two high frequency S protein mutations (Q675R and E583D) indicate that theyare co-mutations with infectivity-enhancing S protein mutations, such as D614G. We found that the otherhigh frequency S protein mutation L5F is independent of mutation D614G.Based on the genotyping results, we separate 7823 SNP variants from the United States into four clusters.Table 1 shows the mutation distribution of four clusters with number of samples ( N NS ) and total singlemutation counts ( N TF ) in 20 states. Accordingly, a more specific analysis of the relationship between thesign of the binding affinity changes and the transmission ability is discussed below. Figure 16: The time evolution of 264 SARS-CoV-2 S protein mutations. The red lines represent the RBD mutations that strengthenthe infectivity of SARS-CoV-2 (i.e., ∆∆ G is positive), the blue lines represent the RBD mutations that weaken the infectivity of SARS-CoV-2 (i.e., ∆∆ G is negative), and the green lines are for S protein mutations that away from the RBD. The mutation with the highestfrequency is D614G. Figure 17 depicts the binding affinity changes of mutations in Cluster A. Total seven single mutations arefound in Cluster A. Four of them have the positive binding affinity changes, while the other three muta-tions induced the negative binding affinity changes. Therefore, the mutations in Cluster B will strengthenthe infectivity of SARS-CoV-2 in general. The V483A mutation is localized on the RBM with the high-est frequency, indicating that V483A may favor SARS-CoV-2 by natural selection and cause SARS-CoV-2more infectious. Although A348T and N354K have relatively low frequencies due to the limited number ofgenome samples, their high binding affinity changes lead to a more contagious SARS-CoV-2 substrain. Itis worth noting that from Table 1, mutations in Cluster A are involved in all of the 20 states except for LA.However, DC, MA, and TX each only have one SARS-CoV-2 isolate related to Cluster A. Therefore, LA, DC,MA, and TX are not contributed to the infectivity-strengthening mutations in Cluster A.
Figure 18 describes the binding affinity changes of mutations in Cluster B. There are thirteen single muta-tions on the S protein RBD and six single mutations on the RBM. Although the number of single mutationswith positive binding affinity changes is less than those with negative binding affinity changes, the highfrequency of N354K slightly enhances the infectivity of SARS-CoV-2. We can notice that all of the states in19 A S A S A T N K G S V A A S B i n d i n g a ff i n i t y c h a n g e s ( k c a l / m o l ) A348T051015202530 F r e q u e n c y RBM
A344S
Figure 17: Cluster A. Left: binding affinity changes ∆∆ G (kcal/mol) induced by mutations in Cluster V. Right: mutations on theSARS-CoV-2 S protein RBD. the Table 1 are associated with Cluster B. Additionally, a large proportion of single mutations in Cluster Bare in CA, LA, MN, MI, NY, and WA. N354K S359NP384LQ414EG476S P479L A520SA522S N K S N P L P S Q E R K P S G S T A P L F L A S A S B i n d i n g a ff i n i t y c h a n g e s ( k c a l / m o l ) P463SR457KF490L T478A0123456 F r e q u e n c y RBM
P384S
Figure 18: Cluster B. Left: binding affinity changes ∆∆ G (kcal/mol) induced by mutations in Cluster B. Right: mutations on theSARS-CoV-2 S protein RBD. Figure 19 describes the binding affinity changes in Cluster C. This is the only cluster that has more infectivity-weakening RBD mutations. To be noted, A475V on the RBM has a negative binding affinity change with arelatively high frequency compared to the other 3 mutations in Cluster C. However, only four mutations aredetected in the SNP variants that belong to Cluster C, indicating that Cluster C has a limited proportion ofthe infected patients in the United States. Our conjecture is confirmed by Table 1. Although 20 states are in-20olved in Cluster C except for NM, fewer samples related to the mutations in Cluster C are found comparedto the other three clusters. This means the infectivity-weakening mutations do not rapidly spread acrossthe United States. However, CA and WI have slightly more significant number of samples, indicating thatthe recent situation in these states is better than the other states.
A475VQ414E V395IV367F V F V I Q E A V B i n d i n g a ff i n i t y c h a n g e s ( k c a l / m o l ) F r e q u e n c y RBM
Figure 19: Cluster C. Left: binding affinity changes ∆∆ G (kcal/mol) induced by mutations in Cluster C. Right: mutations on theSARS-CoV-2 S protein RBD. The binding affinity changes of RDB mutations in Cluster D are illustrated in Figure 20. Eighteen differentsingle mutations are classified in Cluster D. Among them, nine mutations have positive binding affinitychanges and relatively higher frequencies, showing that overall the mutations in Cluster D can enhancethe transmission capacity of SARS-CoV-2. From Table 1, we can see that R346K, A348T, and N354K haverelatively high binding affinity changes. In addition, R403K has the highest frequency among the nineinfectivity-strengthen mutations. From Table 1, all of these 20 states have a large proportion in Cluster D,suggesting that the infectivity-strengthen mutations are widely spreading in the United States, especiallyin CA, MN, NY, WA, and WI.Finally, since the infectivity-strengthening D614G mutation is associated with all clusters and essentiallyall the US genome isolates, it may be quite reasonable to say all of the US SARS-CoV-2 substrains becomemore infectious compared with the original genome collected on December 24, 2019 in China.
On January 5, 2020, the complete genome sequence of SARS-CoV-2 was first released on the GenBank(Access number: NC 045512.2) submitted Zhang’s group at Fudan University [4]. Since then, there has beena rapid accumulation of SARS-CoV-2 genome sequences. In this work, 24,715 complete genome sequenceswith high coverage of SARS-CoV-2 strains from the infected individuals in the world were downloaded21 T S R K A S A T N K V L P L T A R K D V Q E Q P I V L I K QQ L A S P R B i n d i n g a ff i n i t y c h a n g e s ( k c a l / m o l ) K458QL441I I418VQ414EQ414P V382L0123456789 F r e q u e n c y RBM
Figure 20: Cluster D. Left: binding affinity changes ∆∆ G (kcal/mol) induced by mutations in Cluster D. Right: mutations on theSARS-CoV-2 S protein RBD. Single nucleotide polymorphism (SNP) genotyping measures the genetic variations between different mem-bers of a species. Establishing the SNP genotyping method to the investigation of the genotype changesduring the transmission and evolution of SARS-CoV-2 is of great importance [9, 41]. By analyzing the rear-ranged genome sequences, SNP profiles, which record all of the SNP positions in teams of the nucleotidechanges and their corresponding positions, can be constructed. The SNP profiles of a given SARS-CoV-2genome isolated from a COVID-19 patient capture all the differences from a complete reference genomesequence and can be considered as the genotype of the individual SARS-CoV-2.
In this work, we use the Jaccard distance to measure the similarity between SNP variants and compare thedifference between the SNP variant profiles of SARS-CoV-2 genomes.The Jaccard similarity coefficient is defined as the intersection size divided by the union of two sets A and B [8]: J ( A, B ) = | A ∩ B || A ∪ B | = | A ∩ B || A | + | B | − | A ∩ B | . (1)22he Jaccard distance of two sets A and B is scored as the difference between one and the Jaccard similaritycoefficient and is a metric on the collection of all finite sets: d J ( A, B ) = 1 − J ( A, B ) = | A ∪ B | − | A ∩ B || A ∪ B | . (2)Therefore, the genetic distance of two genomes corresponds to the Jaccard distance of their SNP variants.In principle, the Jaccard distance of SNP variants takes account of the ordering of SNP positions, i.e.,transmission trajectory, when an appropriate reference sample is selected. However, one may fail to identifythe infection pathways from the mutual Jaccard distances of multiple samples. In this case, the dates of thesample collection provide key information. Additionally, clustering techniques, such as k -means, UMAP,and t-distributed stochastic neighbor embedding (t-SNE), enable us to characterize the spread of COVID-19onto the communities. K -means clustering K -means clustering aims at partitioning a given data set X = { x , x , · · · , x n , · · · , x N } , x n ∈ R d into k clusters { C , C , · · · , C k } , k ≤ N such that the specific clustering criteria are optimized. The standard K -means clustering algorithm picks k points as cluster centers randomly at beginning and separates each datato its nearest cluster. Here, k cluster centers will be updated subsequently by minimizing the within-clustersum of squares (WCSS): k (cid:88) i =1 (cid:88) x i ∈ C k (cid:107) x i − µ k (cid:107) , (3)where µ k is the mean of points locating in the k -th cluster C k and n k is the number of points in C k . Here, (cid:107) · (cid:107) denotes the L distance.The aforementioned algorithm offers an optimal partition of k clusters. However, it is more importantto find the best number of clusters for the given set of SNP variants. Therefore, the Elbow method isemployed. A set of WCSS can be calculated in the k -means clustering process by varying the number ofclusters k , and then plot WCSS according to the number of clusters. The optimal number of clusters willbe the elbow in this plot. The WCSS measures the variability of the points within each cluster which isinfluenced by the number of points N . Therefore, as the number of total points of N increases, the value ofWCSS becomes larger. Additionally, the performance of k -means clustering depends on the selection of thespecific distance metric.In this work, we implement k -means clustering with the Elbow method for analyzing the optimal num-ber of the subtypes of SARS-CoV-2 SNP variants. The Jaccard distance-based representation is consideredas the input features for the k -means clustering method. If we have a total of N SNP variants concerning areference genome in a SARS-CoV-2 sample, the location of the mutation sites for each SNP variant will besaved in the set S i , i = 1 , , · · · , N . The Jaccard distance between two different sets (or samples) S i , S j isdenoted as d J ( S i , S j ) . Therefore, the N × N Jaccard distance-based representation is D J ( i, j ) = d J ( S i , S j ) . (4)This representation is used in our k -means clustering.23 .5 Topology-based machine learning prediction of protein-protein binding affinitychanges following mutations The topology-based network tree (TopNetTree) model was developed by an innovative integration betweenthe topological representation and network tree (NetTree) to predict the binding affinity changes of protein-protein interaction (PPI) following mutation ∆∆ G [42]. The TopNetTree is applied to predict the bindingaffinity changes upon mutations that occurred on the RBD of SARS-CoV-2. Algebraic topology [10] isutilized to simplify the structural complexity of protein-protein complexes and embed vital biological in-formation into topological invariants. NetTree integrates the advantages of convolutional neural networks(CNN) and gradient-boosting trees (GBT), such that CNN is treated as an intermediate model that con-verts vectorized element- and site-specific persistent homology features into a higher-level abstract feature,and GBT uses the upstream features and other biochemistry features for prediction. The performance testof tenfold cross-validation on the dataset (SKEMPI 2.0 [43]) carried out using gradient boosted regressiontrees (GBRTs). The errors with the SKEMPI2.0 dataset are 0.85 in terms of Pearson correlations coefficient( R p ) and 1.11 kcal/mol in terms of the root mean square error (RMSE) [42]. In this work, the prediction of protein folding stability changes upon mutation is carried out using atopology-based mutation predictor (TML-MP) (https://weilab.math.msu.edu/TML/TML-MP/) which wasintroduced in literature [11]. The essential biological information is revealed by persistent homology [10].The machine learning features are generated by the element-specific persistent homology and biochemistrytechniques. The dataset includes 2648 mutations cases in 131 proteins provided by Dehouck et al [44] andis trained by a gradient boosted regression trees (GBRTs). The error with the corresponding dataset is givenas Pearson correlations coefficient ( R p ) of 0.79 and root mean square error (RMSE) of 0.91 kcal/mol fromprevious work [11].The persistent homology is widely applied in a variety of practical feature generation problems [10]. It isalso successful in the implementation of predictions of protein folding stability changes upon mutation [11].The key idea in TML-MP is using the element-specific persistent homology (ESPH) which distinguishes dif-ferent element types of biomolecules when building persistent homology barcodes. Commonly occurringprotein element types include C, N, O, S, and H, where hydrogen and sulfur are excluded according tothat hydrogen atoms are often absent from PDB data and sulfur atoms are too few in most proteins to bestatistically important. Thus, C, N, and O elements are considered on the ESPH in protein characteriza-tion. Features are extracted from the different dimensions of persistent homology barcodes by dividingbarcodes into several equally spaced bins which is called binned barcode representation. The auxiliaryfeatures, such as geometry, electrostatics, amino acid type composition, and amino acid sequence, are in-cluded for machine learning training as well. In TML-MP, gradient boosted regression trees (GBRTs) [45]are employed to train the dataset according to the size of the training dataset, absence of model overfitting,non-normalization of features, and ability of nonlinear properties [11]. Graph networks can model interactions and their strength between pairs of units in molecules. Theseapproaches are employed to understand mutation-induced structural changes. The biological and chemicalproperties are measured by comparing descriptors on different networks. In this work, the network consistsof a set S of C α atoms from every residue of protein structure except the target mutation residue such that24 C α atom is included if it is within 16 ˚A to any atom of the target mutation. The total atom set T is definedas the atoms (C, N, and O) of the target residue and C α atoms of the network set S . Moreover, two verticesare connected in the network if their distance is less than 8 ˚A. Thus the adjacency matrix A can be definedas well where A is a matrix containing 0 and 1 such that A ( i, j ) = 0 if i -th and j -th atoms are disconnectedand A ( i, j ) = 1 if i -th and j -th atoms are connected. Two graph network models employed in this work aredescribed below. Flexibility-rigidity index (FRI)
FRI was introduced to study the flexibility of protein molecules [12, 46].The single residue molecular rigidity index measures its influence on the set S which is given as R αη = N S (cid:88) i =1 N T (cid:88) j =1 e − (cid:0) (cid:107) r i − r j (cid:107) η (cid:1) , (5)where α = w or m stands for the wild type w or the mutant type m , N S is the number of C α atoms of theset S , and N T is the number of atoms in total atom set T . Here, (cid:107) r i − r j (cid:107) is the distance between atoms at r i and r j . Average subgraph centrality
Average subgraph centrality is built on the exponential of the adjacency matrix, E = e A , where A is theaforementioned adjacency matrix. The subgraph centrality is the summation of weighted closed walks ofall lengths starting and ending at the same node [13, 47]. Thus the average subgraph centrality reveals theaverage of participating rate of each vertex in all subgraph and the network motif, which is given as (cid:104) C αs (cid:105) = 1 N S N (cid:88) i =1 ,i/ ∈ I T E ( i, i ) , (6)where I T is the index set of the mutation residue. The prevalence of coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coro-navirus 2 (SARS-CoV-2) in the United States (US) has led to over 4 million infection cases and 145 thousandfatalities as July 22, 2020. Understanding the US SARS-CoV-2 characteristics is of paramount importance tothe control of COVID-19 and the reopening of the world’s largest economy. We genotype the SARS-CoV-2genome isolates collected since January 2020 and analyze their cluster distributions, co-mutation patterns,and time evolution traits. We show that the US SARS-CoV-2 has evolved into four substrains. We unveil thatthe top eight US SARS-CoV-2 mutations are split into two essentially disconnected groups. The first grouphas 5 concurrent mutations that have prevailed over time, while the other group involving three concur-rent mutations is fading out. While five of the top eight US SARS-CoV-2 mutations were initially detectedelsewhere namely, China (2), Singapore (2) and the United Kingdom (1), three of them were homegrown. Awide variety of protein-specific analyses, including sequence alignment, folding stability changes follow-ing mutations modeled by persistent homology and machine learning, molecular flexibility-rigidity index,and averaged subgraph centrality, are employed to investigate the mutation-induced SARS-CoV-2 propertychanges. We reveal that the first group is associated with stability enhancing or infectivity strengtheningmutations, while the second group involves destabilizing mutations.We demonstrate that overall, genome samples isolated from female patients show more mutations thanthose isolated from males. This observation may be due to different strengths of immune response amongfemales and males after the viral infection. We found that one of the top mutations, 27964C > T- (S24L onORF8), has an unusually strong gender dependence.25inally, we employ algebraic topology and machine learning to predict spike (S) protein and angiotensin-converting enzyme 2 (ACE2) binding affinity changes after mutations. We show that mutations that strengthenthe S protein and ACE2 binding prevail over time. Additionally, three out of four US SARS-CoV-2 sub-strains have become significantly more infectious. Our results indicate the need for strict SARS-CoV-2control and containment strategies in the US.
Supporting Information
The supporting information is available forS1: Supplementary Figures for k -means clustering and proteoformsS2: Supplementary Tables for the SNP profiles in the United States and the acknowledgment of GISAIDdata contributors. Acknowledgment
This work was supported in part by NIH grant GM126189, NSF Grants DMS-1721024, DMS-1761320, andIIS1900473, Michigan Economic Development Corporation, George Mason University award PD45722,Bristol-Myers Squibb, and Pfizer. The authors thank The IBM TJ Watson Research Center, The COVID-19 High Performance Computing Consortium, NVIDIA, and MSU HPCC for computational assistance.
References [1] WHO. Coronavirus disease 2019 (COVID-19) situation report 172.
Coronavirus Disease (COVID-2019)Situation Reports , 2020.[2] Marion Sevajol, Lorenzo Subissi, Etienne Decroly, Bruno Canard, and Isabelle Imbert. Insights intoRNA synthesis, capping, and proofreading mechanisms of SARS-coronavirus.
Virus Research , 194:90–99, 2014.[3] Franc¸ois Ferron, Lorenzo Subissi, Ana Theresa Silveira De Morais, Nhung Thi Tuyet Le, Marion Seva-jol, Laure Gluais, Etienne Decroly, Clemens Vonrhein, G´erard Bricogne, Bruno Canard, et al. Structuraland molecular basis of mismatch correction and ribavirin excision from coronavirus RNA.
Proceedingsof the National Academy of Sciences , 115(2):E162–E171, 2018.[4] Fan Wu, Su Zhao, Bin Yu, Yan-Mei Chen, Wen Wang, Zhi-Gang Song, Yi Hu, Zhao-Wu Tao, Jun-HuaTian, Yuan-Yuan Pei, et al. A new coronavirus associated with human respiratory disease in China.
Nature , 579(7798):265–269, 2020.[5] Rui Wang, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Decoding SARS-CoV-2 Transmissionand Evolution and Ramifications for COVID-19 Diagnosis, Vaccine, and Medicine.
Journal of ChemicalInformation and Modeling , 2020. PMID: 32530284.[6] Rafael Sanju´an and Pilar Domingo-Calap. Mechanisms of viral mutation.
Cellular and Molecular LifeSciences , 73(23):4433–4448, 2016.[7] Nathan D Grubaugh, William P Hanage, and Angela L Rasmussen. Making sense of mutation: whatD614G means for the COVID-19 pandemic remains unclear.
Cell , 2020.268] Michael Levandowsky and David Winter. Distance between sets.
Nature , 234(5323):34–35, 1971.[9] Changchuan Yin. Genotyping coronavirus SARS-CoV-2: methods and implications.
Genomics , 2020.[10] Gunnar Carlsson. Topology and data.
Bulletin of the American Mathematical Society , 46(2):255–308, 2009.[11] Zixuan Cang and Guo-Wei Wei. Analysis and prediction of protein folding energy changes uponmutation by element specific persistent homology.
Bioinformatics , 33(22):3549–3557, 2017.[12] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and multidomain models-Flexibility and rigidity.
The Journal of Chemical Physics , 139(19):11B614 1, 2013.[13] Ernesto Estrada. Topological analysis of SARS-CoV-2 main protease.
Chaos: An Interdisciplinary Journalof Nonlinear Science , 30(6):061102, 2020.[14] Markus Hoffmann, Hannah Kleine-Weber, Simon Schroeder, Nadine Kr ¨uger, Tanja Herrler, SandraErichsen, Tobias S Schiergens, Georg Herrler, Nai-Huei Wu, Andreas Nitsche, et al. SARS-CoV-2 cellentry depends on ACE2 and TMPRSS2 and is blocked by a clinically proven protease inhibitor.
Cell ,2020.[15] Nelson Lee, David Hui, Alan Wu, Paul Chan, Peter Cameron, Gavin M Joynt, Anil Ahuja, Man YeeYung, CB Leung, KF To, et al. A major outbreak of severe acute respiratory syndrome in Hong Kong.
New England Journal of Medicine , 348(20):1986–1994, 2003.[16] Peng Zhou, Xing-Lou Yang, Xian-Guang Wang, Ben Hu, Lei Zhang, Wei Zhang, Hao-Rui Si, Yan Zhu,Bei Li, Chao-Lin Huang, et al. A pneumonia outbreak associated with a new coronavirus of probablebat origin.
Nature , 579(7798):270–273, 2020.[17] Dan Hu, Changqiang Zhu, Lele Ai, Ting He, Yi Wang, Fuqiang Ye, Lu Yang, Chenxi Ding, XuhuiZhu, Ruicheng Lv, et al. Genomic characterization and infectivity of a novel SARS-like coronavirus inChinese bats.
Emerging microbes & infections , 7(1):1–10, 2018.[18] Jan Felix Drexler, Florian Gloza-Rausch, J ¨org Glende, Victor Max Corman, Doreen Muth, MatthiasGoettsche, Antje Seebens, Matthias Niedrig, Susanne Pfefferle, Stoian Yordanov, et al. Genomic char-acterization of severe acute respiratory syndrome-related coronavirus in European bats and classifica-tion of coronaviruses based on partial RNA-dependent RNA polymerase gene sequences.
Journal ofvirology , 84(21):11336–11349, 2010.[19] Warren L DeLano et al. Pymol: An open-source molecular graphics tool.
CCP4 Newsletter on ProteinCrystallography , 40(1):82–92, 2002.[20] Maria Pachetti, Bruna Marini, Francesca Benedetti, Fabiola Giudici, Elisabetta Mauro, Paola Storici,Claudio Masciovecchio, Silvia Angeletti, Massimo Ciccozzi, Robert C Gallo, et al. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant.
Journal of Trans-lational Medicine , 18:1–9, 2020.[21] Yan Gao, Liming Yan, Yucen Huang, Fengjiang Liu, Yao Zhao, Lin Cao, Tao Wang, Qianqian Sun,Zhenhua Ming, Lianqi Zhang, et al. Structure of the RNA-dependent RNA polymerase from COVID-19 virus.
Science , 368(6492):779–782, 2020.[22] B Korber, WM Fischer, S Gnanakaran, H Yoon, J Theiler, W Abfalterer, N Hengartner, EE Giorgi, T Bhat-tacharya, B Foley, et al. Tracking changes in SARS-CoV-2 Spike: evidence that D614G increases infec-tivity of the COVID-19 virus.
Cell , 2020.[23] Jun Lan, Jiwan Ge, Jinfang Yu, Sisi Shan, Huan Zhou, Shilong Fan, Qi Zhang, Xuanling Shi, QishengWang, Linqi Zhang, et al. Structure of the SARS-CoV-2 spike receptor-binding domain bound to theACE2 receptor.
Nature , 581(7807):215–220, 2020.2724] Ulrich Omasits, Christian H Ahrens, Sebastian M ¨uller, and Bernd Wollscheid. Protter: interactive pro-tein feature visualization and integration with experimental proteomic data.
Bioinformatics , 30(6):884–886, 2014.[25] Yujie Ren, Ting Shu, Di Wu, Jingfang Mu, Chong Wang, Muhan Huang, Yang Han, Xue-Yi Zhang,Wei Zhou, Yang Qiu, et al. The ORF3a protein of SARS-CoV-2 induces apoptosis in cells.
Cellular &Molecular Immunology , pages 1–3, 2020.[26] Sk Sarif Hassan, Pabitra Pal Choudhury, Pallab Basu, and Siddhartha Sankar Jana. Molecular conser-vation and Differential mutation on ORF3a gene in Indian SARS-CoV2 genomes.
Genomics , 2020.[27] Adnan Shah. Novel Coronavirus-Induced NLRP3 Inflammasome Activation: A Potential Drug Targetin the Treatment of COVID-19.
Frontiers in Immunology , 11, 2020.[28] Cromwell T Cornillez-Ty, Lujian Liao, John R Yates, Peter Kuhn, and Michael J Buchmeier. Severeacute respiratory syndrome coronavirus nonstructural protein 2 interacts with a host protein complexinvolved in mitochondrial biogenesis and intracellular signaling.
Journal of Virology , 83(19):10314–10318, 2009.[29] Adeyemi O Adedeji, Bruno Marchand, Aartjan JW Te Velthuis, Eric J Snijder, Susan Weiss, Robert LEoff, Kamalendra Singh, and Stefan G Sarafianos. Mechanism of nucleic acid unwinding by SARS-CoVhelicase.
PloS One , 7(5):e36521, 2012.[30] Chun-Kit Yuen, Joy-Yan Lam, Wan-Man Wong, Long-Fung Mak, Xiaohui Wang, Hin Chu, Jian-PiaoCai, Dong-Yan Jin, Kelvin Kai-Wang To, Jasper Fuk-Woo Chan, et al. SARS-CoV-2 nsp13, nsp14, nsp15and orf6 function as potent interferon antagonists.
Emerging Microbes & Infections , pages 1–29, 2020.[31] K`evin Knoops, Marjolein Kikkert, Sjoerd HE Van Den Worm, Jessika C Zevenhoven-Dobbe, YvonneVan Der Meer, Abraham J Koster, A Mieke Mommaas, and Eric J Snijder. SARS-coronavirus replicationis supported by a reticulovesicular network of modified endoplasmic reticulum.
PLoS Biol , 6(9):e226,2008.[32] Yiwen Zhang, Junsong Zhang, Yingshi Chen, Baohong Luo, Yaochang Yuan, Feng Huang, Tao Yang,Fei Yu, Jun Liu, Bingfeng Liu, et al. The ORF8 Protein of SARS-CoV-2 Mediates Immune Evasionthrough Potently Downregulating MHC-I. bioRxiv , 2020.[33] Ling Ni, Fang Ye, Meng-Li Cheng, Yu Feng, Yong-Qiang Deng, Hui Zhao, Peng Wei, Jiwan Ge, Mengt-ing Gou, Xiaoli Li, et al. Detection of SARS-CoV-2-specific humoral and cellular immunity in COVID-19 convalescent individuals.
Immunity , 2020.[34] Jiahui Chen, Rui Wang, Menglun Wang, and Guo-Wei Wei. Mutations strengthened SARS-CoV-2infectivity.
Journal of Molecular Biology, https://doi.org/10.1016/j.jmb.2020.07.009 , 2020.[35] Bette Korber, Will Fischer, S Gnana Gnanakaran, Heyjin Yoon, James Theiler, Werner Abfalterer, BrianFoley, Elena E Giorgi, Tanmoy Bhattacharya, Matthew D Parker, et al. Spike mutation pipeline revealsthe emergence of a more transmissible form of SARS-CoV-2. bioRxiv , 2020.[36] Lizhou Zhang, Cody B Jackson, Huihui Mou, Amrita Ojha, Erumbi S Rangarajan, Tina Izard, MichaelFarzan, and Hyeryunc Choe. The D614G mutation in the SARS-CoV-2 spike protein reduces S1 shed-ding and increases infectivity. bioRxiv , 2020.[37] Yuelong Shu and John McCauley. GISAID: Global initiative on sharing all influenza data–from visionto reality.
Eurosurveillance , 22(13), 2017.[38] Fabian Sievers and Desmond G Higgins. Clustal omega.
Current Protocols in Bioinformatics , 48(1):3–13,2014. 2839] Dennis A Benson, Ilene Karsch-Mizrachi, David J Lipman, James Ostell, and Eric W Sayers. GenBank.
Nucleic acids research , 37(suppl 1):D26–D31, 2009.[40] Jianyi Yang, Renxiang Yan, Ambrish Roy, Dong Xu, Jonathan Poisson, and Yang Zhang. The I-TASSERSuite: protein structure and function prediction.
Nature methods , 12(1):7–8, 2015.[41] Rui Wang, Yuta Hozumi, Changchuan Yin, and Guo-Wei Wei. Decoding asymptomatic COVID-19infection and transmission. arXiv preprint arXiv:2007.01344 , 2020.[42] Menglun Wang, Zixuan Cang, and Guo-Wei Wei. A topology-based network tree for the prediction ofprotein–protein binding affinity changes following mutation.
Nature Machine Intelligence , 2(2):116–123,2020.[43] Justina Jankauskait ˙e, Brian Jim´enez-Garc´ıa, Justas Dapk ¯unas, Juan Fern´andez-Recio, and Iain H Moal.SKEMPI 2.0: an updated benchmark of changes in protein–protein binding energy, kinetics and ther-modynamics upon mutation.
Bioinformatics , 35(3):462–469, 2019.[44] Yves Dehouck, Aline Grosfils, Benjamin Folch, Dimitri Gilis, Philippe Bogaerts, and MarianneRooman. Fast and accurate predictions of protein stability changes upon mutations using statisticalpotentials and neural networks: PoPMuSiC-2.0.
Bioinformatics , 25(19):2537–2543, 2009.[45] Jerome H Friedman. Greedy function approximation: a gradient boosting machine.
Annals of statistics ,pages 1189–1232, 2001.[46] Duc Duy Nguyen, Kelin Xia, and Guo-Wei Wei. Generalized flexibility-rigidity index.
The Journal ofChemical Physics , 144(23):234106, 2016.[47] Ernesto Estrada and Juan A Rodriguez-Velazquez. Subgraph centrality in complex networks.