[PDF] Statistical challenges in the analysis of sequence and structure data for the COVID-19 spike protein

Abstract

As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statistical approaches to tackle some challenges associated with the analysis of S-protein data. We build a Bayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences, after grouping the sequences into representative clusters. We then apply sampling methods to investigate possible changes to the S-protein's 3-D structure as a result of commonly observed mutations. While the increasing spread of D614G variants has been noted in other research, our results also show that the co-occurring mutations of D614G together with S477N or A222V may spread even more rapidly, as quantified by our model estimates.

Full PDF

SStatistical challenges in the analysis of sequence andstructure data for the COVID-19 spike protein

Shiyu He and Samuel W.K. Wong ∗ Department of Statistics and Actuarial Science, University of WaterlooJanuary 30, 2021

Abstract

As the major target of many vaccines and neutralizing antibodies against SARS-CoV-2, the spike (S) protein is observed to mutate over time. In this paper, we present statisticalapproaches to tackle some challenges associated with the analysis of S-protein data. We build aBayesian hierarchical model to study the temporal and spatial evolution of S-protein sequences,after grouping the sequences into representative clusters. We then apply sampling methods toinvestigate possible changes to the S-protein’s 3-D structure as a result of commonly observedmutations. While the increasing spread of D614G variants has been noted in other research,our results also show that the co-occurring mutations of D614G together with S477N or A222Vmay spread even more rapidly, as quantiﬁed by our model estimates.

Key words and phrases:

SARS-CoV-2, Bayesian hierarchical models, compositional dataanalysis, mutant clusters, conformational sampling

The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a strain of novel coronavirusthat caused the COVID-19 outbreak in Wuhan, China in December 2019, has quickly spread acrossthe world and has been characterized as a global pandemic (Zhou et al., 2020a; Wu et al., 2020).As of December 19, 2020, there have been over 74 million probable or conﬁrmed cases of COVID-19, and the illness has been associated with 1.66 million deaths around the world (WHO, 2020a).The development of vaccines and antibody-based therapeutic agents has been initiated since thebeginning of the pandemic and several have moved into phase III trials (Krammer, 2020; WHO,2020b). Results concerning the long-term immunogenicity and eﬃcacy of these vaccine candidatesare a subject of continued research. Meanwhile, the virus has been found to mutate in human-to-human transmissions over time, and these changes can potentially alter the eﬃcacy of theseinterventions. Therefore, it is also of vital importance to identify and study mutations with possibleﬁtness advantages and increased infectiousness.SARS-CoV-2 is a single-stranded RNA virus, and RNA viruses are known to have high mutationrates and genetic diversity compared to DNA viruses (Duﬀy, 2018; Lauring and Andino, 2010).Their ability to evolve underlies why they can adapt to novel hosts and develop resistance to either ∗ Address for correspondence: Department of Statistics and Actuarial Science, University of Waterloo, Waterloo,ON, Canada. E-mail: [email protected] a r X i v : . [ s t a t . A P ] J a n accine or infection-induced immunity. Often, the most consequential mutations in terms of viralfunctions and resistance to neutralizing antibodies are those that alter the surface proteins of thevirus. For instance, the mutation A82V in the Ebola virus glycoprotein was conﬁrmed to haveenhanced infectivity and increased the severity of the EVD epidemic (Diehl et al., 2016). Further,co-occurring mutations of A143V and R148K in the inﬂuenza H7N9 surface protein led to a 10-foldreduction in its sensitivity to neutralizing antibodies (Ning et al., 2019). As a result, mutations inthe SARS-CoV-2 genome are being continuously monitored over time, and a major public repositoryfor sequenced genomes is GISAID ( https://gisaid.org ). In addition to collecting viral genomedata, GISAID also provides tools for visualizing the spread of various mutations, organized intophylogenetic clusters (also known as clades ) over space and time.Four structural proteins – spike (S), envelope (E), membrane (M), and nucleocapsid (N) – arethe building blocks for the SARS-CoV-2 virus particle (Phan, 2020). Out of these four proteins, theS-protein plays the most critical role in attachment and entry into host cells, through its bindingwith the human ACE2 receptor (Wan et al., 2020). For this reason, the S-protein is the majortarget of many vaccines and neutralizing antibodies against SARS-CoV-2 (Amanat and Krammer,2020). In the event of infection, these antibodies can disrupt the spike protein’s ability to bind withthe ACE2 receptor, thereby blocking its entry into host cells. The analyses in this paper focus onmutations in the amino acid sequence of the S-protein due to its particular importance.Among all currently known S-protein sequence variants resulting from mutations in the under-lying genome, D614G has been the most extensively studied due to its relatively early emergenceand subsequent prevalence. The notation “D614G” means that the amino acid D (aspartic acid) inposition 614 of the original (or reference ) sequence has mutated to G (glycine), where the lettersare used to denote the 20 diﬀerent amino acid types. A rapid increase in D614G was observed inmany regions after its initial appearance, which suggested ﬁtness advantages and the hypothesisthat variants with D614G are likely more infectious (Korber et al., 2020). This was later corrobo-rated by experimental evidence that D614G, either by itself or in conjunction with other mutations,is signiﬁcantly more infectious than the reference S-protein sequence (Li et al., 2020). Overall, thecontinued evolution of the virus has resulted in thousands of distinct S-protein sequence variantsrecorded in GISAID, although many of these only diﬀer by a few mutated sequence positions. Whileclinical or laboratory experiments can test the infectivity of speciﬁc mutations, it is challenging toanalyze large numbers of sequence variants.Computational researchers have thus used clustering as a means to gain interpretable insightinto the eﬀect of S-protein mutations across diﬀerent geographical regions. Temporal changes inthe prevalence of S-protein mutations have also been studied in related research. For instance,Chen et al. (2020b) clustered mutations occurring in the receptor binding domain (RBD) of theS-protein and studied binding aﬃnity changes for each cluster. Based on common amino acidmutations, Toyoshima et al. (2020) classiﬁed 28 countries into three clusters and studied correla-tions between fatality rate and S-protein D614G variants. In addition, the hypotheses of mono-tonic trends for D614G and various other mutations have been tested using isotonic regressionby Korber et al. (2020) and in the COVID-19 pipelines of the Los Alamos National Laboratory( https://cov.lanl.gov ). However, to the best of our knowledge, few authors have built com-prehensive statistical models for the evolution of S-protein mutant clusters (i.e., groups of closelyrelated sequence variants) over space and time. Such models can have an important practical valuein providing forecasts and early warnings for countries where S-protein sequence variants with po-tential ﬁtness advantages or higher infectiousness are actively being transmitted. To that end, thispaper presents one such Bayesian hierarchical model for multinomial time series that can help tacklethis problem. 2he 3-D structure of a protein corresponding to its amino acid sequence is a crucial part of thepuzzle for understanding how the protein functions; thus, of particular interest here is the 3-D struc-ture of the SARS-CoV-2 S-protein and its mutated variants. Often, sequence mutations associatedwith changes in viral infectivity can be attributed to changes in protein structure (Schaefer andRost, 2012). The ﬁrst 3-D structure of the SARS-CoV-2 S-protein was released in mid-February2020 (Wrapp et al., 2020), and since then many other S-protein structures have been added to thepublicly available Protein Data Bank (PDB) (Bernstein et al., 1977). However, laboratory exper-iments to determine protein structure are laborious and costly, and ultimately some prove to beintractable. For this reason, the structural impacts of common S-protein mutations are not yet well-documented in the PDB, and computational methods are needed to predict their impact. Diﬀerenttools for 3-D protein structures have been used for this purpose thus far, including protein-proteinbinding aﬃnity prediction (Chen et al., 2020b), comparative modeling with known PDB structures(Sedova et al., 2020), and Monte Carlo sampling of protein segments (Wong, 2020). In this paper wefollow the illustrative analysis in Wong (2020), applying similar statistical sampling approaches toassess the potential local structural changes to the S-protein for common mutations in the currentmutant clusters considered.Overall then, our goal in this paper to illustrate statistical ideas for tackling the aforementionedchallenges associated with the analysis of S-protein data, both their sequence and structure aspects.Speciﬁcally based on presently available data, we study temporal and spatial changes in the mu-tations of S-protein sequences and their structural impact, with the aim to better understand theongoing evolution of the disease. Our contribution can be summarized in three parts. First, wedevelop a Bayesian hierarchical model to study the evolution of mutant clusters. Second, we applysampling methods to analyze the local structural changes of the most frequently occurring proteinsequence mutations in these clusters. Third, we discuss our ﬁndings and relate them to other recentwork reported in the literature. The S-protein sequence dataset for SARS-CoV-2 was obtained from GISAID on Oct 14th, 2020,with the number of sequences totaling 98,699 after incomplete sequences were removed. The fullS-protein, based on the ﬁrst discovered reference sequence, is 1273 amino acids long. Out of all com-plete sequences, 3,205 of them are unique, indicating that viral evolution has resulted in substantialgenetic diversity. Our analysis of complete sequences shows that D614G is the most frequent mu-tation, appearing in 86.5% of recorded sequences, followed by S477N (6.3%), A222V (3.6%), L18F(1.9%), and L5F (0.99%), R21I (0.98%), and D936Y (0.74%). Many of these mutations are alsomentioned in the recent literature where mutation analysis was considered (Korber et al., 2020;Chen et al., 2020a; Hodcroft et al., 2020). The sequences from GISAID are indexed by country anddate of deposition, which allows us to conveniently group them for subsequent analysis. Sequencesare separately deposited by local laboratories, therefore sequence counts vary widely by countryand may not be well-correlated with actual case counts. Due to this concern, we instead focus onthe relative prevalence, i.e., proportions of counts, throughout this paper.To analyze the large numbers of sequence variants, we ﬁrst implemented hierarchical clusteringto group the unique sequences according to their similarities. The distance matrix for hierarchicalclusters was based on the number of pairwise mismatched letters and the Ward-D linkage criterion,which creates groups such that variance is minimized within clusters (Ward, 1963). For our illus-3rative analysis, we chose to use a total of ﬁve clusters, which aims to achieve a balance betweenseparability of the diﬀerent clusters and interpretability of the results. Table 1 shows the mostcommon mutations in each cluster and their frequencies, where it can be seen that these clustersall have identiﬁable patterns. For example, D614G is dominant in cluster I and present in 89% ofsequences within that cluster, while cluster II has the lowest frequency of mutations, indicating thatit is composed of sequences with a high level of similarity to the original reference sequence. In clus-ters III, IV, and V, D614G frequently occurred together with L5F, S477N, A222V respectively, andthese paired mutations are observed in almost all sequences within those clusters, evidencing a highprobability for the co-occurrence of some common mutations. Evidence for these co-occurrencescan also be seen in Table 2, which displays the top three unique sequences in each cluster (as rankedby frequency within that cluster) and their speciﬁc mutation positions.Rank Cluster I Cluster II Cluster III Cluster IV Cluster VMutation Freq Mutation Freq Mutation Freq Mutation Freq Mutation Freq1 D614G 89% P863H 3% L5F 98% D614G 100% D614G 99%2 S477N 3% A262T 2% D614G 87% S477N 100% A222V 98%3 L5F 2% Y453F 2% H655Y 15% T632N 8% L18F 54%4 D936Y 2% T572I 1% A222V 13% L822F 4% A262S 13%5 A222V 2% V615I 1% V3G 10% S939F 4% P272L 9%6 R21I 2% K77M 1% D574Y 10% W258L 4% D1163Y 9%7 L54F 1% A845S 1% S459Y 6% E1144Q 4% L5F 7%8 P1263L 1% L8V 1% M1229I 4% G566S 4% G1167V 6%9 Q677H 1% H655Y 1% T859I 4% P330A 2% L176F 4%Table 1: Relative frequencies of the most common mutations present in each cluster. The top 9mutations in descending order and their corresponding relative frequencies are shown in the columnsfor each of clusters I–V. For example, the D614G mutation is present in 89% of the sequencesbelonging to cluster I.We selected 9 countries from the GISAID database to study based on the larger numbers ofsequences deposited, which are United States (US), Canada (CA), United Kingdom (UK), Nether-lands (NL), France (FR), Spain (SP), China (CN), India (IN), and Australia (AU). The pie chartsin Figure 1 show the composition of clusters for each country for sequences accumulated since theoutbreak. Countries with distinctly diﬀerent compositions are China and Australia, where Chinahas the majority of its sequences from cluster II, and Australia has cluster IV as its major cluster.Cluster I is the largest cluster for the rest of the countries, followed by cluster II, while cluster Vhas a noticeable presence in Europe, especially the UK.The composition of clusters also changes over time, and as examples we show the temporal trendof the daily cluster counts (upper panels) and proportions (lower panels) for the United States (US)and the United Kingdom (UK) in Figure 2. The graphs show a major diﬀerence between the US andUK composition trends over time: cluster I in the UK peaked around May to July and cluster V sawa surge since late August or early September, while cluster I remains dominant in the US. Similarly,most European countries have cluster V surging during this period, which suggests the most commonmutations in cluster V, D614G and A222V, might be related to the rise of infections across Europeduring the late summer and early autumn of 2020 (Dong et al., 2020). In addition to this, themajority of countries have cluster II proportions decreasing over time in direct correspondence tothe emergence of cluster I infections over time. It can be seen that the GISAID sequence counts inthe upper panels of Figure 2 vary widely over time and do not necessarily correspond well with the4luster Frequency Mutation PositionsI 58,271 D614G777 R21I, D614G602 D614G, D936YII 11,617 Reference Sequence117 A829T43 L8VIII 486 L5F, D614G121 L5F43 L5F, A222V, D574Y, D614G, H655YIV 5,472 S477N, D614G85 S477N, D614G, T632N51 S477N, D614G, A930VV 1,388 A222V, D614G1,369 L18F, A222V, D614G150 A222V, A262S, P272L, D614GTable 2: Top three unique sequences in each cluster, ranked by frequency. For each unique sequence,its speciﬁc mutation positions are shown in the right column. For example, 58,271 sequencesbelonging to cluster I had exactly the one mutation D614G, while 777 sequences in cluster I hadexactly the two mutations R21I and D614G.actual case counts in the US and UK during this period.

The 3-D structure data for the S-protein were obtained from the Protein Data Bank (PDB) (Bern-stein et al., 1977). The ﬁrst laboratory-determined of a standalone 3-D structure of the SARS-CoV-2S-protein was contributed in mid-February 2020 by a team of scientists at UT Austin using cyro-EMtechniques (Wrapp et al., 2020), with PDB accession code 6VSB. Since then, many other groupsaround the world have contributed to the eﬀort of studying diﬀerent aspects of the S-protein usinglaboratory techniques. As of Oct 14th, 2020, there were 108 3-D structures publicly available in thePDB associated with the SARS-CoV-2 S-protein: 40 of these considered the S-protein in isolation,under diﬀerent conformational states and sequence variants; 11 of these studied the structure of S-protein when bound together with ACE2; the remaining 57 studied the structure of S-protein wheninteracting with diﬀerent potential antibodies. Of the 68 structures containing the S-protein boundtogether with ACE2 or an antibody, 32 of these focused on a speciﬁc region of the S-protein, knownas the receptor binding domain (RBD), primarily to decipher the binding behaviour of the S-proteinto these molecules. Otherwise, for the majority of the structures (76 out of 108), experimentersattempted to determine the structure for the full S-protein.Together, the PDB reﬂects the current state of knowledge for the S-protein structure, includingthe attempts made to assess possible neutralizing antibodies in the development of therapeuticinterventions. Overall, 3-D structure determination has not kept pace with genome sequencing:among all the amino acid mutations listed in Table 1, only D614G has been studied in the laboratory.Thus, we cannot leverage the PDB to ascertain the potential changes in the 3-D structure of theS-protein as a result of those mutations. Even in the best consensus 3-D structure for the reference5

S CA UK NL FRSP CN IN AU

Cluster I Cluster II Cluster III Cluster IV Cluster V

Figure 1: Pie charts of cluster proportions for 9 countries, based on all complete sequences accu-mulated in GISAID. For each country, the proportions of cluster I to cluster V are represented bythe 5 diﬀerent colors as indicated.sequence to date (Zhou et al., 2020b), parts of the S-protein have not been successfully determinedby the laboratory experiments, resulting in missing data. This consensus structure (PDB accessioncode 6XM0) is visualized in Figure 3, where the mutations identiﬁed in Table 2 are labelled; mutationlocations where there is no 3-D structural information available are omitted from the ﬁgure.

In this section we present a Bayesian hierarchical model for the observed sequence counts, by dayand country, in each of the ﬁve clusters identiﬁed via our exploratory analysis. In motivating themodel, we recall that the number of deposited GISAID sequences varies widely over time and bycountry. Thus, it is sensible to focus inference on the fraction of sequences belonging to each cluster(i.e., cluster proportions) in each country, as in compositional data analysis (Aitchison, 1982).Also, while LOESS identiﬁed some temporal trends for these cluster proportions in each country, itcannot be used for prediction or for comparing the overall growth rates of diﬀerent clusters. Thus,our model contains the following key features. First, it provides estimates of the cluster proportionsfor each country on any given day. Second, it assumes a growth rate parameter for each cluster thatis common across all countries, through which diﬀerences in prevalence among the S-protein mutantclusters can be quantiﬁed. Third, the temporal evolution of the cluster proportions is allowed to bedependent across countries via correlated errors.Denote the observed sequence counts by the vectors y it = ( y it , . . . , y it ), where y itc is thenumber of sequences observed in GISAID for country i = 1 , . . . , K on day t ≥ S UKJan Apr Jul Oct Jan Apr Jul Oct0100200300400

Dates C oun t US UKJan Apr Jul Oct Jan Apr Jul Oct0.000.250.500.751.00

Dates P r opo r t i on Cluster I Cluster II Cluster III Cluster IV Cluster V

Figure 2: Temporal trend of cluster counts (upper panels) and cluster proportions (lower panels)in the US and UK. The points indicate the observed proportions and counts on each day, while thesolid lines show the corresponding smoothed LOESS curves.7igure 3: 3-D structure of the reference sequence S-protein (PDB accession code 6XM0). Proteinsegments containing common mutations are highlighted in green and labeled with the correspondingmutation in Table 2; segments with mutations that have incomplete 3-D structure data are notshown.cluster c = 1 , . . . y it ∼ M ultinomial (cid:32) n it , t (cid:88) j = t − p ij (cid:33) for t ≥ n it = (cid:80) c =1 y itc is the total number of GISAID sequences recorded for country i on day t , and p it = ( p it , . . . , p it ) is the vector of probabilities representing the true underlying cluster proportionsfor country i on day t such that (cid:80) c =1 p itc = 1. Thus the model assumes that the observed sequencesrepresent a random sample from the population of infected individuals in a country, with a reportingdelay uniformly at random over the commonly assumed 14 day incubation period for the virus (Laueret al., 2020). The sampling fraction (i.e., number of reported GISAID sequences out of the numberof infected individuals in the country) may vary over time, and this setup permits inference on p it ina manner that accommodates that sampling variability. Prior to Jan 20th, 2020 (i.e., when t < t (cid:80) tj =1 p ij .We apply a log-ratio transformation on the cluster proportions, as commonly used in composi-tional data analysis (Aitchison, 1999). Treating cluster I as the baseline, we deﬁne˜ p it = (cid:20) log (cid:18) p it p it (cid:19) , log (cid:18) p it p it (cid:19) , log (cid:18) p it p it (cid:19) , log (cid:18) p it p it (cid:19)(cid:21) p it values according to˜ p it = ˜ p i,t − + α + (cid:15) it for t > α = ( α , α , α , α ) is a vector of growth rate parameters for clusters II to V, and (cid:15) it =( (cid:15) it , (cid:15) it , (cid:15) it , (cid:15) it ) represents a random noise vector with mean zero that governs daily ﬂuctuations.Note that when (cid:15) it = , equation (1) implies p itc = p i,t − ,c exp( α c ) (cid:80) c =1 p i,t − ,c exp( α c ) ∝ p i,t − ,c exp( α c )for c = 1 , . . . , α deﬁned to be 0, so that exp( α ) can be interpreted as the multiplicative dailygrowth rates of clusters II to V relative to cluster I, with the proportions normalized to sum to 1,similar to a model recently used to study the prevalence of diﬀerent ﬂu strains (Huddleston et al.,2020). The α are assumed to be the same for all countries, to reﬂect the intrinsic ﬁtness of eachmutant cluster. Deﬁne (cid:15) K,t,c = ( (cid:15) tc , . . . , (cid:15) Ktc ), namely the random noise vector on day t acrossall K countries for cluster c , where c ∈ { , , , } . Then we let (cid:15) K,t,c ∼ N K ( , Σ) independentlyfor each day and cluster, where N K denotes a K -variate Normal distribution and Σ is a covariancematrix. This formulation allows for spatial dependence in the sense that the noise term can becorrelated among countries; intuitively, if a certain cluster experiences faster than expected growthin one country for a period of time, nearby or geographically linked countries may experience similarchanges. In practice, we could set Σ to be of block diagonal form, for example, with each block ascountries within the same continent, assuming correlations between diﬀerent continents are likelynegligible. Overall, our model setup follows the general structure of Gaussian dynamic models formultinomial time series described in Cargnoni et al. (1997).We complete the model speciﬁcation with the choice of priors. During the early outbreak periodfrom Dec 24th, 2019 to Jan 6th, 2020, a total of 40 sequences worldwide were deposited in GISAID,with two that we would now classify to be in cluster I and 38 in cluster II. Thus we set a Dirichletprior for the cluster proportions in the model for the starting date of the model (Jan 7th, 2020), p i ∼ Dirichlet (3 , , , ,

1) independently for all i = 1 , . . . K countries, obtained by adding thecluster I and II observations as pseudocounts to a uniform Dirichlet distribution. Weakly informativeCauchy priors with scale 0.5 are assigned to the each unique diagonal (variance) element of Σ. Aweakly informative LKJ correlation distribution with shape parameter 2 is assigned as the prior forthe correlation matrices corresponding to the blocks of Σ. Finally, uniform priors are assigned for α . To obtain the samples for the posterior distribution of the parameters, Markov chain MonteCarlo sampling for the model was carried out via Stan (Carpenter et al., 2017). Four parallel chainswere run, with 5000 iterations each and the ﬁrst half discarded as burn-in. Certain mutant clusters may have a higher prevalence than others, as identiﬁed via the estimatesfor α . A natural follow-up question is to ask whether these diﬀerences might be related to changesin the 3-D structure of the S-protein as a result of the common mutations shown in Table 2.To compare two 3-D protein structures, it is standard practice to compute the root-mean-squaredeviation (RMSD) between its corresponding backbone atoms; the four backbone atoms (N, C α , C,O) are common to all amino acids, so this RMSD calculation can be applied even when amino acidmutations are present and provides a simple metric for assessing structural changes. However, as9escribed in section 2.2, currently the PDB lacks laboratory-determined structures for all but theD614G mutation, and thus modeling approaches are needed.A protein structure is represented by the arrangement of its atoms in 3-D space, which is knownas a conformation . Letting x denote a conformation and H a given scalar energy (or potential)function, a statistics-based approach to the problem is to draw samples from the Boltzmann distri-bution π ( x ) ∝ exp {− H ( x ) /T } , (2)where T is the eﬀective temperature. According to the energy landscape theory (Onuchic et al.,1997), a protein structure tends to be most stable around the lowest energy conformation. Whilenature’s ‘true’ energy function is not known, various energy approximations H have been developedto mimic this property for use in computational protein structure prediction (Zhang et al., 2007b).Thus in this context, the goal is to draw samples from equation (2) corresponding to the amino acidsequence before and after mutation, to assess possible structure diﬀerences among the low-energyconformations sampled.We focus here on local structural impacts, that is, possible changes to the protein structurein the segment of amino acids near the mutation position. To do so, we treat the PDB structurein Zhou et al. (2020b) (with accession code 6XM0, and visualized in Figure 3) as the referenceconsensus 3-D structure for the S-protein corresponding to the reference sequence. Then we maysample conformations for speciﬁc segments of amino acids while holding the rest of the structureﬁxed at the coordinates in this reference 3-D structure. Segment lengths of up to approximately 15amino acids has been recognized to be a rough upper bound where current sampling methods canperform adequately (Webb and Sali, 2017). For an individual mutation occurring at position j , wethus sample conformations for the length 15 segment of amino acids from positions j − j + 7.These length 15 segments for each individual mutation listed in Table 2 are highlighted in green inFigure 3.Since proteins are composed of a linear sequence of amino acids, a sequential sampling approachcan eﬀectively exploit that property, by incrementally adding one amino acid at a time to constructapproximate samples from equation (2). The idea of devising sequential sampling algorithms as away to stochastically search for realistic low-energy conformations was originally proposed in Zhanget al. (2007a) and tested on lattice representations of proteins. Subsequently, extensions of themethod applicable to real protein structures have been developed, including distance-guided chaingrowth (DisGro, Tang et al., 2014) and sequential Monte Carlo (SMC, Wong et al., 2018). Theimplementations of these two algorithms also use slightly diﬀerent approximations for the energyfunction H , and together can provide a more complete picture of the energy landscape. Thuswe apply these algorithms to sample conformations for the segments shown in Figure 3, on boththe reference sequence and the mutated sequence. Then following Wong (2020), we may computethe probability distribution of RMSDs between pairs of sampled conformations, as a way to assesspotential diﬀerences in the low-energy conformational space as a result of the mutation. Speciﬁcallyfor the D614G mutation, we can use its known structure in the PDB to validate the results of thesesampling methods. We present our results from ﬁtting the proposed Bayesian hierarchical model in section 4.1 and theresults of sampling 3-D protein conformations in section 4.2.10 .1 Estimates of growth for the mutant clusters of sequences

The posterior distribution of the parameters in the Bayesian hierarchical model (Table 3) showthat the daily growth rate parameters of clusters II to V relative to cluster I, estimated via theirposterior means, are -0.05 (-0.07, -0.03), 0.00 (-0.01, 0.02), 0.02 (-0.01, 0.04), 0.03 (0.00, 0.05),with 95% credible intervals in brackets. These estimates show that during the study period, thesequences from clusters IV and V tend to have higher growth relative to clusters I, II, and III,of which mutant cluster II clearly has the weakest growth, as seen via its posterior interval thatdoes not overlap the others. Referring to the mutation positions in Table 2, this indicates thatsequences with amino acid D in position 614 (as in the reference sequence) will tend to decreasein prevalence over time in the presence of the other mutant clusters. In addition, cluster I, whichis mostly characterized by the lone D614G mutation, may have a growth disadvantage if clustersIV or V are also spreading in the country. The clusters with the strongest growth (IV and V) areprimarily composed of variants with the co-occurrence of D614G with S477N or A222V.Parameter Mean 2.5% 97.5% α -0.0495 -0.0658 -0.0344 α α α σ NA σ EU σ AS σ AU p · , , p · , , p · , , p · , , p · , , NA1 , NA1 , -0.0059 -0.0475 0.0344Σ NA2 , -0.0059 -0.0475 0.0344Σ NA2 , EU1 , EU1 , EU1 , EU1 , -0.0133 -0.1055 0.0841Σ EU2 , EU2 , EU2 , EU2 , EU3 , EU3 , EU3 , EU3 , EU4 , -0.0133 -0.1055 0.0841Σ EU4 , EU4 , EU4 , AS1 , AS1 , AS2 , AS2 , α represents the daily growth rate parametersfor clusters II to V relative to cluster I. σ C is the random noise standard deviation for countrieswithin continent C , and p · , represents the initial proportions on Jan 7th, 2020 for all countries.Σ C for each continent together forms the diagonal block matrix Σ that represents the covariancematrix for the random noise vector.The covariance matrix Σ for the random noise vector is set up in block diagonal form: each blockrepresents countries within the same continent and parameterized as Σ C = diag ( σ C ) × Ω C × diag ( σ C ),where σ C is the standard deviation of the daily noise term for countries within continent C , andΩ C is the corresponding correlation matrix. Speciﬁcally, we deﬁned four continents: North America(NA) as (US, CA); Europe (EU) as (UK, NL, FR, SP); Asia (AS) as (CN, IN); and Australia (AU)as its own continent. The posterior means for σ EU and σ AS are relatively larger than σ NA and11 llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llll lllll llllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llll lllll llllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llll lllll llllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llll lllll llllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llll lllll llllllllllllllllllllll llllll l l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllll lllll ll l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllll lllll ll l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllll lllll ll l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllll lllll ll l lllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllll lllll llllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l lll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l lll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l lll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l lll lllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l lll llllllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll llllllll llllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllllll l China (CN) India (IN) Australia (AU)Netherlands (NL) France (FR) Spain (SP)United States (US) Canada (CA) United Kingdom (UK)Jan Apr Jul Oct Jan Apr Jul Oct Jan Apr Jul Oct0.000.250.500.751.000.000.250.500.751.000.000.250.500.751.00

Dates P r opo r t i on s l l l l l Cluster I Cluster II Cluster III Cluster IV Cluster V

Figure 4: Posterior means and 95% credible intervals of inferred cluster proportions from theBayesian hierarchical model for 9 countries. For each day, the points show the observed proportions,the middle solid lines indicate the posterior means of cluster proportions, and the bands indicatethe 95% credible interval of cluster proportions. 12 AU , indicating that overall cluster growth trends are somewhat more predictable in North Americaand Australia than Europe or Asia. On the other hand, spatial dependence in the noise termsin general is low for countries studied, with no estimated correlations exceeding 0.4. That said,based on the posterior means, European countries still have relatively larger spatial correlationsin their daily ﬂuctuations, e.g., UK and Netherlands have a correlation of 0.33, UK and Francehave a correlation of 0.28, and Netherlands and Spain have a correlation of 0.22. The estimatesshow these European countries may experience more similar day-to-day changes, while countries inNorth America and Asia do not, as the correlation appears to be negligible between US and Canada(-0.087) and between China and India (-0.045), both of which are close to 0. The relatively lowcorrelations between countries suggest that zero noise correlation between continents (i.e., blockdiagonal Σ) is a reasonable simplifying assumption.Figure 4 shows the posterior means and 95% credible intervals for the inferred cluster proportionsin the diﬀerent countries on each day t from Jan 7th, 2020 to Oct 14th, 2020. The posterior meansof p · , indicate that on Jan 7th, 2020, cluster I accounts for around 10% and cluster II accountsfor around 90% of sequences. Nonetheless these initial proportions are quite uncertain due to thelimited number of early cases, as seen in the wide credible intervals on the plot. The credibleintervals narrow as we reach periods where a larger number of sequences are deposited. Overall, wesee growth for cluster I accelerates during January to June but appears to rapidly fall oﬀ in July formany countries, while maintaining a substantive presence in the US, India and Australia. ClusterII clearly diminishes over time worldwide, and cluster III is fairly small but stable for all countries.Cluster IV estimates show small ﬂuctuations for most countries except Australia, where it expandsto over 90% from July to August and may remain as the dominant cluster. Cluster V is estimatedto ﬁrst appear in August and thereafter shows a rapid growth in all European countries.Our Bayesian hierarchical model also allows prediction of changes in cluster proportions forcountries with missing or very sparse GISAID sequence data. Canada, Spain and China onlyhave deposited sequences up to August, while France only has sequence data deposited until mid-September; our model is nonetheless able to provide the point and interval estimates for theircluster proportions over the entire period. As suggested in Figure 4, both Canada and Chinaare projected to have cluster I gradually decrease together with a possible rise in cluster V; theexpanding credible intervals reﬂect the increasing uncertainty associated with the increasing numberof days with missing data. Meanwhile, the main clusters present in France by mid-October mightbe IV or V (or their combination), while Spain is projected to be dominated by cluster V much likethe rest of Europe.To ensure that our results are robust, we performed a sensitivity analysis on the posteriorparameters given diﬀerent sets of priors, with a special focus on the sensitivity of α and p i . Wecreated the following three scenarios to compare with our base scenario prior choices in Section 3.1,and the results are shown in Table 4. In scenario 1, we set the cluster proportions on the startingdate as p i ∼ Dirichlet (2 . , . , . , . , . Dirichlet (0 . , . , . , . , . α c ∼ N (0 ,

1) independently for c = 2 , . . . ,

5, insteadof the uniform priors in our base scenario. In scenario 3, we combined both changes to the priorsmade in scenario 1 and 2. Compared with the main results in Table 3, all three scenarios in Table 4show comparable posterior means and 95% credible intervals with the base scenario, which indicatesour posterior parameters are fairly stable and robust to the diﬀerent choices of priors.13cenario 1 Scenario 2 Scenario 3Parameter Mean 2.5% 97.5% Mean 2.5% 97.5% Mean 2.5% 97.5% α -0.0492 -0.0655 -0.0351 -0.0494 -0.0656 -0.0345 -0.0503 -0.0664 -0.0352 α α α p · , , p · , , p · , , p · , , p · , , p i ∼ Dirichlet (2 . , . , . , . , . α c ∼ N (0 , , c = 2 , . . . , For each of the nine segments identiﬁed in Figure 3, we ran the SMC and DiSGro algorithms tosample conformations for the reference sequence and the mutated sequence. The speciﬁc segments,including the starting and ending positions, are shown in the ﬁrst three columns of Table 5. Notethat each of the segments considered contains a single mutation, for example, the length 15 segment923-937 sampled for A930V does not overlap with 929-943 for D936Y since they occur in diﬀerentclusters. For DiSGro (Tang et al., 2014), we used the program from the authors to generate 100000conformations and kept the 5000 with the lowest energies as the representatives. For SMC (Wonget al., 2018), we ran the algorithm with 60000 particles as in Wong (2020), and also kept the 5000with the lowest energies as the representatives. RMSD R RMSD RM Cluster Mutation Sampled segment SMC DiSGro SMC DiSGroI,III,IV,V D614G 607–621 2.98 2.56 1.99 1.93III,V A222V 215–229 1.19 1.66 1.04 1.63III D574Y 567–581 5.35 5.13 2.40 12.19III H655Y 648–662 1.41 1.71 1.48 1.54IV S477N 470–484 3.76 14.71 3.14 13.05IV T632N 625–639 2.91 4.44 2.71 4.86V P272L 265–279 0.64 0.95 0.92 1.90IV A930V 923–937 0.89 0.76 1.15 1.04I D936Y 929–943 4.17 1.72 6.33 2.44Table 5: Results for the lowest energy conformations sampled by the SMC and DiSGro algorithms onthe reference and mutated sequence segments. The RMSD R columns calculate the RMSD betweenthe reference 3-D structure and the lowest energy conformation sampled for the reference sequence,thus measuring the accuracies of the algorithms for reconstructing each of these segments of theS-protein. The RMSD RM columns calculate the RMSD between the lowest energy conformationssampled for the reference sequence and mutated sequence, thus measuring the extent to which thelocation of the energy mode may have shifted in 3-D space as a result of mutation.14o summarize the distributions of these sampled low-energy conformations in 3-D space, weused a similar metric as in Wong (2020), by computing all pairwise RMSDs between conformations.For each individual segment, let x (1) R , . . . , x (5000) R denote the 5000 conformations sampled for thereference sequence, and x (1) M , . . . , x (5000) M denote the 5000 conformations sampled for the mutatedsequence. Then we computed three sets of RMSDs, deﬁned via d RR . = (cid:110) RMSD( x ( k ) R , x ( l ) R ) (cid:111) for k, l ∈ { , , . . . , } such that k (cid:54) = l,d MM . = (cid:110) RMSD( x ( k ) M , x ( l ) M ) (cid:111) for k, l ∈ { , , . . . , } such that k (cid:54) = l, and d RM . = (cid:110) RMSD( x ( k ) R , x ( l ) M ) (cid:111) for all k, l ∈ { , , . . . , } . Thus the set d RR approximately represents the distribution obtained by repeatedly sampling tworandom low-energy conformations from the reference sequence and computing the RMSD betweenthose conformations; an analogous interpretation applies to d MM for the mutated sequence. Mean-while, d RM considers pairwise RMSDs between one random conformation from the reference se-quence and one random conformation from the mutated sequence. Visual diﬀerences between thehistograms of d RR and d RM (or d MM and d RM ) would thus suggest that the low-energy conforma-tions for the reference and mutated sequences lie in distinct regions of 3-D space.Plots for d RR , d RM , and d MM for each of the nine segments are shown in the panels of Figure5, normalized to be probability densities and smoothed via kernel density estimation (Botev et al.,2010), labeled as Reference/Reference, Reference/Mutated, and Mutated/Mutated respectively.These RMSD distributions are largely indistinguishable for most segments, suggesting that there islittle discernible impact to the local conformational space of the protein as a result of the mutation,regardless of which sampling algorithm is used. Three of the nine panels do have some visiblediﬀerences between these RMSD distributions when sampled via the SMC algorithm: D574Y,A930V, and D936Y. In each case, the Reference/Mutated probability density visually appears as acompromise between the Reference/Reference and Mutated/Mutated densities, which is sensible.In addition to the overall RMSD distributions, we may also speciﬁcally examine the lowest energyconformation, as is often done in protein structure prediction applications. First, we consideredconformations sampled for the reference sequences, where the true structure is known from thePDB. For both DiSGro and SMC methods, we computed the RMSD between the true structureand the lowest energy conformation sampled by the algorithm. These results are shown in theRMSD R columns of Table 5, which may be interpreted as the prediction accuracy if the algorithmsare tasked with reconstructing the 3-D structure for each of these segments. Overall, these resultsshow reasonable accuracies, with the segments 567–581 and 470–484 being the most diﬃcult topredict correctly for both algorithms. Second, we considered conformations sampled for the mutatedsequences, again taking the lowest energy conformation sampled by both algorithms. Here, thetrue structures are unknown (except for D614G) so prediction accuracy cannot be assessed ingeneral. Thus, in the RMSD RM columns of Table 5, we instead compute the RMSD betweenthe lowest energy sampled conformations for the reference and mutated sequences, as a way toquantify whether location of the mode of the energy distribution (as approximated by the samples)has shifted signiﬁcantly after mutation. Here, both algorithms agree in predicting that mutationsD614G, A222V, H655Y, P272L, and A930V result in relatively small local 3-D structural changes(RMSD <

2) in the lowest energy conformation, while larger local structural changes are predictedby one or both algorithms for D574Y, S477N, T632N, and D936Y.For D614G, we may validate the sampling results as this mutation has been studied in thelaboratory with a determined 3-D structure in the PDB (accession code 6XS6, Yurkovetskiy et al.,15020). The actual RMSD between the reference structure (6XM0) and 6XS6 computed over thepositions 607-620 corresponding to the sampled segment is 0.38 (coordinates for position 621 aremissing in 6XS6); in contrast, the RMSD when computed over the larger structural unit frompositions 531 to 620 is 2.61. This result indicates that the local structural change as a resultof the D614G mutation is indeed quite small, which is in agreement with the predictions of thesampling algorithms. The D614G mutation does however lead to more substantive global changesto the S-protein structure, which would be very diﬃcult to predict computationally; general proteinstructure prediction remains a highly challenging problem, despite recent progress (Kryshtafovychet al., 2019).

In this paper, we presented statistical approaches to tackle the challenges associated with theanalysis of S-protein sequence and structure data. First, to better understand the evolution ofS-protein sequences, we grouped the S-protein sequences into hierarchical clusters, and studied thespatial and temporal trend of these mutant clusters using a Bayesian hierarchical model. Second,we used sampling algorithms to investigate the possible changes in the local 3-D structure of theS-protein in the segments where the most frequent mutations occurred.Based on our model estimates, we found that on average the reference sequence and its closely-related variants will diminish, while variants with the co-occurring mutations of D614G togetherwith S477N or A222V tend to increase most strongly in prevalence over time. Our estimates oftrend not only examined individual mutations as was analyzed in Korber et al. (2020), but alsocaptured the prevalence of some co-occurring mutations that have so far received limited attentionin the literature. Nonetheless, our ﬁndings on the reference sequence do align with Korber et al.(2020), where the authors showed that a transition of position 614 from D to G occurred in manyregions around the world with varying levels of statistical signiﬁcance. Our estimates of S477N andA222V are in agreement with the trends observed by the Los Alamos National Laboratory, andA222V in particular is also consistent with Hodcroft et al. (2020) where the authors reported itspresence in the majority of sequences in Europe by the fall of 2020. In addition, we found spatialdependence in COVID-19 transmission across country boundaries to be low in general, but higherwithin Europe. This could be related to their relatively loose travel policies within EU membersduring COVID-19 (European Commission, 2020). Finally, a useful feature of our Bayesian approachis the ability to make projections of cluster proportions in countries where data is scarce or missing.The result of the sequence analysis suggests potential ﬁtness advantages or higher infectiousnessfor the co-occurring mutations D614G + S477N and D614G + A222V. In reality, while higherinfectiousness may fully explain their growth in prevalence, other epidemiological factors may alsoplay a role, for example, the characteristics of the infected population and the founder eﬀect (Korberet al., 2020). Although Li et al. (2020) conﬁrmed that D614G combined with other mutations (e.g.,L5F, V341I, K458R, etc.) are more infectious than the reference sequence, the infectivities ofD614G + S477N or D614G + A222V have not yet been mentioned and examined. Therefore,further experimental evidence is needed to conﬁrm the increased infectivity of these co-occurrentmutations.Having identiﬁed diﬀerences in the relative growth rates of the ﬁve mutant clusters considered,we examined whether the most common sequence mutations in these clusters were associated withchanges in the 3-D structure of the protein near the mutation location. Based on two diﬀerentsampling algorithms, we conclude that evidence for large local structure changes is generally weak.This computational result is consistent with the ground truth in the PDB for the one mutation16 P r obab ili t y D en s i t y SMC:Reference/ReferenceDiSGro:Reference/Reference SMC:Reference/MutatedDiSGro:Reference/Mutated SMC:Mutated/MutatedDiSGro:Mutated/Mutated

Figure 5: Probability densities of the pairwise RMSD distributions d RR (Reference/Reference), d RM (Reference/Mutated), and d MM (Mutated/Mutated) for each of the nine segments in Table 5 basedon the sampled conformations from SMC (solid lines) and DiSGro (dotted lines). The x -axes areRMSDs in units of Angstroms. 17D614G) that has been studied in the laboratory thus far. For the mutations S477N and T632Nwhich are associated with cluster IV, both algorithms agree that a shift in the local 3-D conformationwith lowest energy might be possible (with change in RMSD greater than ∼ Acknowledgements

This work was partially supported by a Discovery Grant from the Natural Sciences and EngineeringResearch Council of Canada.

References

Aitchison, J. (1982). The statistical analysis of compositional data.

Journal of the Royal StatisticalSociety: Series B (Methodological) , 44(2):139–160.Aitchison, J. (1999). Logratios and natural laws in compositional data analysis.

MathematicalGeology , 31(5):563–580.Amanat, F. and Krammer, F. (2020). Sars-cov-2 vaccines: status report.

Immunity , 52(4):583–589.Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F., Brice, M. D., Rodgers, J. R.,Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The protein data bank.

EuropeanJournal of Biochemistry , 80(2):319–324.Botev, Z. I., Grotowski, J. F., Kroese, D. P., et al. (2010). Kernel density estimation via diﬀusion.

The Annals of Statistics , 38(5):2916–2957.Cargnoni, C., M¨uller, P., and West, M. (1997). Bayesian forecasting of multinomial time seriesthrough conditionally gaussian dynamic models.

Journal of the American Statistical Association ,92(438):640–647.Carpenter, B., Gelman, A., Hoﬀman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M.,Guo, J., Li, P., and Riddell, A. (2017). Stan: A probabilistic programming language.

Journal ofstatistical software , 76(1):1–32.Chen, A. T., Altschuler, K., Zhan, S. H., Chan, Y. A., and Deverman, B. E. (2020a). COVID-19CG: Tracking SARS-CoV-2 mutations by locations and dates of interest. bioRxiv Prepr. Serv.Biol. bioRxiv:2020.09.23.310565. 18hen, J., Wang, R., Wang, M., and Wei, G.-W. (2020b). Mutations strengthened sars-cov-2 infec-tivity.

Journal of Molecular Biology , 432(19):5212–5226.Diehl, W. E., Lin, A. E., Grubaugh, N. D., Carvalho, L. M., Kim, K., Kyawe, P. P., et al. (2016).Ebola virus glycoprotein with increased infectivity dominated the 2013–2016 epidemic.

Cell ,167(4):1088–1098.Dong, E., Du, H., and Gardner, L. (2020). An interactive web-based dashboard to track covid-19in real time.

The Lancet infectious diseases , 20(5):533–534.Duﬀy, S. (2018). Why are rna virus mutation rates so damn high?

PLoS biology , 16(8):e3000003.European Commission (2020). Coronavirus: Commission proposes more clarity and predictabilityof any measures restricting free movement in the european union. https://ec.europa.eu/commission/presscorner/detail/en/ip 20 1555 . Last checked on Dec 20, 2020.Hodcroft, E. B., Zuber, M., Nadeau, S., Comas, I., Candelas, F. G., Stadler, T., and Neher, R. A.(2020). Emergence and spread of a SARS-CoV-2 variant through Europe in the summer of 2020. medRxiv . medRxiv:2020.10.25.20219063.Huddleston, J., Barnes, J. R., Rowe, T., Xu, X., Kondor, R., Wentworth, D. E., Whittaker, L.,Ermetal, B., Daniels, R. S., McCauley, J. W., et al. (2020). Integrating genotypes and phenotypesimproves long-term forecasts of seasonal inﬂuenza a/h3n2 evolution.

Elife , 9:e60067.Korber, B., Fischer, W., Gnanakaran, S., Yoon, H., Theiler, J., Abfalterer, W., Hengartner, N.,Giorgi, E., Bhattacharya, T., Foley, B., and Hastie, K. (2020). Tracking changes in SARS-CoV-2Spike: evidence that D614G increases infectivity of the COVID-19 virus.

Cell , 182(4):812–827.Krammer, F. (2020). Sars-cov-2 vaccines in development.

Nature , 586(7830):516–527.Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K., and Moult, J. (2019). Critical assessment ofmethods of protein structure prediction (casp)—round xiii.

Proteins: Structure, Function, andBioinformatics , 87(12):1011–1020.Lauer, S. A., Grantz, K. H., Bi, Q., Jones, F. K., Zheng, Q., Meredith, H. R., Azman, A. S., Reich,N. G., and Lessler, J. (2020). The incubation period of coronavirus disease 2019 (covid-19)from publicly reported conﬁrmed cases: estimation and application.

Annals of internal medicine ,172(9):577–582.Lauring, A. S. and Andino, R. (2010). Quasispecies theory and the behavior of rna viruses.

PLoSPathog , 6(7):e1001005.Li, Q., Wu, J., Nie, J., Zhang, L., Hao, H., Liu, S., et al. (2020). The impact of mutations insars-cov-2 spike on viral infectivity and antigenicity.

Cell , 182(5):1284–1294.Ning, T., Nie, J., Huang, W., Li, C., Li, X., Liu, Q., et al. (2019). Antigenic drift of inﬂuenza a(h7n9) virus hemagglutinin.

The Journal of infectious diseases , 219(1):19–25.Onuchic, J. N., Luthey-Schulten, Z., and Wolynes, P. G. (1997). Theory of protein folding: theenergy landscape perspective.

Annual review of physical chemistry , 48(1):545–600.Phan, T. (2020). Novel coronavirus: From discovery to clinical diagnostics.

Infection, Genetics andEvolution, , 79:104211. 19chaefer, C. and Rost, B. (2012). Predict impact of single amino acid change upon protein structure.

BMC genomics , 13(S4):1–10.Sedova, M., Jaroszewski, L., Alisoltani, A., and Godzik, A. (2020). Coronavirus3d: 3d structuralvisualization of covid-19 genomic divergence.

Bioinformatics , 36(15):4360–4362.Tang, K., Zhang, J., and Liang, J. (2014). Fast protein loop sampling and structure predictionusing distance-guided sequential chain-growth monte carlo method.

PLoS computational biology ,10:e1003539.Toyoshima, Y., Nemoto, K., Matsumoto, S., Nakamura, Y., and Kiyotani, K. (2020). Sars-cov-2 genomic variations associated with mortality rate of covid-19.

Journal of human genetics ,65:1075–1082.Wan, Y., Shang, J., Graham, R., Baric, R. S., and Li, F. (2020). Receptor recognition by the novelcoronavirus from wuhan: an analysis based on decade-long structural studies of sars coronavirus.

Journal of virology , 94(7):e00127–20.Ward, J. H. (1963). Hierarchical grouping to optimize an objective function.

Journal of the Americanstatistical association , 58(301):236–244.Webb, B. and Sali, A. (2017). Protein structure modeling with modeller. In

Functional Genomics ,pages 39–54. Springer.WHO (2020a). Coronavirus disease (covid-19) situation dashboard. https://who.sprinklr.com/ .Last checked on Dec 19, 2020.WHO (2020b). Draft landscape of covid-19 candidate vaccines. . Last checked onDec 20, 2020.Wong, S. W. (2020). Assessing the impacts of mutations to the structure of covid-19 spike proteinvia sequential monte carlo.

Journal of Data Science , 18(3):511–525.Wong, S. W., Liu, J. S., and Kou, S. (2018). Exploring the conformational space for protein foldingwith sequential monte carlo.

The Annals of Applied Statistics , 12(3):1628–1654.Wrapp, D., Wang, N., Corbett, K. S., Goldsmith, J. A., Hsieh, C.-L., Abiona, O., Graham, B. S., andMcLellan, J. S. (2020). Cryo-em structure of the 2019-ncov spike in the prefusion conformation.

Science , 367(6483):1260–1263.Wu, F., Zhao, S., Yu, B., Chen, Y. M., Wang, W., Song, Z. G., et al. (2020). A new coronavirusassociated with human respiratory disease in china.

Nature , 579(7798):265–269.Yurkovetskiy, L., Wang, X., Pascal, K. E., Tomkins-Tinch, C., Nyalile, T. P., Wang, Y., Baum, A.,Diehl, W. E., Dauphin, A., Carbone, C., et al. (2020). Structural and functional analysis of thed614g sars-cov-2 spike protein variant.

Cell , 183(3):739–751.Zhang, J., Kou, S. C., and Liu, J. S. (2007a). Biopolymer structure simulation and optimizationvia fragment regrowth monte carlo.

The Journal of Chemical Physics , 126(22):06B605.20hang, J., Lin, M., Chen, R., Liang, J., and Liu, J. S. (2007b). Monte carlo sampling of near-nativestructures of proteins with applications.

Proteins: Structure, Function, and Bioinformatics ,66(1):61–68.Zhou, P., Yang, X. L., Wang, X. G., Hu, B., Zhang, L., Zhang, W., et al. (2020a). A pneumoniaoutbreak associated with a new coronavirus of probable bat origin.

Nature , 579(7798):270–273.Zhou, T., Tsybovsky, Y., Olia, A. S., Gorman, J., Rapp, M., Cerutti, G., Chuang, G.-Y., Katsamba,P. S., Nazzari, A., Sampson, J. M., et al. (2020b). Cryo-em structures delineate a ph-dependentswitch that mediates endosomal positioning of sars-cov-2 spike receptor-binding domains. bioRxivPrepr. Serv. Biol.bioRxivPrepr. Serv. Biol.