[PDF] Assessing the impacts of mutations to the structure of COVID-19 spike protein via sequential Monte Carlo

Abstract

Proteins play a key role in facilitating the infectiousness of the 2019 novel coronavirus. A specific spike protein enables this virus to bind to human cells, and a thorough understanding of its 3-dimensional structure is therefore critical for developing effective therapeutic interventions. However, its structure may continue to evolve over time as a result of mutations. In this paper, we use a data science perspective to study the potential structural impacts due to ongoing mutations in its amino acid sequence. To do so, we identify a key segment of the protein and apply a sequential Monte Carlo sampling method to detect possible changes to the space of low-energy conformations for different amino acid sequences. Such computational approaches can further our understanding of this protein structure and complement laboratory efforts.

Full PDF

AAssessing the impacts of mutations to the structure ofCOVID-19 spike protein via sequential Monte Carlo

Samuel W.K. Wong ∗ Department of Statistics and Actuarial Science, University of WaterlooJune 8, 2020

Abstract

Proteins play a key role in facilitating the infectiousness of the 2019 novel coronavirus. A spe-ciﬁc spike protein enables this virus to bind to human cells, and a thorough understanding of its 3-dimensional structure is therefore critical for developing effective therapeutic interventions. However,its structure may continue to evolve over time as a result of mutations. In this paper, we use a datascience perspective to study the potential structural impacts due to ongoing mutations in its amino acidsequence. To do so, we identify a key segment of the protein and apply a sequential Monte Carlo sam-pling method to detect possible changes to the space of low-energy conformations for different aminoacid sequences. Such computational approaches can further our understanding of this protein structureand complement laboratory efforts.

The COVID-19 disease is an ongoing public health concern since the 2019 novel coronavirus, formallyknown as SARS-CoV-2 (Gorbalenya et al., 2020), has spread throughout the world. Much research hasbeen devoted to understand the mechanism by which the virus attacks human cells (e.g., Hoffmann et al.,2020; Zhang et al., 2020). Such a scientiﬁc understanding is critical to the ongoing development of ther-apeutic interventions and vaccines for COVID-19, see, e.g., Amanat and Krammer (2020) for a reviewof efforts to date. Proteins play key roles in facilitating the entry of viruses into cells, and speciﬁcallyfor SARS-CoV-2, the spike (S) glycoprotein that protrudes from its viral membrane (Walls et al., 2020).Genome sequencing of SARS-CoV-2 has shown that this new coronavirus has moderate genetic similarityto the SARS-associated coronavirus (formally SARS-CoV) that caused the 2002-2003 outbreak, havingbetween 75-80% of their genetic material in common (Lu et al., 2020). In particular, both SARS-CoV andSARS-CoV-2 utilize a spike protein to bind to the ACE2 (Angiotensin Converting Enzyme 2) receptor inhuman cells (Ou et al., 2020). Laboratory work has now shown that the 3-dimensional (3-D) structures ofthese two spike proteins are broadly similar; however, they are sufﬁciently different such that antibodiesfor SARS-CoV are not effective against SARS-CoV-2 (Wrapp et al., 2020).As a result of Anﬁnsen’s Nobel prize-winning work in the 1970s, it has been known that a proteingenerally has a stable 3-D structure that is largely determined by its amino acid sequence (Anﬁnsen, 1973).A broad consequence of this discovery, that has been veriﬁed in the subsequent decades of laboratory ∗ Address for correspondence: Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON,Canada. E-mail: [email protected] a r X i v : . [ q - b i o . B M ] J un xperimental work, is that similarities in amino acid sequences often correspond to similarities in 3-Dstructure (Krissinel, 2007). However, laboratory techniques for determining protein structure are laboriousand cannot be applied to all possible amino acid sequences of interest, e.g., in drug design applications(Khoury et al., 2014). Further, the atomic coordinates of amino acids sometimes cannot be ascertained,and will thus be missing data in the published 3-D structures (Brandt et al., 2008). Thus, to complementlaboratory work, there has been much interest in developing computational methods to tackle the proteinfolding problem, that is, to predict the 3-D structure of a protein based on its amino acid sequence (Friesneret al., 2002). While much progress has been made in the past ﬁve decades towards improving the accuracyof such computation-based structure predictions (Dill and MacCallum, 2012), the protein folding problemis still considered unsolved in general. The most successful competitor to date, as assessed by the mostrecent biannual blinded protein folding competition in 2018 known as CASP13 (Critical Assessment ofStructure Prediction) , was the deep learning method AlphaFold developed by Google’s DeepMind team(AlQuraishi, 2019).We brieﬂy summarize the timeline of key milestones in the scientiﬁc understanding of the SARS-CoV-2 spike protein, to provide context for the contribution of this paper. Using modern RNA sequencingtechnology, scientists were quickly able to determine the original ﬁrst known SARS-CoV-2 genome (Wuet al., 2020), including the amino acid sequence of its S protein, by early January 2020. On the basisof this sequence and its similarities with related proteins, including previously-determined 3-D structuresof SARS-CoV and MERS, preliminary 3-D structure predictions of the SARS-CoV-2 S protein could begenerated using computational methods . Subsequently, thanks to worldwide attention and rapid researchefforts, a UT Austin laboratory was the ﬁrst to release a 3-D structure of the SARS-CoV-2 S protein in mid-February 2020, using cyro-EM techniques (Wrapp et al., 2020). A limitation was that several segments ofthe protein were not successfully determined, and thus missing from the published structure. Nonetheless,this ground-breaking work veriﬁed that it is a key segment of the S protein, known as the receptor-bindingdomain (RBD), that has the ability to bind with human ACE2 when the RBD is in its “open” state. Further,Wrapp et al. (2020) showed that in comparison with the corresponding RBD of the SARS-CoV S protein,the two have noticeable similarity in their overall structures, but they have signiﬁcant deviations in somelocal segments. In particular, their binding sites are sufﬁciently different such that SARS-CoV antibodiescannot recognize SARS-CoV-2. By early March 2020, the 3-D structure of the RBD of the SARS-CoV-2S protein, in a structural complex bound together with ACE2, was also published via laboratory efforts(Yan et al., 2020).These 3-D structures are publicly available in the Protein Data Bank (PDB) (Bernstein et al., 1977),and are a vital piece of the puzzle in the ongoing development of interventions to neutralize the SARS-CoV-2 virus. It is also well-known, however, that viral genomes mutate over time with varying rates(Drake, 1993). Mutations underlie, for example, why the prevalent strain of inﬂuenza may be difﬁcult topredict when designing the annual ﬂu vaccine (Cohen, 2017). Mutations are changes in the viral genomethat occur as the virus replicates; possible results of mutations include changes in the amino acid sequenceof its proteins – via additions, deletions, or substitutions of individual amino acids (Sanju´an et al., 2010).Thus, scientists continue to pay close attention to variations in sequenced SARS-CoV-2 genomes overtime (Tang et al., 2020). At the time of this writing, among sequenced SARS-CoV-2 genomes therehave already been two amino acid substitutions observed in a key segment of its S protein RBD, as weshall subsequently detail. Therefore, there is concern about the impact of such mutations on the proteinstructure, and in turn on the binding ability and infectiousness of SARS-CoV-2 as it continues to evolveworldwide (Jia et al., 2020). It is infeasible to use laboratory-based structure determination to study allpotential sequence mutations, thus we may instead use computational methods to assess the potential http://predictioncenter.org The SARS-CoV-2 S protein consists of a linear sequence of 1273 amino acids, and our focus is its RBDwhich consists of the amino acid positions 329 to 522 (Wrapp et al., 2020), and we shall use this SARS-CoV-2 sequence to reference positions for the remainder of the paper. This sequence may be comparedto that the SARS-CoV S protein, which is overall slightly shorter at 1255 amino acids long, with itscorresponding RBD located from amino acids 316 to 508. The two RBDs share 144 (74%) of their aminoacids in common, and the SARS-CoV-2 RBD has one extra amino acid (Valine) inserted at position 483.It is also worth noting that there are existing known protein sequences with higher degree of similarityto SARS-CoV-2 than SARS-CoV: the S protein of the bat coronavirus known as RaTG13 (Zhou et al.,2020) has an amino acid sequence that matches 97% of the SARS-CoV-2 S protein. Thus RaTG13 Smay be even more structurally similar to SARS-CoV-2 S; however, there is no 3-D laboratory-determinedstructure currently available for RaTG13 S in the PDB. To visualize, Figure 1 lays out the RBDs of thesethree sequences in an optimally aligned fashion that shows their matching, or conserved , amino acidsusing the blue shaded colors. There are 20 amino acid types, with each being represented by its one-letterabbreviation. The dash symbol (–) indicates the position where SARS-CoV has one fewer amino acidcompared to SARS-CoV-2 and RaTG13. At the sequence level, it can be seen that SARS-CoV-2 S andSARS-CoV S differ the most in the segments 437 to 461 and 469 to 487, the latter being indicated by thered box.

A standard metric for comparing two 3-D protein structures is the root-mean-square deviation (RMSD)between the corresponding pairs of atomic Cartesian coordinates, after applying the optimal translationand rotation to superimpose the structures. Speciﬁcally, it is standard practice to use backbone atoms forRMSD calculations, which are the N, C, C α , and O atoms that are common to all 20 amino acid typesand link adjacent amino acids together. The side chains are functional groups extending from the C α atomand are unique for each of the 20 amino acid types. That is, we may equivalently say that an amino acidsubstitution implies that a change in the side chain has occurred at that position.Based on the SARS-CoV-2 S protein sequence and previously known structures in the PDB, re-searchers at University of Washington used their Robetta structure prediction server (Kim et al., 2004)to build a preliminary 3-D structure prediction for the entire SARS-CoV-2 S protein . Subsequently, 3-D Available for download from http://new.robetta.org/results.php?id=15652

We now speciﬁcally examine the available 3-D structures for the RBDs of SARS-CoV-2 and SARS-CoV, using public data from the PDB. For this purpose we also download the structural complex of theRBD bound together with the ACE2 receptor for SARS-CoV (PDB code: 2AJF) to compare with that ofSARS-CoV-2 (PDB code: 6LZG). These structural complexes have more complete atomic coordinates forthe RBD, compared to structures of the full standalone S protein which are missing key RBD segments(PDB codes: 5X58 for SARS-CoV, 6VSB for SARS-CoV-2). The RMSD may be computed on the two4BDs after alignment; positions 389 to 394 of SARS-CoV-2 are excluded from the calculation as theatomic coordinates atoms of the corresponding positions in the SARS-CoV 3-D structure are missing,and position 483 is also excluded as SARS-CoV-2 has an extra amino acid at that position. The RMSDbetween the two RBDs calculated in this way is 1.69, indicating a strong level of similarity in the overallstructures.To examine how this variation is distributed over the length of the RBD, in Figure 2 we plot the RMSDsof the backbone atoms for each amino acid position separately. Positions where the two protein sequencesare conserved are indicated by the circles, while the square markers indicate positions where the aminoacid types differ. From the plot, it can be seen that there are two distinct segments with high RMSDvalues between the two RBDs: the ﬁrst is the ﬁve amino acids before the gap of missing coordinatesin the SARS-CoV structure, and the second is the segment 469 to 487 as indicated by the dashed lines,where we previously noted a low level of sequence similarity. It can also be observed that while there aremany sequence differences in the segment 437 to 461 as well, these have less of an effect on structuraldifference. In Yan et al. (2020), the authors also note that while there is overall similarity between thetwo RBDs, there are non-trivial differences in the positions where the RBDs forms chemical bonds withACE2. In particular, positions 474 and 486 of the SARS-CoV-2 RBD were identiﬁed as binding sites,where there is substantial structural dissimilarity with the SARS-CoV RBD. llllllllllllllllllllllllllllllll llllllllllllll lllllll llllllllllllllllllllllllllllll ll lllllll l llllllll ll lllll lll lllllllllllllll

350 400 450 500

Sequence position (in SARS−CoV−2) R M S D be t w een SA R S − C o V and SA R S − C o V − l Mutated a.a.Conserved a.a.

Figure 2: RMSDs between the 3-D structures of the SARS-CoV-2 and SARS-CoV RBDs in their boundcomplex with ACE2, calculated at each amino acid position. Black circles indicate positions where thesequences are conserved (i.e., having the same amino acid type), while red squares indicate positions withdifferent amino acid types. The dashed lines indicate positions 469 and 487, the key segment of interestin this paper.Since this structural comparison reveals that 469 to 487 is likely a critical segment affected by aminoacid substitutions, we will focus in this paper on the potential impact of mutations to this segment. Indeed,this is also a segment where the SARS-CoV-2 sequence is noticeably distinct from its closest known relatedsequence RaTG13, with 4 of the 22 substitutions in the RBD occurring this segment; resulting structuraldifferences here may partially explain its inability to infect humans. Using BLAST (Altschul et al., 1997)5n the nr sequence database on April 23, 2020, we ﬁnd that there are already two amino acid substitutionsobserved within the segment 469 to 487 among existing SARS-CoV-2 genomes. These are shown in Table1, along with the RaTG13 and SARS-CoV sequences for this segment. Further, the BLAST output showsthat there are otherwise no prior sequences in the entire database, besides these four, that share more than10 amino acids in common with this segment of SARS-CoV-2. Thus, we may emphasize that the 3-Dstructure of this segment would be especially challenging to predict using computational methods, e.g.,as seen in the Robetta results (Section 2.2), as the known structures of existing sequences in the PDB canprovide very limited guidance in this case.Table 1: Amino acid sequences for positions 469 to 487 in the spike proteins of SARS-CoV-2 and itsknown mutants, RaTG13, and SARS-CoV. The Identiﬁer column shows the sequence ID as given in the nr database searched via BLAST; ‘PDB’ indicates that a 3-D structure is available for that protein. Thebold letters in the Sequence column indicate amino acid differences compared to the ﬁrst known SARS-CoV-2 sequence. STEIYQAGSTPCNGVEGFN

STEIYQA S STPCNGVEGFN

STEIYQAGSTPCNG A EGFN

STEIYQAGS K PCNG QT G L N S NVPFSPDGK PC TP-PAL N For these challenging segments, sampling-based methods can be used to more effectively explore thespace of plausible 3-D structures. Thus in what follows, we assess and quantify the structural uncertaintiesassociated with mutations of this segment, assuming the rest of the SARS-CoV-2 S protein is held ﬁxed.To do so, we will use an energy function to guide the sampling, for each of the ﬁve sequences identiﬁed inTable 1. A powerful method for this purpose is via sequential Monte Carlo, which we motivate and reviewnext.

The energy landscape theory of Onuchic et al. (1997) suggests that a protein structure stabilizes at the 3-Darrangement of atoms, or conformation , with the lowest potential energy. This principle can be leveragedby sampling methods designed to explore the space of possible low energy conformations for a givenamino acid sequence. Suppose H is a given energy function, that takes a conformation x as input, andoutputs a scalar value for energy. Then for a given amino acid sequence, we may conceptualize samplingconformations from the Boltzmann distribution: π ( x ) ∝ exp {− H ( x ) /T } , (1)where T is the effective temperature that may be set to 1 by appropriately scaling the energy function.In practice, natures true energy function is not known, and so various approximate energy functionshave been developed, often with parameters trained on structure data such that realistic conformations(i.e., those closer to the ‘truth’ as determined by laboratory techniques) generally have lower energies thanthose that are not (Zhang et al., 2007b). All trained energy functions are imperfect, in the sense that aconformation with larger RMSD from the truth may sometimes have lower energy than a conformation6ith smaller RMSD. For this study, we adopt an energy function that has been used with reasonablesuccess in Wong et al. (2018), and our previous work shows that it is useful for the purpose of quantifyingthe space of low-energy conformations.Protein geometric contraints (including bond lengths and angles) allow x to be more simply parametrizedin terms of free dihedral angles, rather than Cartesian coordinates. Each amino acid in the sequence hasthree such angles ( φ, ψ, ω ) that determine the placements of its backbone atoms, along with 0-4 additionalangles denoted χ governing the placement of its side chain atoms. The goal then is to draw samples(conformations) from the high-density regions of π ( x ) , or equivalently from the low-energy regions. Con-sidering the ﬁve amino acid sequences identiﬁed in Table 1, we note that the sampling problem is difﬁcult:the segment of interest is 19 amino acids long, which is a high dimensional space with > geometric de-grees of freedom (backbone plus side chain for each amino acid). Further, the energy landscape is highlymultimodal and rough due to the numerous pair-wise atomic interactions within the protein. Due to thedifﬁculty of this sampling problem, most previous sampling methods have focused on shorter segments,e.g., of lengths 12 to 17 (Tang et al., 2014).A powerful approach that can be leveraged for this sampling problem comes from Monte Carlo meth-ods, and speciﬁcally sequential Monte Carlo (SMC). The original conformation sampling algorithm forprotein segments based on sequential sampling techniques was proposed in Zhang et al. (2007a), whichoutperformed other approaches on 2-D and 3-D lattice models. In subsequent work, sampling methodsinspired by SMC were also shown to be successful on real protein structures (Tang et al., 2014; Wonget al., 2018). This paper adopts the SMC methodology presented in Wong et al. (2018). A brief overviewof the method is provided here and the interested reader may refer to that paper for details. The key ideais to sequentially sample the angles of x one amino acid at a time, and this provides a natural incrementaldistribution for SMC where each particle is a partially constructed conformation. At the i -th SMC step,we sample from the conditional distribution of amino acid i in the sequence, given the previously sampledamino acids, , , . . . , i − . That is, we sample ( φ i , ψ i , ω i , χ i ) , conditional on ( φ i − , ψ i − , ω i − , χ i − ) when i > , according to an incremental energy function ∆ H ( φ i , ψ i , ω i , χ i | φ i − , ψ i − , ω i − , χ i − ) .The incremental energy helps to (a) rule out the possibility of atomic clashes, and (b) ensure that thesegment connects properly with the two ﬁxed ends of the rest of the protein. The pool of partial confor-mations is expanded and then ﬁltered at each step to ensure that a diverse set of low-energy particles ismaintained. A secondary ﬁltering step is embedded to handle the sampling of the amino acid side chains.The ﬁnal output of the SMC sampler is a set of sampled conformations for the input sequence, that can beconsidered to represent the low-energy space of (1). Thus analysis of these SMC samples can yield insightinto the differences between the conformational spaces corresponding to the different input sequences.We close this section with some technical details. To obtain our main results, we will apply the SMCmethod to each of the ﬁve amino acid sequences identiﬁed in Table 1. Since the focus is on potential localstructural impacts of amino acid substitutions, we hold the rest of the S protein ﬁxed at the coordinatesin the PDB structure 6VSB. We note that this is a simplifying assumption, as we cannot rule out thepossibility that local substitutions can have more global impacts on the protein structure. Nonetheless,this is common practice in the bioinformatics literature on loop modeling (e.g., Soto et al., 2008; Tanget al., 2014; Wong et al., 2017; Marks et al., 2017), and likewise we expect this approach to yield usefulinformation here. Following, we then examine the sampled conformations for the ﬁve sequences, and thecomparison of their distributions can provide insight into the impacts of the amino acid substitutions. Toprepare the data, we take the coordinates from the PDB ﬁle 6VSB as the base 3-D structure for the SARS-CoV-2 S protein. Recall that the 6VSB PDB structure is missing the coordinates of many amino acidsin the RBD, while 6LZG is overall complete. By superimposing the coordinates of the 150 RBD aminoacids present in both 6LZG and 6VSB, we ﬁnd a RMSD of 0.95. Thus, given their overall similarity, weproceed with an approximation by imputing the missing RBD coordinates from 6VSB with those from7LZG after optimal superposition. Using this base structure, we then create mutants by substituting thesequences in Table 1 for the segment 469 to 487. We then use the SMC method to generate conformationsfor this segment, in all of the sequence variants as well as the base structure, as a basis for comparisons. We present our main results in Section 4.1: the SMC method is applied to the sequences listed in Table 1,and the potential differences in their 3-D conformational spaces are quantiﬁed. Following in Section 4.2,the protein segment prediction accuracies of state-of-the-art methods are evaluated on the known struc-tures of SARS-CoV and SARS-CoV-2, highlighting the need for further advances in structure predictionalgorithms.

For each of the ﬁve sequences in Table 1, the SMC method was run multiple (six) times, each with60000 particles to ensure good coverage of the low-energy conformational space. This required a totalruntime of approximately 3 hours per sequence, on an 8-core Xeon 3.7GHz CPU. In the subsequentcomparative analysis, we kept the 20000 lowest energy conformations among the SMC samples as therepresentatives for each sequence. We shall denote these samples as x ( k ) i , for sequences i ∈ { , , , , } and conformations k ∈ { , , . . . , } .To gain insight into the distributions of these conformations in 3-D space, we performed RMSDcalculations on the segment of interest, namely positions 469 to 487. Speciﬁcally, we computed pair-wise RMSDs between conformations, where we deﬁne sets of RMSDs grouped according to sequences i, j ∈ { , , , , } : d ij . = (cid:110) RMSD ( x ( k ) i , x ( l ) j ) (cid:111) for all k, l ∈ { , , . . . , } if i (cid:54) = j, and d ii . = (cid:110) RMSD ( x ( k ) i , x ( l ) i ) (cid:111) for k, l ∈ { , , . . . , } such that k (cid:54) = l. (2)Thus the set d ij approximately represents the distribution that would be obtained by repeatedly drawingone random conformation from the low-energy space of sequence i , one random conformation from thelow-energy space of sequence j , and computing the RMSD between those conformations. Histograms of d ij then provide a simple way to compare these distributions among the ﬁve sequences: if the distributionof d ii is very different from that of d ij for j (cid:54) = i , then that suggests that the plausible low-energy conforma-tions for the two sequences are located in very distinct regions of 3-D space. Equivalently, that suggeststhe amino acid differences between the two sequences can potentially lead to signiﬁcant changes in thecorresponding 3-D structure, including its binding capacity.We plot d ij for all pairs i, j in Figure 3, where the i -th panel includes all the distributions d ij , j =1 , , , , to facilitate visual comparison. Kernel density estimation (KDE) was applied to obtain theprobability densities shown, using the KDE implementation from Botev et al. (2010). We may drawseveral key observations from these results: • First, the SARS-CoV amino acid sequence is the most dissimilar from the others, and the bottompanel shows that its low-energy conformational space is also very distinct from the others: the long-dashed curve shows that most conformations sampled for SARS-CoV are within 2 to 10 RMSD ofeach other, but mostly > RMSD away from conformations sampled for the other four sequences.This relates to Figure 2: the segment 469 to 487 had large structural differences between the PDB8tructures of SARS-CoV and SARS-CoV-2, likely due to the large number of amino acid differencesin that segment as well as other parts of the sequence. Our analysis shows that if only this segment ofSARS-CoV was substituted into SARS-CoV-2, keeping the rest of the sequence ﬁxed, there wouldlikely still be a large difference between the 3-D structures within that segment. • Second, the bat coronavirus RaTG13 sequence appears to have a conformational space that is muchmore similar to SARS-CoV-2 compared to SARS-CoV, as seen in the fourth panel. With four aminoacid substitutions in the segment 469 to 487 compared to the three SARS-CoV-2 variants, its result-ing RMSD distribution could be somewhat distinct from SARS-CoV-2: RaTG13 has a larger modeat ∼ RMSD and a smaller mode at ∼ RMSD compared to all three SARS-CoV-2 variants. • Third, the SARS-CoV-2 mutation at position 483 (Mutant 2) appears to possibly have a more sub-stantive effect compared than the mutation at position 478 (Mutant 1). Mutant 2 involves the sub-stitution of Valine (V) with Alanine (G), which has a less bulky side chain with two fewer methyl(CH ) groups. The resulting RMSD distributions comparing this mutant to the other SARS-CoV-2variants and RaTG13 (third panel) become nearly indistinguishable. Any biological basis for thisphenomenon would, of course, require further investigation.To close this section, in Figure 4 we visualize the ﬁve lowest energy conformations among the SMCsamples, for each of the sequences. Since the segment from 469 to 487 is sampled with the rest of theprotein held ﬁxed, the panels show close-ups focusing on the protein backbone for that region of the 3-Dstructure, along with the nearby amino acids to which this segment is connected. The ﬁve lowest energyconformations are shown with different color strands, as indicated in the ﬁgure. Some structural variabil-ity can be observed within these low-energy samples for each sequence, indicating the SMC sampler issuccessful at discovering distinct conformations with similarly low energy values. Much greater structuralvariability can be seen between SARS-CoV-2, RaTG13, and SARS-CoV overall than between the threeSARS-CoV-2 variants. The segment of the RBD studied in this paper poses a challenging test for structure prediction algorithmsdesigned to tackle protein segments. We evaluate and compare three such recent methods: the SMCmethod of Wong et al. (2018); the DiSGro method of Tang et al. (2014); the next-generation KIC (NGK)method in the Rosetta suite (Stein and Kortemme, 2013). All of these methods are sampling-based and donot rely on the knowledge of existing sequences in the PDB. We use the known structures of SARS-CoV-2and SARS-CoV to provide two speciﬁc test cases: • Segment 469 to 487 of the SARS-CoV-2 RBD, the main segment of interest in this paper. We usethe true structure provided by PDB code 6LZG, as it has a complete set of atomic coordinates forthis segment. • Segment 456 to 473 of the SARS-CoV RBD, which is the corresponding segment in SARS-CoVas shown in Figure 1 in the red box. The true structure with complete atomic coordinates for thissegment is provided by PDB code 5X5B.To prepare a test case, the segment is deleted from the protein while holding the rest of the 3-D structureﬁxed at the truth. To assess a method, we use it to draw 500 samples representing the conformational The Robetta server described in Section 2.2 is the automated structure prediction pipeline for entire proteins, built on theRosetta modeling suite. . . . . SARS−CoV−2 PDB structure sequence (Original) . . . . SARS−CoV−2 mutation at position 476 (Mutant 1) . . . . SARS−CoV−2 mutation at position 483 (Mutant 2) . . . . Bat coronavirus sequence (RaTG13) . . . . SARS−CoV sequence (SARS−CoV) P r obab ili t y D en s i t y Pairwise RMSDs from:OriginalMutant 1Mutant 2RaTG13SARS−CoV

Figure 3: Probability densities of the pairwise RMSDs d ij , comparing the ﬁve sequences in Table 1. Forexample, in the top panel, the solid curve shows the distribution of pairwise RMSDs among the 20000low-energy conformations sampled for the original SARS-CoV-2 sequence, while the long-dashed curvein the same panel shows the distribution of pairwise RMSDs when comparing the samples of the originalSARS-CoV-2 sequence with those of SARS-CoV. 10 ARS-CoV-2 original

STEIYQAGSTPCNGVEGFN

SARS-CoV-2 mutant 1

STEIYQA S STPCNGVEGFN

SARS-CoV-2 mutant 2

STEIYQAGSTPCNG A EGFN

Bat coronavirus RaTG13

STEIYQAGS K PCNG QT G L N SARS-CoV S NVPFSPDGK PC TP-PAL N Figure 4: For each sequence in Table 1, the conformations of the ﬁve SMC samples with the lowestenergy are displayed in these close-ups (colored in order of ascending energy: grey, turquoise, yellow,blue, orange). A portion of the ﬁxed part of the protein is visible: the white strands where this segmentconnects to the rest of the protein, along with some nearby helices and sheets.11pace of the given segment, and the lowest-energy conformation is the method’s structure prediction forthe missing segment. Accuracy is evaluated using the RMSD of the reconstructed segments to the knowntruth. For the SMC method (Wong et al., 2018), we used 60000 particles as in Section 4.1, and outputted500 ﬁnal conformations. For the DiSGro method (Tang et al., 2014), we used the authors’ program andincreased the default setting of 5000 generated conformations to 100000 to obtain the best possible results;the lowest-energy 500 conformations were kept as the representative samples. For the NGK method (Steinand Kortemme, 2013), we used the version included in the most recent release of Rosetta 3.12 on April9, 2020 along with the optimal settings as recommended by the online guide , and ran the program 500times to obtain the desired samples.We note that the DiSGro and SMC methods can each complete the entire sampling and prediction taskfor a test case in under 30 minutes on an 8-core workstation, with DiSGro requiring only about 10 CPUminutes. However, NGK is signiﬁcantly more computationally intensive: generating one sample requiresabout 45 CPU minutes, so ∼ A. Smallest RMSD sampled B. RMSD of prediction

SMC DiSGro NGK SMC DiSGro NGKSARS-CoV-2 2.55 2.64 1.90 6.97 10.06 8.26SARS-CoV 3.24 4.02 2.18 5.02 9.60 4.36

In this paper, we used sequential Monte Carlo to investigate the possible effects to the 3-D structure of theSARS-CoV-2 spike protein due to mutations in its amino acid sequence. SMC is potentially a powerfultechnique for this purpose, as it is effective at sampling conformations with low energy for the segmentof interest. Thus, by comparing the sampled conformations for several sequences, the method can helpdetect changes to the low-energy regions of the 3-D conformational space as a result of the amino aciddifferences among the sequences. Our results are consistent with the observed differences between theknown 3-D structures for the speciﬁc variants of SARS-CoV-2 and SARS-CoV in the Protein Data Bank,and also provide some preliminary intuition about the potential role of mutations that are being observed in https://guybrush.ucsf.edu/benchmarks/benchmarks/loop modeling Acknowledgements

This work was partially supported by a Discovery Grant from the Natural Sciences and Engineering Re-search Council of Canada.

References

AlQuraishi, M. (2019). Alphafold at casp13.

Bioinformatics , 35(22):4862–4865.Altschul, S. F., Madden, T. L., Sch¨affer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J.(1997). Gapped blast and psi-blast: a new generation of protein database search programs.

Nucleicacids research , 25(17):3389–3402.Amanat, F. and Krammer, F. (2020). Sars-cov-2 vaccines: status report.

Immunity , 52(4):583–589.Anﬁnsen, C. (1973). Principles that govern the folding of protein chains.

Science , 181:223–230.Bernstein, F. C., Koetzle, T. F., Williams, G. J., Meyer, E. F., Brice, M. D., Rodgers, J. R., Kennard, O.,Shimanouchi, T., and Tasumi, M. (1977). The protein data bank.

European Journal of Biochemistry ,80(2):319–324.Botev, Z. I., Grotowski, J. F., Kroese, D. P., et al. (2010). Kernel density estimation via diffusion.

TheAnnals of Statistics , 38(5):2916–2957.Brandt, B. W., Heringa, J., and Leunissen, J. A. (2008). Seqatoms: a web tool for identifying missingregions in pdb in sequence context.

Nucleic acids research , 36(suppl 2):W255–W259.Cohen, J. (2017). Why is the ﬂu vaccine so mediocre?

Science , 357:1222–1223.Dill, K. A. and MacCallum, J. L. (2012). The protein-folding problem, 50 years on.

Science ,338(6110):1042–1046.Drake, J. W. (1993). Rates of spontaneous mutation among rna viruses.

Proceedings of the NationalAcademy of Sciences , 90(9):4171–4175. 13riesner, R. A., Prigogine, I., and Rice, S. A. (2002).

Computational methods for protein folding , volume120. John Wiley & Sons.Gorbalenya, A. E., Baker, S. C., Baric, R. S., de Groot Raoul, J., Drosten, C., Gulyaeva, A. A., Haagmans,B. L., Lauber, C., Leontovich, A. M., Neuman, B. W., et al. (2020). The species severe acute respiratorysyndrome-related coronavirus: classifying 2019-ncov and naming it sars-cov-2.

Nature Microbiology ,5(4):536–544.Hoffmann, M., Kleine-Weber, H., Schroeder, S., Kr¨uger, N., Herrler, T., Erichsen, S., Schiergens, T. S.,Herrler, G., Wu, N.-H., Nitsche, A., et al. (2020). Sars-cov-2 cell entry depends on ace2 and tmprss2and is blocked by a clinically proven protease inhibitor.

Cell , 181(2):271–280.Jia, Y., Shen, G., Zhang, Y., Huang, K.-S., Ho, H.-Y., Hor, W.-S., Yang, C.-H., Li, C., and Wang, W.-L.(2020). Analysis of the mutation dynamics of sars-cov-2 reveals the spread history and emergence ofrbd mutant with lower ace2 binding afﬁnity. bioRxiv .Khoury, G. A., Smadbeck, J., Kieslich, C. A., and Floudas, C. A. (2014). Protein folding and de novoprotein design for biotechnological applications.

Trends in biotechnology , 32(2):99–109.Kim, D. E., Chivian, D., and Baker, D. (2004). Protein structure prediction and analysis using the robettaserver.

Nucleic acids research , 32(suppl 2):W526–W531.Krissinel, E. (2007). On the relationship between sequence and structure similarities in proteomics.

Bioin-formatics , 23(6):717–723.Lu, R., Zhao, X., Li, J., Niu, P., Yang, B., Wu, H., Wang, W., Song, H., Huang, B., Zhu, N., et al. (2020).Genomic characterisation and epidemiology of 2019 novel coronavirus: implications for virus originsand receptor binding.

The Lancet , 395(10224):565–574.Marks, C., Nowak, J., Klostermann, S., Georges, G., Dunbar, J., Shi, J., Kelm, S., and Deane, C. M.(2017). Sphinx: merging knowledge-based and ab initio approaches to improve protein loop prediction.

Bioinformatics , 33(9):1346–1353.Onuchic, J. N., Luthey-Schulten, Z., and Wolynes, P. G. (1997). Theory of protein folding: the energylandscape perspective.

Annual review of physical chemistry , 48(1):545–600.Ou, X., Liu, Y., Lei, X., Li, P., Mi, D., Ren, L., Guo, L., Guo, R., Chen, T., Hu, J., et al. (2020). Character-ization of spike glycoprotein of sars-cov-2 on virus entry and its immune cross-reactivity with sars-cov.

Nature communications , 11(1):1620.Sanju´an, R., Nebot, M. R., Chirico, N., Mansky, L. M., and Belshaw, R. (2010). Viral mutation rates.

Journal of Virology , 84(19):9733–9748.Soto, C. S., Fasnacht, M., Zhu, J., Forrest, L., and Honig, B. (2008). Loop modeling: Sampling, ﬁltering,and scoring.

Proteins: Structure, Function, and Bioinformatics , 70(3):834–843.Stein, A. and Kortemme, T. (2013). Improvements to robotics-inspired conformational sampling in rosetta.

PloS one , 8(5).Tang, K., Wong, S. W., Liu, J. S., Zhang, J., and Liang, J. (2015). Conformational sampling and structureprediction of multiple interacting loops in soluble and β -barrel membrane proteins using multi-loopdistance-guided chain-growth monte carlo method. Bioinformatics , 31(16):2646–2652.Tang, K., Zhang, J., and Liang, J. (2014). Fast protein loop sampling and structure prediction usingdistance-guided sequential chain-growth monte carlo method.

PLoS computational biology , 10(4).Tang, X., Wu, C., Li, X., Song, Y., Yao, X., Wu, X., Duan, Y., Zhang, H., Wang, Y., Qian, Z., et al. (2020).On the origin and continuing evolution of sars-cov-2.

National Science Review .Walls, A. C., Park, Y.-J., Tortorici, M. A., Wall, A., McGuire, A. T., and Veesler, D. (2020). Structure,function, and antigenicity of the sars-cov-2 spike glycoprotein.

Cell , 181:281–292.Wong, S. W., Liu, J. S., and Kou, S. (2017). Fast de novo discovery of low-energy protein loop conforma-tions.

Proteins: Structure, Function, and Bioinformatics , 85(8):1402–1412.Wong, S. W., Liu, J. S., Kou, S., et al. (2018). Exploring the conformational space for protein folding with14equential monte carlo.

The Annals of Applied Statistics , 12(3):1628–1654.Wrapp, D., Wang, N., Corbett, K. S., Goldsmith, J. A., Hsieh, C.-L., Abiona, O., Graham, B. S., andMcLellan, J. S. (2020). Cryo-em structure of the 2019-ncov spike in the prefusion conformation.

Sci-ence , 367(6483):1260–1263.Wu, F., Zhao, S., Yu, B., Chen, Y.-M., Wang, W., Song, Z.-G., Hu, Y., Tao, Z.-W., Tian, J.-H., Pei,Y.-Y., et al. (2020). A new coronavirus associated with human respiratory disease in china.

Nature ,579(7798):265–269.Yan, R., Zhang, Y., Li, Y., Xia, L., Guo, Y., and Zhou, Q. (2020). Structural basis for the recognition ofsars-cov-2 by full-length human ace2.

Science , 367(6485):1444–1448.Zhang, H., Penninger, J. M., Li, Y., Zhong, N., and Slutsky, A. S. (2020). Angiotensin-converting enzyme2 (ace2) as a sars-cov-2 receptor: molecular mechanisms and potential therapeutic target.

Intensive caremedicine , 46:586–590.Zhang, J., Kou, S. C., and Liu, J. S. (2007a). Biopolymer structure simulation and optimization viafragment regrowth monte carlo.

The Journal of Chemical Physics , 126(22):06B605.Zhang, J., Lin, M., Chen, R., Liang, J., and Liu, J. S. (2007b). Monte carlo sampling of near-nativestructures of proteins with applications.

Proteins: Structure, Function, and Bioinformatics , 66(1):61–68.Zhou, P., Yang, X.-L., Wang, X.-G., Hu, B., Zhang, L., Zhang, W., Si, H.-R., Zhu, Y., Li, B., Huang, C.-L.,et al. (2020). A pneumonia outbreak associated with a new coronavirus of probable bat origin.