Global analysis of more than 50,000 SARS-Cov-2 genomes reveals epistasis between 8 viral genes
Hong-Li Zeng, Vito Dichio, Edwin Rodríguez Horta, Kaisa Thorell, Erik Aurell
GGlobal analysis of more than 50,000 SARS-Cov-2 genomes reveals epistasis between 8viral genes
Hong-Li Zeng,
1, 2, ∗ Vito Dichio,
3, 2, 4
Edwin Rodr´ıguez Horta, Kaisa Thorell,
6, 7 and Erik Aurell † School of Science, and New Energy Technology Engineering Laboratory of Jiangsu Province,Nanjing University of Posts and Telecommunications, Nanjing 210023, China Nordita, Royal Institute of Technology, and Stockholm University, SE-10691 Stockholm, Sweden Department of Physics, University of Trieste, Strada Costiera 11 34151, Trieste, Italy KTH – Royal Institute of Technology, AlbaNova University Center, SE-106 91 Stockholm, Sweden Group of Complex Systems and Statistical Physics. Department ofTheoretical Physics, Physics Faculty, University of Havana, Cuba Department of Infectious Diseases, Institute of Biomedicine,Sahlgrenska Academy of the University of Gothenburg, Gothenburg, Sweden Center for Translational Microbiome Research, Department of Microbiology,Cell and Tumor biology, Karolinska Institutet, Stockholm, Sweden (Dated: October 6, 2020)Genome-wide epistasis analysis is a powerful tool to infer gene interactions, which can guidedrug and vaccine development and lead to deeper understanding of microbial pathogenesis. Wehave considered all complete SARS-CoV-2 genomes deposited in the GISAID repository until four different cut-off dates, and used Direct Coupling Analysis together with an assumption of Quasi-Linkage Equilibrium to infer epistatic contributions to fitness from polymorphic loci. We find eight interactions, of which three between pairs where one locus lies in gene ORF3a, both lociholding non-synonymous mutations. We also find interactions between two loci in gene nsp13, bothholding non-synonymous mutations, and four interactions involving one locus holding a synonymousmutation. Altogether we infer interactions between loci in viral genes ORF3a and nsp2, nsp12 andnsp6, between ORF8 and nsp4, and between loci in genes nsp2, nsp13 and nsp14. The paper opensthe prospect to use prominent epistatically linked pairs as a starting point to search for combinatorialweaknesses of recombinant viral pathogens.
I. INTRODUCTION
The pandemic of the disease COVID-19 has so far ledto the confirmed deaths of more than 991,224 people [1]and has hurt millions. As the health crisis has been metby Non-Pharmacological Interventions [2, 3] there hasbeen significant economic disruption in many countries.The search for vaccine or treatment against the new coro-navirus SARS-CoV-2 is therefore a world-wide priority.The GISAID repository [4] contains a rapidly increas-ing collection of SARS-CoV-2 whole-genome sequences,and has already been leveraged to identify mutationalhotspots and potential drug targets [5]. Coronavirusesin general exhibit a large amount of recombination [6–9].The distribution of genotypes in a viral population cantherefore be expected to be in the state of Quasi-LinkageEquilibrium [10–12], and directly related to epistatic con-tributions to fitness [13, 14]. We have determined a list ofthe largest such contributions from 51,676 SARS-CoV-2genomes by a Direct Coupling Analysis (DCA) [15, 16].This family of techniques has earlier been used to in-fer the fitness landscape of HIV-1 Gag [17, 18] to con-nect bacterial genotypes and phenotypes through co-evolutionary landscapes [19] and to enhance models ofamino acid sequence evolution [20]. We apply a recentenhancement of this technique to eliminate predictionsthat can be attributed to phylogenetics (shared inher-itance) [21]. We find that eight predictions stand outbetween pairs of polymorphic sites located in genes nsp2 ∗ [email protected] † [email protected] and ORF3a, nsp4 and ORF8, and between genes nsp2,nsp6, nsp12, nsp13, nsp14 and ORF3a. Most of thesesites have been documented in the literature when itcomes to single-locus variations [22–27]. The nsp4-ORF8pair was additionally found to be strongly correlated inan early study [28]. It does not show prominent corre-lations today, but is ranked second in our global analy-sis. The epistasis analysis of this paper brings a differentperspective than correlations, and highlights pairwise as-sociations that have remained stable as orders of moreSARS-CoV-2 genomes have been sequenced. II. DATA AND METHODSA. Genome Data of SARS-CoV-2
We analyzed the consensus sequences deposited in theGISAID database [4] with high quality and full lengths(number of bps ≈ , B. Multiple-Sequence Alignment (MSA)
Multiple sequence alignments were constructed withthe online alignment server MAFFT [30, 31] for the twosmaller data sets with cut-off dates 2020-04-01 and 2020- a r X i v : . [ q - b i o . Q M ] O c t S = { σ ni | i = 1 , ..., L, n =1 , ..., N } , composed of N genomic sequences which arealigned over L positions [16, 21]. Each entry σ ni of ma-trix S is either one of the 4 nucleotides (A,C,G,T), or“not known nucleotide” (N), or the alignment gap ‘-’ in-troduced to treat nucleotide deletions or insertions, orsome minorities. C. MSA filtering
Before filtering, we transform the MSA in two differentways as follows: • The gaps ‘-’ are transformed to ‘N’ while the minors‘KFY...’ are mapped to ‘N’. There thus 5 states re-mains, where ‘NACGT’ are represented by ‘12345’; • The minors ‘KFY...’ are mapped to ‘N’. Thenthere are 6 states, with ‘-NACGT’ represented by‘012345’.The following criteria are used for pre-filtering of theMSA from the 2020-08-08 data-set. If for one locus thesame nucleotide is found in more than 96.5% of this col-umn, or if the sum of the percentages of A, C, G and Tat this position is less than 20%, then this locus will bedeleted. For each sequence, if the percentage of a cer-tain nucleotide is more than 80%, or if the sum of thepercentages of A, C, G and T in this sequence is lessthan 20%, then this sequence will be deleted. With thisfiltering criteria, many loci but no sequences are deleted.There are left 51,676 sequences and 689 loci.
D. B-effective number
To mitigate the effects of dependent samplings, it isstandard practice to attach to each collected genome se-quence σ ( b ) a weight w b [15, 16, 32], which normalizes itsimpact on the inference procedure. An efficient way tomeasure the similarity between two sequences σ ( a ) and σ ( b ) is to compute the fraction of identical nucleotidesand compare it with a preassigned threshold value x inthe range 0 ≤ x ≤
1. The weight of a sequence σ ( b ) canbe set as w b = m b , with m b the number of sequences inthe MSA that are similar to σ ( b ) : m b = |{ a ∈ { , ..., B }} :overlap ( σ ( a ) , σ ( b ) ) ≥ x | ; (1)here overlap is the fraction of loci where the two se-quences are identical. The B-effective number of thetransformed sequences is defined as B eff = B (cid:88) b =1 w b . (2) We compare the B eff value with different x for the fil-tered MSA with q = 5 and q = 6 respectively. As shownin Fig. 1, the data-set with 6 states shows larger B eff number for all tested x . We thus perform our analy-sis on the data-set with q = 6 states, where ‘-NACGT’represented by ‘012345’. x B e ff q=6q=5 FIG. 1. B eff number of the 2020-08-08 pre-filtered data-setwith threshold x . Red for q = 6 states (‘-NACGT’) andblack for q = 5 states (‘NACGT’). The number of states aredetermined by the transform criteria of the pre-filtered MSA. The re-weighting procedure partially addresses a pointraised [33], that sequenced viral genomes are not a ran-dom sample of the global population. That is, even ifsequencing is biased by the country they occur in andby contact tracing, sufficiently similar genomes will havelower weight and so each will contribute less to predic-tions.
E. Elements of Quasi-Linkage-Equilibrium (QLE)
The phenomenon of QLE was discovered byM. Kimura while investigating the steady-state distri-bution over two bi-allelic loci evolving under mutation,recombination and selection, with both additive andepistatic contributions to fitness [10]. In the absence ofepistasis such a system evolves toward Linkage Equilib-rium (LE) where the distribution of alleles at the two lociare independent. The covariance of alleles at the two locithen vanishes. In the presence of pairwise epistasis andsufficiently high rate of recombination, the steady-statedistribution takes form of a Gibbs-Boltzmann form P ( σ , ..., σ L ) = 1 Z exp {− H ( σ , ..., σ L ) } , (3)with an ”energy function” H ( σ , ..., σ L ) = (cid:88) i h i ( σ i ) + (cid:88) ij J ij ( σ i , σ j ) (4)In above J ij can be related to the epistatic contributionto fitness between loci i and j with alleles σ i and σ j [11–13]. The quantity h i is similarly a function of allele σ i which depends on both additive and epistatic contribu-tions to fitness involving locus i . It has been verified in insilico testing that when the terms in (4) can be recoveredthis is a means to infer epistatic fitness from samples ofgenotypes in a population [14]. In the bacterial realmthis approach was used earlier to infer epistatic contri-butions to fitness in the human pathogens Streptococcuspneumonia [34] and
Neisseria gonorrhoeae [35], both ofwhich are characterized by a high rate of recombination.The method was also tested on data on the bacterialpathogen
Vibrio parahemolyticus [36]. In that study theresults from DCA were not superior from an analysisbased on Fisher exact test, see Appendix G for a discus-sion. This is consistent with the approach taken here, as
V. parahemolyticus has low rate of recombination. Fur-ther details on the QLE state of evolving populations aregiven in Appendix A.
F. Inference Method for epistasis between loci
The basic assumption of modeling the filtered MSAis that it is composed by independent samples that fol-lows the Gibbs-Boltzmann distribution (3) with H as in(4) Higher order interactions are also possible to include,but we ignore them here [37]. This assumption is a sim-plification of the biological reality, however provides anefficient way to extract information from massive data.On the other hand, in the context of inference fromprotein sequences, it has been argued that the one en- coded in Eq. (3-4) is the minimal generative model i.e.capable not only to reproduce the empirical frequenciesand correlations but also to generate new sequences in-distinguishable from natural sequences [16, 38, 39].Many techniques have been developed to infer the di-rect couplings in Eq. (3), as reviewed in [40] and refer-ences therein, see also Appendix C. We employ the maxi-mum pseudo-likelihood (PLM) method [13, 32, 41–44] toinfer the epistatic effects between loci from the alignedMSA. PLM estimates parameters from conditional prob-abilities of one sequence conditioned on all the others.For Potts model with multiple states q >
2, this condi-tional probability is P ( σ i | σ \ i ) = exp (cid:16) h i ( σ i ) + (cid:80) j (cid:54) = i J ij ( σ i , σ j ) (cid:17)(cid:80) u exp (cid:16) h i ( u ) + (cid:80) j (cid:54) = i J ij ( u, σ j ) (cid:17) , (5)with u = { , , , , , } the possible state of σ i . Eq.(5) depends on a much smaller parameter set comparedwith that in Eq. (3). This leads a much faster inferenceprocedure of parameters compared with the maximumlikelihood method. With a given independent samplesets, one can maximize the corresponding log-likelihoodfunction P L i ( h i , { J ij } ) = 1 N (cid:88) s h i (cid:16) σ ( s ) i (cid:17) + 1 N (cid:88) s (cid:88) j (cid:54) = i J ij ( σ ( s ) i , σ ( s ) j ) − N (cid:88) s log (cid:88) u exp (cid:0) h i ( u ) + (cid:88) j (cid:54) = i J ij ( u, σ ( s ) j ) (cid:1) , (6)where s labels the sequences (samples), from 1 to N .With the filtered MSA, we then run the asymmetric ver-sion of PLM [32] in the implementation PLM availableon [45] with regularization parameter λ = 0 .
1. The in-ferred interactions between loci i and j are scored by theFrobenius norm. G. Relation to correlation analysis
In LE the distributions of alleles over different lociare independent. Given unlimited data and unlimitedcomputational resources, the terms J ij in (4) inferredfrom the data would then be zero. The locus-locus co-variances, defined as c ij ( a, b ) = (cid:10) σ i ,a σ j ,b (cid:11) − (cid:104) σ i ,a (cid:105) (cid:10) σ j ,b (cid:11) (7)would also be zero. The Frobenius norm of c ij ( a, b ) overindices ( a, b ) as a score of strength of correlations wouldbe zero as well. The qualitative difference between cor-relation analysis and global model inference based on (3)and (4) is that two loci i and j may be correlated (”in-directly coupled”) even if their interaction J ij is zero,provided they both interact with a third locus k . Datain Table IV and Fig. 5 show that the leading interactionsretrieved by DCA cannot be stably recovered in correla-tion analysis. A different score of statistical dependency between two categorical random variables is mutual in-formation (MI). Appendix G shows that the result doesnot substantially change if using MI instead of Frobenuisnorm of correlation matrices. Circos plots of interactionsbased on correlation scores are available on [29]. H. Epistasis analysis with PLM scores
PLM procedure yields a fully connected paradigm be-tween pairwise loci. To extract important informationform massive SARS-CoV-2 genomic sequences, we focuson the significant scores between loci, the top-200 pairs.With a reference sequence “Wuhan-Hu-1”, we identifythe positions of the corresponding nucleotides. The vi-sualization of these epistasis is performed by ‘circos’ soft-ware [46].
I. Randomized background distributions
A way to assess the validity of a small number ofleading retained predictions among a much larger set ofmostly discarded predictions is to compare to random-ized backgrounds. The retained predictions are then inany case large (by some measure) and would also be re-tained if selection would be made according to some cut-off, or an empirical p -value. The problem is thus howto distinguish the case where a small sub-set of retainedvalues are large because they are different, from the casewhen in a large number of samples such values wouldappear at random. This problem can be addressed bycomparing the retained values to the largest values fromthe same procedure applied to randomized data, as wasdone for predicted RNA-RNA binding energies in a non-coding RNA discovery pipeline [47]. In the context ofDCA (PLM) applied to genome-scale MSAs, two earlierstudies relying on randomized background distributionsare [48] and [13]. J. PLM scores with randomization
To understand the nature of the top-200 PLM scoreswe perform two distinct randomization strategies onthe MSA, such that its conservation patterns and (or)phylogenetic relations are preserved, while intrinsic co-evolutionary couplings (epistatic interactions) are re-moved [49]. Running DCA on artificial sequences ensem-bles generated by these strategies, and comparing themto the results obtained from original MSA allows to dis-entangle spurious couplings given by finite-size effects orby phylogeny. The first strategy, we refer as ’profile’, ran-domizes the input MSA by random but independent per-mutation of all its columns conserving the single-columnsstatistics for all sites. This destroys all kind of correla-tions and DCA couplings inferred from such samples areonly non-zero due to the noise caused by finite sample-size. In the second strategy referred as ’phylogeny’, theoriginal MSA is randomized by a simulated annealingschedule where columns and rows are changed simulta-neously but so that inter-sequence distances are kept in-variant. Phylogeny inferred from inter-sequence distanceinformation would therefore be unchanged. Conversely,if the predicted epistatic interactions are due to phy-logeny, they should also show up in terms recovered byPLM from MSAs scrambled by ’phylogeny’. More de-tails on the randomization strategies can be found inAppendix D.
III. RESULTS
The predicted effective interactions between loci wereobtained from Pseudo-Likelihood Maximization (PLM)scores, a standard computational method to performDCA. Manual inspection shows that about half of thetop-50 links and most of the top-200 involve noncodingsites in the 5’ or 3’ region on the “Wuhan-Hu-1” [50]reference sequence, many of them have very short rangeand most of them with a large fraction of the gap orN (unknown nucleotide) symbols (data available on [29]for other data-set). We present the links with both ter-minal loci located in coding regions and the mutationsexcluding gaps or ‘N’s.In Table I we list the significant links for the 2020-08-08 data-set. The first column is the index of eachpairwise interaction in the top 200s. The second col-umn indicates the locus with lower genomic position in the pair and the name of the SARS-CoV-2 proteins itbelongs to. The third column lists the major / minor al-lele (most prevalent, second most prevalent nucleotide)and the mutation type at that locus. The following twocolumns provide similar information on the locus withhigher genomic position in the pair. The last columncontains the PLM scores indicating the strength of ef-fects between pairs of loci. The pairwise epistasis listedin Table I for 2020-08-08 dataset are visualized by circossoftware in Fig. 2, where the red ones for the close effects(the distance between two loci is less or equal to 3 locus)while blue for distant effects. Analogous results for the2020-05-02 dataset is shown in Appendix L, and for the2020-04-01 and 2020-04-08 data-sets on [29].To check if the interactions can be explained by phy-logeny (inherited variations) we used two randomizationstrategies ‘profile’ and ‘phylogeny’ of the Multiple Se-quence Alignments (MSAs). Profile preserves the distri-bution over alleles at every locus but does so indepen-dently at each locus. Profile hence destroys all system-atic co-variations between loci. Phylogeny additionallypreserves the genetic distance between each pair of se-quences. Viral genealogies inferred from this informationare therefore unchanged under this randomization. PLMscores run on these two types of randomized data (scram-bled MSAs) is a background from which the significanceof the interactions from the original data can be assessed.Each randomization strategy is repeated 50 times withdifferent realizations of the scrambling, see Appendix Eand [29]. As shown in Fig. 3 the distribution of PLMscores using phylogeny and profile are qualitatively dif-ferent from PLM scores of the original MSA, with pro-gressively fewer interactions at high score values. Withprofile randomization, no interactions predicted by PLMappear with scores standing out from the background.Phylogeny randomization on the other hand preservessome interactions found by PLM in a fraction of the re-alizations of the random background. Table II lists inter-actions predicted by PLM that appear in some phylogenyrandomizations with scores large compared to the back-ground. In the following analysis we have not retainedthem, see Appendix E for circos visualizations. Table IIIlists the eight interactions found by PLM which either donot appear in any phylogeny randomization with scoresthat stand out from the background, or, in the case of(1059-25563), shows up three times in top-200 out of 50samples. We retain these eight predicted epistatic in-teractions in the sampled populations of SARS-CoV-2genomes. The top ones listed in Table III are marked byred arrows in Fig. 3(a).Epistatic interactions obtained from DCA reflect pair-wise statistical associations, but not correlations. Asreviewed in [40], and described in Appendix C, DCAis based on a global probabilistic model, and there-fore ranks inter-dependency differently than correlations.Fig. 4 compared to Fig. 3 shows that the distribution ofcorrelation scores is qualitatively different from the dis-tributions of DCA scores in the GISAID data set. Fig. 5further shows that the rank of the epistatic interactionpredicted in Table III have remained stable, while thecorresponding correlations have merged into the back-ground.The first-ranked interaction between 1059 and 25563 isbetween a (C/T), resulting in the T85I non-synonymousmutation in gene nsp2 and a (G/T), resulting in theQ57H non-synonymous mutation in gene ORF3a. nsp2,expressed as part of the ORF1a polyprotein, binds tohost proteins prohibitin 1 and prohibitin 2 (PHB1 andPHB2) in SARS-CoV [51]. The variations in the site1059 have been predicted to modify nsp2 RNA sec-ondary structure [52] and have previously been reportedto co-occur together with the Q57H variant in ORF3ain a dataset of SARS-CoV-2 genomes from the UnitedStates [53]. ORF3a, also known as ExoN1 hypotheti-cal protein sars3a, forms a cation channel of which thestructure in SARS-CoV-2 is known by Cryo-EM [54]. InSARS-CoV ORF3a been shown to up-regulate expres-sion of fibrinogen subunits FGA, FGB and FGG in hostlung epithelial cells [55], to form an ion channel whichmodulates virus release [56], to activate the NLRP3 in-flammasome [57], and has been found to induce apop-tosis [58]. The Q57H variant was reported early in theCOVID-19 pandemic [59] and occurs in the first trans-membrane alpha helix, TM1 [54], where it changes theamino acid glutamine (Q) with a non-charged polar sidechain to histidine (H), which has a positively chargedpolar side chain. This amino acid is at the interface ofinteraction between the two dimeric subunits of ORF3athat forms the constrictions of the ion channel but theQ57H alteration does not seem to change the ion channelproperties compared to wildtype 3a [54]. Nevertheless,its incidence is increasing in SARS-CoV-2 genomes in theUnited States [53] and the effect of Q57H may thereforeaffect the virulence in other beneficial ways than chang-ing the conductance properties of the ion pore.The association between 8782 and 28144 (rank 5), re-ported early in SARS-CoV-2 studies [28] is between a(C/T) synonymous mutation in the gene nsp4, and a(T/C) non-synonymous mutation resulting in the L84Salteration in the gene ORF8. The first of these genes par-ticipates in the assembly of virally-induced cytoplasmicdouble-membrane vesicles necessary for viral replication.The site 8782 is located in a region annotated as CpG-rich and is the site of a CpG for the major allele (C);it has the minor (T) allele in other related viruses [28].Orf8 has been implicated in regulating the immune re-sponse [60, 61]. The L84S variant is, together with theC8782T nsp4 mutation characterizing the GISAID cladeS [62].The interaction between 14805 and 26144 (rank 9)leads to non-synonymous alterations in nsp12 (T455I,note that the reference is Y) and ORF3a (G251V) respec-tively. The G251V has been reported by many studiesand is defining the GISAID V clade [62] together withthe L37F nsp6 variant (position 11083, rank 47). Thewidely reported G251V variant is unfortunately outsideof the proposed Cryo-EM structure [54] and it is un-known how this glycine to valine substitution affects pro-tein function. nsp12 is the RNA-dependent RNA poly-merase and the T455I substitution is found where thereference Wuhan-Hu-1 has a tyrosine residue in one ofthe alpha helices of the polymerase ”finger” domain [63].Threonine can similarly to tyrosine be phosphorylatedbut also glycosylated, it is polar, uncharged and can form hydrogen bonds that may stabilize the alpha helix.Isoleucine on the other hand is non-polar and unchargedand both the residues are smaller than the aromatic ty-rosine.The second interaction partner of G251V is the nsp6L37F variant. nsp6 has been shown to induce autophago-somes in the host cells in favour for viral replication andpropagation SARS-CoV [51]. There is currently no ex-perimentally validated model of nsp6 structure but anearly model suggest that the L37P variant is situated inan unordered loop between two alpha helices [64].The interaction between 17747 and 17858 (rank 27)is between two non-synonymous mutations (C/T, result-ing in P504L) and (A/G, resulting in T541C) within thegene nsp13 that codes for a helicase enzyme that unwindsduplex RNA [51]. It is the only epistatic interaction inTable III within one protein. These same two loci reap-pear in the list with ranks 26 and 36 as interacting witha C/T synonymous mutation (L7L) in gene nsp14 at po-sition 18060. The P504L and T541C are both located inthe Rec2A part of the protein that is not in direct inter-action with the other members of the RNA-dependentRNA polymerase holoenzyme, in which two moleculesof nsp13 forms a stable complex with nsp12 replicase,nsp7, and nsp8. The nsp14 protein is a bifunctional pro-tein that has a N7-methyltransferase domain and an adomain exonuclease activity, responsible for replicationproof reading (cite Denison et al ”An RNA proofread-ing machine regulates replication fidelity and diversity”).The nsp14/nsp10 proof reading machinery is thought tointeract with the replication-transcription complex butthe exact details of this interaction are not known.The final interaction (rank 21) is a link between a locuscarrying a non-synonymous mutation (C/T, T541C) innsp2 position 1059, with a locus carrying a synonymousmutation (C/T, L280L) in nsp14, position 18877. As theknowledge on nsp2 protein structure is poor there is noevidence for the effect of this mutation. Also, how thesynonymous C/T alterations in nsp14, as well as in thesynonymous mutations of the other interactions affectthe virus are unknown, but can be proposed to changeRNA secondary structure, RNA modification or codonusage.
IV. DISCUSSION
The COVID-19 pandemic is a world-wide public healthemergency caused by the β -coronavirus SARS-CoV-2.A very large and continuously increasing number ofhigh-quality whole-genome sequences are available. Wehave investigated whether these sequences show effectsof epistatic contributions to fitness. In a populationevolving under high rate of recombination, such effectsof natural selection can be detected by Direct CouplingAnalysis, a global model learning technique. The paperopens up the prospect to leverage very large collections ofgenome sequences to find new combinatorial weaknessesof highly recombinant pathogens.In this work we have considered all whole-genome se-quences of SARS-CoV-2 deposited in GISAID up to dif-ferent cut-off dates. As this coronavirus has extensiverecombination we have assumed that the distribution ofgenotypes is well described by Kimura’s Quasi-LinkageEquilibrium, and used Direct Coupling Analysis to in-fer epistatic contributions to fitness from the sequences.After filtering out all but the strongest effects and varia-tions in non-coding regions with many gaps in the MSA,the remaining predictions are few in number, i.e. predictions in Table I.Co-variations between allele distributions at differentloci can be due to epistasis and also to inherited effects(phylogeny). We have tested for the second type by ran-domizing Multiple Sequence Alignment of sequences suchthat pair-wise distances between sequences are left in-variant. We find that the top link 1059-25563 appears3 times in 50 phylogeny randomizing samples, thoughwith much lower rank. The other predicted epistaticcontributions disappear under phylogenetic randomiza-tion, except for pairs in the triple (3037, 14408, 23403)which appear in from 20% to 35% of 50 randomizations.After eliminating these links as well as links between ad-jacent loci (28881, 28882, 28883, which appear in from14% to 16% in 50 samples), we are left with eight pre-dictions listed in Table III. We consider it likely thatthese retained interactions are due to epistasis, and notto inherited co-variation. An analogous investigation ona smaller dataset obtained with an earlier cut-off date(2020-05-02) and reported in Appendix L yield six re-tained predictions, involving however the same eight viralgenes. The question on epistasis vs. effects of inheritance(phylogeny) clearly merits further investigation and test-ing as more data will become available.Biological fitness is a many-sided concept and can alsoinclude aspects of game and cooperation [65–67]. A fit-ness landscape describes the propensity of an individualto propagate its genotype in the absence of strategic in-teractions with other genotypes, and has traditionallybeen used to model the evolution of pathogens coloniz-ing a host, for earlier use relating to HIV and using DCAtechniques, see [68]. The additive and epistatic contri-butions to fitness of the virus which we find describe thevirus in its human host and therefore likely reflect host-pathogen interactions to a large extent. A conceptualsimplification made is that all hosts have been assumedequivalent. In future methodological studies it would beof interest to consider possible effects of evolution in acollection of landscapes, representing different hosts, andto correlate such dynamics to host genotypes. As thisrequires other data than available on GISAID, and lessabundant at this time, we leave this for future work. Onthe other hand, it is unlikely that the inferred couplingsinvolve the host as a temporal variable, due to the muchfaster time scale of the evolution of the virus.Epistatic interactions are pairwise statistical associa-tions, but are not simply correlations. The interactionbetween sites 8782 and 28144, which is the second largestin Table III, was identified as a very strong correlation ina very early study [28]. As shown in Table 5 this correla-tion has generally decreased over time (using data withsuccessively later cut-off dates). In the alternative globalmodel learning method of DCA which we use in thepresent work, the score of statistical inter-dependency ofthis pair has remained large, and the pair is consistently ranked first or second over four different cut-off dates, seeFig. 5. While our data hence supports the observation ofstatistical inter-dependency in this pair first made in [28],it does not support the interpretation made in the samework that the effect is due to phylogeny. The later crit-icism in [33] therefore does not apply to our work sincean epistatic interaction, recovered through DCA and aQuasi-Linkage Equilibrium assumption in a populationthoroughly mixed by recombination, is different in na-ture from a phylogenetic effect.DCA techniques have been applied to find candidatetargets for vaccine development. In a series of studies itwas found that combinations of mutations implied by se-quence variability in the HIV-1 Gag protein correlate wellwith in vitro fitness measurements, and clinical observa-tions on escape strains (HIV strains that tend to dom-inate in one patient over time) and the immune systemof elite controllers (HIV-positive individuals progressingslowly towards AIDS) [18, 68, 69]. While this may bea promising future avenue in COVID-19 research, in thepresent study we have not found any epistatic interac-tions involving Spike, only pairs that also show up underphylogeny randomization or that are quite weak, see Ap-pendix J. The Spike protein has been the main target ofcoronovirus vaccine development to date [70], includingagainst SARS-CoV-2 [71–73].An epistatic interaction means that loss of fitness by amutation at one locus is enhanced (positive epistasis) orcompensated (sign epistasis) by a mutation at another lo-cus. Suppose there are drugs that act on targets aroundboth loci, modulating the fitness of the respective vari-ants. Epistasis then points to the possibility that usingboth drugs simultaneously may have a more than addi-tive effect. To search if our analysis offers such a guideto combinatorial drug treatment, we scanned the recentcomprehensive compilation of drugs known or predictedto target SARS-CoV-2 [74]. Five out of the eight predic-tions in Table III involve either one synonymous muta-tions or are between two mutations in the same gene. Forall the three remaining pairs of non-synonymous muta-tions, (1059, 25563), (11083, 26144) and (14805, 26144),the second locus lies in ORF3a for which no potentialdrugs are listed in [74]. The first locus in the same threepairs lie respectively in genes nsp2, nsp6 and nsp12. Oneor more already approved and practical drugs targetingnsp2 and nsp6 are listed in [74]. Ponatinib, listed fornsp12, is not appropriate against a pandemic disease likeCOVID-19 on account of its large cost. Potential drugsfor the proteins listed in Table III are summarized inTable IX in Appendix K, following [74].Nevertheless, the number of combinations of potentialdrug targets, in COVID-19 and many other diseases, isvery large. Direct Coupling Analysis applied to manysampled sequences predicts which genes/loci have mu-tual dependencies in fitness, and can be used to rankcombinations for further more detailed investigation. Wenote that one can also start a search for drug treatmentfrom conserved positions, assuming these to be uncondi-tionally necessary for the virus. If so, all potential pairswould however be ranked equal based on sample infor-mation, and there would be no analogous short-cut tothe combinatorial explosion of possibilities. Even if thescan discussed above did not lead to any direct sugges-tions based on the lists of potential drugs in [74], wehope the general approach could have value given thecontinuing increase and availability of genome sequencesof both viral and bacterial pathogens. We finally notethree out of eight of our list of predictions involve lociin viral gene ORF3a, the action of which is related tosevere manifestations of COVID-19 disease [56–58]. FIG. 2. Top-200 significant pairwise epistasis from the 2020-08-08 data-set between loci in coding regions. Colored linesindicate for top 50s, grey lines top 50-200. Red lines showshort-distance links (distance less than or equal to 3 bp) bluelines show links of longer distance. The colourful links are thesame pairs as listed in Table I. Analogous circos plots for the2020-05-02 data set is shown in Appendix L, and for the 2020-04-01 and 2020-04-08 data-sets on the GitHub repository [29].
ACKNOWLEDGMENTS
We thank Profs Martin Weigt and Roberto Mulet fornumerous discussions, and Martin Weigt for sharing theuse of the ‘phylogeny’ developed with ERH. EA thanksArne Elofsson and other participants of the Science forLife Labs (Solna, Sweden) ”Viral sequence evolution re-search program” for input and suggestions. The work ofHLZ was sponsored by National Natural Science Foun-dation of China (11705097), Natural Science Foundationof Jiangsu Province (BK20170895), Jiangsu GovernmentScholarship for Overseas Studies of 2018, and ScientificResearch Foundation of Nanjing University of Posts andTelecommunications (NY217013). The work of Vito Di-chio was supported by Extra-Erasmus Scholarship (Uni-versity of Trieste) and Collegio Universitario ’LucianoFonda’. ERH acknowledges funding by the EU H2020 re-search and innovation programme MSCARISE-2016 un-der grant agreement No. 734439 INFERNET.
Appendix A: Quasi-linkage equilibrium (QLE) andits range of validity
Quasi-linkage equilibrium was discovered by Kimura[75] and investigated more recently by Neher and (cid:12)(cid:12) (cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12) (cid:12)(cid:12)(cid:12)(cid:12) (cid:12)(cid:12) (a)(b) ↓ (c) ↓ FIG. 3. Histograms of PLM scores for (a) original 2020-08-08 data-set, (b) a phylogenetic randomized sample and (c)a profile randomized sample. The blue bars for all scoreswhile the red ones for top-50 largest scores. Red arrows in(a) indicate links listed in Table III. The largest PLM score ispointed to by red arrows for random samples in (b) and (c).None of them is located inside a coding region, and none ofthem appear in Table I and Table III.
Shraiman [11, 12]. The connection to inference was madein [13] where the theory was also extended from Booleanvariables (bi − allelic loci) to categorical data (arbitrarynumber of alleles per locus).QLE refers to the state of a population comprising N individuals. That can be characterized by the list ofgenotypes present, or equivalently by an empirical prob- TABLE I. Significant links with rank within top-200s between pairwise loci for the 2020-08-08 data-set.Rank a Locus 1 b mutation c Locus 2 mutation PLM-protein -type -protein -type score1 1059-nsp2 C | T-non. 25563-ORF3a G | T-non. 1.71912 28882-N G | A-syn. 28883-N G | C-non. 1.49963 28881-N G | A-non. 28882-N G | A-syn. 1.48164 28881-N G | A-non. 28883-N G | C-non. 1.47835 8782-nsp4 C | T-syn. 28144-ORF8 T | C-non. 1.44719 14805-nsp12 C | T-syn. 26144-ORF3a G | T-non. 1.139212 3037-nsp3 T | C-syn. 14408-nsp12 T | C-non. 1.029113 18877-nsp14 C | T-syn. 25563-ORF3a G | T-non. 1.013114 3037-nsp3 T | C-syn. 23403-S G | A-non. 1.011417 14408-nsp12 T | C-non. 23403-S G | A-non. 0.991721 1059-nsp2 C | T-non. 18877-nsp14 C | T-syn. 0.919726 17858-nsp13 A | G-non. 18060-nsp14 C | T-syn. 0.862427 17747-nsp13 C | T-non. 17858-nsp13 A | G-non. 0.855336 17747-nsp13 C | T-non. 18060-nsp14 C | T-syn. 0.778047 11083-nsp6 G | T-non. 26144-ORF3a G | T-non. 0.734063 20268-nsp15 A | G-syn. 25563-ORF3a G | T-non. 0.6474134 11083-nsp6 G | T-non. 14805-nsp12 C | T-syn. 0.5040147 11083-nsp6 G | T-non. 28144-ORF8 T | C-non. 0.4928168 8782-nsp4 C | T-syn. 11083-nsp6 G | T-non. 0.4770 a Indices of significant links in the top 50s with both terminals located inside a coding region, inferred by PLM. The analogous table forthe 2020-05-02 data-set is shown in Appendix L b Information on locus 1: index in the reference sequence, the protein it belongs to. The convention used is that locus 1 (“startinglocus”) is the locus of lowest genomic position in the pair. c Information on mutations of locus 1: the first and second prevalent nucleotide at this locus, mutation type: synonymous(syn.) /non-synonymous(non.).
TABLE II. Top 200s that appeared (with an appearance ra-tio ≥
10% ) in samples with phylogeny randomization strat-egy based on the 2020-08-08 data-set. phylogeny samplesare considered in total.Hit a Locus 1 mutation Locus 2 mutationratio -protein -type -protein -type14% 28881-N G | A-non. 28882-N G | A-syn.16% 28881-N G | A-non. 28883-N G | C-non.20% 28882-N G | A-syn. 28883-N G | C-non.22% 3037-nsp3 T | C-syn. 14408-nsp12 T | C-non.20% 3037-nsp3 T | C-syn. 23403-S b G | A-non.34% 14408-nsp12 T | C-non. 23403-S G | A-non. a The indices of samples with phylogeny randomization whichpreserve the links listed in Table I are shown here. The circosplots for the significant epistatic links of all randomizedsamples are available in SI b In amino acid notation this mutation is D614G in Spike. ability distribution P ( e ) ( σ ) = 1 N (cid:88) s σ,σ ( s ) (A1)In above σ ( s ) is the genotype of individual ( s ) and σ,σ ( s ) means that individuals of genotype σ are counted in thesum. Each genotype can hence appear zero, once ormany times; if more than zero such a group is referredto as a clone , and if more than once a multi-individualclone, or a proper clone. We will in the following for sim-plicity take N to be fixed. In QLE there are few properclones, i.e. most individuals in the population have aunique genotype. TABLE III. Potentially significant epistatic links in Table I,and corresponding amino acid mutationsRank a Locus 1- amino acid Locus 2- amino acidprotein mutation protein mutation1 b c ) 25563-ORF3a Q57H(Q)5 8782-nsp4 S76S(S) 28144-ORF8 L84S(L)9 14805-nsp12 T455I(Y) 26144-ORF3a G251V(G)21 1059-nsp2 T85I(T) 18877-nsp14 L280L(L)26 17858-nsp13 T541C(Y) 18060-nsp14 L7L(L)27 17747-nsp13 P504L(P) 17858-nsp13 T541C(Y)36 17747-nsp13 P504L(P) 18060-nsp14 L7L(L)47 11083-nsp6 L37F(L) 26144-ORF3a G251V(G) a Main prediction: eight epistatic links. The links preserved byphylogeny randomization in Table II are not listed here. b This link appears in 3 out of 50 (6%) phylogenyrandomizations; once (experiment 23) with rank 34, and twice(experiments 29 and 47) with ranks in 51 − c Amino acid in the reference sequence Wuhan-Hu-1 at theposition specified by the number between major and minoralleles.
In models of evolution where the QLE concept is per-tinent populations change in time due to mutations, re-combination, fitness and genetic drift. Mutation and re-combination are random events characterized by averagerates. Fitness is the propensity of an individual to haveoffspring in the next generation and genetic drift is therandomness associated with this process. The popula-tion history is hence given by a sequence of empiricalprobability distributions indexed by time ( t ), P ( e ) ( σ, t ).The development in time can alternatively be formu-lated on the level of ensemble probability distributions . In ↓ FIG. 4. Frobenius norm of pairwise correlations between locifor the original 2020-08-08 data-set. The score pointed by thered arrow corresponds to the link of 1059-25563.
Date of Data-sets R a nk s Corr_8782_28144PLM_8782_28144Corr_1059_25563PLM_1059_25563Corr_17747_17858PLM_17747_17858Corr_17858_18060PLM_17858_18060Corr_17747_18060PLM_17747_18060
FIG. 5. Ranks for significant epistatic effects with data col-lection date (2020-04-01, 2020-04-08, 2020-05-02 and 2020-08-08) by PLM (dashed lines) and correlation analysis (solidlines). The ranks of the PLM scores are almost constant whilethe ranks of the correlations vary significantly and mostlydrop as more data accumulate (later cut-off dates). general that will be the time-dependent joint probabilitydistributions of N genomes P N ( σ , σ , . . . , σ N , t ). Muta-tion, fitness and genetic drift act on each genotype in thepopulation separately. They can thus be formulated forthe marginalization P N to a one-genome ensemble prob-ability distribution P ( σ, t ). Recombination acts on twogenomes at a time. They can thus be formulated on thelevel of two-genome ensemble probability distributions P ( σ , σ , t ).QLE is characterized by multi-genome distributionfunctions factorizing i.e. by P ( σ , ..., σ N ) ≈ P ( σ ) ...P ( σ N ) . (A2)As a consequence, evolution can be considered on thelevel of one-genome ensemble probability distributionsonly. The resulting equations have been written out anddiscussed in [11–14] and are structurally similar to theBoltzmann equation in gas kinetics, where recombina-tion plays the role of a collision term. It is a reasonablepicture to consider the list of N genotypes given by P ( e ) as N samples from P . However, as follows from above,the relation between the two distributions is not direct: P ( e ) changes according to a stochastic and frequency- TABLE IV. Top-10 links found by correlation analysis in thecoding region for the data-set till 2020-08-08.Rank a Locus 1 Locus 2 Frobenius-protein -protein Score455 3037-nsp3 23403-S 0.3844458 3037-nsp3 14408-nsp12 0.3842460 14408-nsp12 23403-S 0.3837581 28882-N 28883-N 0.3609584 28881-N 28883-N 0.3603585 28881-N 28882-N 0.36031071 1059-nsp2 25563-ORF3a 0.28212394 8782-nsp4 28144-ORF8 0.18033969 23403-S 28144-ORF8 0.14873980 3037-nsp3 28144-ORF8 0.1486 a Rank for top-10 links as ranked by correlation analysis.Correlations between loci of which at least one outside codingregions are omitted. independent evolution law while P obeys a deterministicnonlinear partial differential equation.The form of QLE, for only bi-allelic loci and only one-locus fitness terms (additive fitness) and two-loci fitnessterms (epistatic fitness) is that of the Ising model of sta-tistical mechanics i.e. for a genome σ = ( s , . . . , s L ) P ( σ ) = 1 Z exp (cid:88) i h i s i + (cid:88) ij J ij s i s j (A3)In above both sums go over the loci of variability on thegenome. In some applications they could be all or al-most all positions, but in the application to SARS-CoV-2genomes they are but a small fraction. At most genomicpositions all samples carry the same variant; those posi-tions are not included in the sums in (A3) which focuson the variable portions. Further, s i = 1 in (A3) encodesone of the alleles (typically the major allele) and s i = − h i and J ij parametrize the distribution and Z is a normal-ization constant (partition function). The extension ofthe above to more the bi-allelic loci (”Potts genomes”)is given in [13], and restated in main body of the paper,Eqs. (3) and (4). Although we in this work only findproperties of almost bi-allelic pairs of loci, the interme-diate analysis allows multi-allelic states (see below).Known sufficient conditions for QLE are that recombi-nation is sufficiently high everywhere compared to selec-tion, and mutation rate is non-zero. The part of the ar-gument concerning recombination was made on the phys-ical level of rigor in [12], by estimating terms in a Taylorseries in inverse recombination rate. The part of the ar-gument concerning mutation is implicit in [12], and basedon the observation that without mutations, the most fitgenotype will eventually take over in a finite population.This means that mutation must be non-zero, as otherwiseQLE will only be a (possibly long-lived) transient [14].On a qualitative level the argument for QLE is analogousto the accuracy of Boltzmann equation and the station-ary state being of Gibbs-Boltzmann type when collisionrate is high enough [13].Many realistic models encode that recombination actsmore weakly between closely spaced loci. Taken at face0value, the conditions in [11–13] would then imply thatthe conditions for QLE can be at hand for loci sufficientlyfar apart on the genome, but not for closely spaced loci.A quantitative theory of this effect as it relates to QLEis at this point lacking. Numerical results in [11, 12]showed that the characteristics of QLE may be present,and numerical results in [14] showed that QLE can beused for inference in this setting. Necessary conditions for the validity of QLE are atthis point not known. It may be that states similar toQLE can also be found in other parameter ranges thanat high recombination. An argument in favor is the re-cent in silico result in [76] where a modified form of QLEinference was shown to work also for moderate recombi-nation, given instead high enough mutation rates.
Appendix B: Approximate inference in QLE
QLE gives quantitative relations between evolutionarydynamics and distributions over genotypes in a popula-tion. In this section we state and briefly describe whatthese relations are, and how we use them in this work.We are only here concerned with the most impor-tant relation, which connects the parameters J ij in (A3)to parameters describing how fitness and recombinationshape the population. Fitness is here taken as a propen-sity of a genotype to propagate in the next generation.It is parametrized as F ( σ ) = (cid:88) i f i s i + (cid:88) ij f ij s i s j (B1)where the f i and f ij are respectively the additive andthe epistatic components of fitness and the genome isassumed bi-allelic as in (A3); for the extension to multi-allelic sites see [13]. The meaning of (B1) is that indi-viduals with higher total fitness (higher F ( σ )) will havehigher expected number of offspring. Recombination isdescribed by an overall rate r and a locus-pair-dependentquantity c ij which is the probability that alleles at loci i and j are inherited from different parents if a recombina-tion event has taken place. In many cases it is reasonable to assume that this quantity is small (close to zero) when i and j lie sufficiently closely on the genome, and approx-imate if they are distant.The relation first found by Kimura for a two-locusproblem is J ij = f ij r · c ij (B2)In a world abundant in sequenced genomes the left handside of above can be inferred from samples. It is thuswhat can be considered known, while what causes it, f ij and the influence of recombination, are quantities notknown directly. It is therefore useful to re-state (B2) as f ∗ ij = J ∗ ij · rc ij (B3)where the star indicates that these quantities are inferredfrom data.Finally, for sites far enough apart on the genome theabove can be summarized that epistatic term in fitness( f ij ) is proportional to pair-wise term in the distribu-tion ( J ij ). A ranking of pairs in descending order of J ij is hence also a ranking of pairs in descending orderof epistatic fitness, if closely spaced pairs are excluded.This is hence the theoretical basis of the analysis in themain body of the paper, where we rank pairs as to thevalues of J ∗ ij , as inferred from samples. The extension ofthe above to Potts genomes (multi-allelic states) is givenin [13] and [14]. Appendix C: Methods of DCA
Determining the coefficients in a distribution of thetype (A3) from data has been called a problem of in-ference in exponential families in statistics [77], an in-verse Ising problem in statistical physics [40, 43, 78] and direct coupling analysis (DCA) in computational biol-ogy [15, 16]. The benchmark method for such problemsis maximum likelihood (ML). This is built from the as-sumption that N observed genotypes are independentdraws from a distribution of type (A3). The logarithmof the probability of the joint distribution (log-likelihood)is then L ( { h i } , { J ij } ) = (cid:88) i h i (cid:32) N (cid:88) s s si (cid:33) + (cid:88) ij J ij (cid:32) N (cid:88) s s si s sj (cid:33) − log Z (C1)The averages on the right hand side multiplying h i and J ij are empirical averages . Since these are the only prop-erties of the samples which enter into the log-likelihoodthey are sufficient statistics for ML inference. The parti-tion function ( Z ) is computationally difficult to evaluate,and a large number of other inference procedures to cir-cumvent this problem are therefore widely used, reviewedrecently in [16, 40].One family of widely used approaches is variational methods. This is based on minimizing the distance be-tween the empirical distribution and a suitable trial dis-tribution, in practice mostly factorized distribution [77],although later distribution of the type (A3) have alsobeen used [40] (and references therein). The most widelyused variational inference is naive mean-field which isequivalent to treating the Ising (or Potts) distributionover discrete variables as if it were a Gaussian distri-bution over continuous variables. While naive mean-1field inference was the basis for important advances inprotein structure prediction [15, 79], it has more re-cently been overtaken by pseudo-likelihood maximization(PLM) [16, 40–43]. PLM is the DCA method used in thiswork. The principles are described in Materials & Meth-ods section of the main body of the paper, text aroundEqs. (5) and (6). The implementation of PLM which wehave used (also called PLM), due to Chen-Yi Gao [80] isavailable on [45]. Appendix D: Phylogenetic randomization of DCA:principles
This section describes the two means of phylogeneticrandomization used in the main paper. A full-lengthtechnical description of both methods is available (fromco-author ERH) at [49].
Profile randomization:
This approach is built onthe principle of randomizing the input multi-sequencealignment (MSA) by conserving the single-columnsstatistics w i ( σ ), for all sites i = 1 , ..., L and all nu-cleotides or gaps σ ∈ {− , N, A, C, G, T } . This is done byindependent columns shuffling which destroys all kindsof correlations (both coevolutionary ones and phyloge-netic ones) present in the alignment, only the residueconservation patterns of the original MSA are preserved.The randomized sequences become an independently andidentically distributed sample from the profile model P ( σ , ..., σ L ) = L (cid:89) i =1 w i ( σ i ) . Profile randomization hence serves to verify that the ef-fects found are not due to random sampling. An alterna-tive way to reach the same goal, and high-lighted in themain body of the paper, is to repeat the analysis withsuccessively larger collections of genomes obtained fromusing successively later cut-off dates.
Phylogeny randomization : This more advancedapproach is built on the principle of randomizing theinput MSA preserving both single columns statistics andpairwise Hamming distances between sequences repre-senting the genotypes. This method was first presentedin detail in [49]. Computational methods to infer phy-logeny which rely on such pair-wise Hamming distanceswould by design be insensitive to phylogeny randomiza-tion, i.e. the give the same result as using input MSA.Phylogeny randomization hence serves to distinguish co-variation between loci due to phylogeny (inheritance)from co-variation due to co-evolution (epistasis).The method is initialized with an alignment resultingfrom the “profile” randomization, eliminating all preex-isting correlations, to then start a simulated annealingbased method to construct a sequence alignment withinter-sequences distances of the original MSA as target.Single variables σ ( i ) k are permuted in the following way:at each move t, a column k and two rows m and n arechosen at random, and an attempt to exchange σ ( m ) k and σ ( n ) k is made. The probability of the exchange to takeplace is the Metropolis-Hastings acceptance probability: P acc ( t, β ) = min (cid:0) , exp (cid:0) β (cid:0) || h t − h target || − || h t − − h target || (cid:1)(cid:1)(cid:1) (D1)where h t − and h t are the Hamming distance matricesof the current alignment before and after the exchange, h target the Hamming distance matrix corresponding tothe original alignment, || . . . || stands for the Frobeniusmatrix norm, and β is an inverse temperature parameter.Thus, a move is more likely to be accepted if it makesthe Hamming distance matrix of the alignment closerto that of the target. Parameter β is initialized at a lowvalue and then slowly increased as more moves are made,When β goes to infinity and annealing has proceededslowly enough h → h target .This procedure never changes the single site statisticsof the target alignment, since exchanges are made insideone column. On other hand all the epistatic correlationsare destroyed. If we assume that the target Hammingdistance matrix is a measure of the phylogenetic informa-tion in the original MSA, as is done in several tree build-ing methods (distance-based methods) [81, 82], then weexpect that resulting alignment present this hierarchicalsignal. Appendix E: Phylogenetic randomization of DCA:results and visualization
Starting from the filtered MSA for the 2020-08-08data-set, we apply the two randomization strategies de-scribed above. For profile randomization 50 samples aregenerated and the PLM procedure is implemented to in-fer the epistasis between pairwise loci.As stated in main body of the paper, for further anal-ysis we only retained pairs where both the inferred lociare located in the coding region and the first and sec-ond prevalent nucleotides of these loci in the analyzeddata-set are entries in { A,C,G,T } . We observed no suchpairs in the top-200 PLM scores, for any profile random-ized samples. Therefore, the examined ranks of PLMscores are extended to top-2000s. In Fig. 6, the inferredtop-2000 epistasis for each sample are shown. There are24 out of 50 samples which contain interactions found byPLM. However, as they all have low rank , none ofthese links show up in Table I in the main body of thepaper.For phylogeny randomization we also generated 50random samples. The inferred epistasis for these sam-ples are shown in Fig. 7 and continued in Fig. 8. Unlike2profile strategy, a subset of the putative epistatic links ofTable I in the main body of the paper also show up afterphylogeny randomization. This subset of predictions islisted in Table II in the main body of the paper, andhas been eliminated in the list of retained predictions inTable III. Appendix F: Robustness as to thresholds
In the analysis in main body of the paper, the follow-ing criteria are used for pre-filtering of the MSA. First, ifin one locus the same nucleotide is found in a percentage P major ≥ .
5% of the samples, or if the sum of the per-centages of A, C, G and T at this position is P sum ≤ P major at a certain locusbut not to P sum . We chose the criterion P major ≥ . P major , we performed PLM inference over the 2020-08-08dataset with P major = 90% and P major = 95% respec-tively. The inferred epistasis for these two P major s areshown in the left and right panel of Fig. 9 respectively.By increasing the threshold, more epistasis effects areinferred. In summary, most of the links presented withlower P major remain in the results with higher thresh-olds.For the visualization of epistasis effects, we focus onlinks with both terminals located in coding regions andmutation types are gaps and ’N’s excluded. As shown inFig. 9, different colors mean links with different ranks.Red (links with terminals located close to each other) andblue (the terminals are far away from each other) are forthe epistasis links with ranks less than 50 while grey forlinks with ranks within 51 to 200. Both panels clearlyshow that with PLM method, most significant links arein top-50s. This indicates that the choice of visualizationthreshold are reasonable in the presentation of epistasiseffects. Appendix G: Different quantifications of correlations
The correlation (co-variance) of two Boolean variablesis given by one real number. If the Boolean variablesare represented as spin variables s i and s j taking values1 and −
1, this number is the physical correlation χ ij = (cid:104) s i s j (cid:105) − χ i χ j , where χ i = (cid:104) s i (cid:105) and χ j = (cid:104) s j (cid:105) . Since everyjoint distribution on two Boolean variables can be writ-ten p ( s i , s j ) = (1 + χ i s i + χ j s j + ( χ i χ j + χ ij ) s i s j )with marginals p ( s i ) = (1 + χ i s i ) and p ( s j ) = (1 + χ j s j ), the contrast is c ( s i , s j ) = p ( s i , s j ) p ( s i ) p ( s j )= 1 + χ i s i + χ j s j + ( χ i χ j + χ ij ) s i s j χ i s i + χ j s j + χ i χ j s i s j . (G1) The mutual information (MI), recently used in genome-scale epistasis analysis in [83], is the expected log-contrast I ij = (cid:88) s i ,s j p ( s i , s j ) log c ( s i , s j )= 14 (cid:18) (1 + χ i + χ j + a ij ) log 1 + χ i + χ j + a ij χ i + χ j + χ i χ j +(1 − χ i + χ j − a ij ) log 1 − χ i + χ j − a ij − χ i + χ j − χ i χ j +(1 + χ i − χ j − a ij ) log 1 + χ i − χ j − a ij χ i − χ j − χ i χ j + (1 − χ i − χ j + a ij ) log 1 − χ i − χ j + a ij − χ i − χ j + χ i χ j (cid:19) . (G2)with a ij = χ i χ j + χ ij .For given magnetizations ( χ i and χ j ), correlations andmutual information of Boolean variables are hence inone-to-one correspondence. In particular, zero correla-tion implies zero MI. For categorical data (more thantwo states per variable), correlation is conveniently de-fined as a matrix f ij ( a, b ) = (cid:104) s i ,a s j ,b (cid:105) − f i ( a ) f j ( b ) (G3)where f i ( a ) = (cid:104) s i ,a (cid:105) and f j ( b ) = (cid:104) s j ,b (cid:105) . Mutual infor-mation is defined in the same way as in (G2), except thatthe sums go over the ranges of indices a and b . Frobeniusnorm (or score) of a correlation matrix is defined as s ij = (cid:115)(cid:88) a,b f ij ( a, b ) (G4)Mutual information and Frobenius norm of correla-tions of categorical variables are not generally related.It could therefore be the case that the information Fig.(3) and Table IV in main body of the paper would bedifferent if the assessment would be done for mutual in-formation. Fig. 10 and Table V show that this is sub-stantially not the case.Another way to quantify interdependence between tworandom variables is the p -value of Fisher’s exact test [84],recently used in the present context in [36]. If in n samples in total outcome ab is found n ab time thenthe p -value of Fisher’s exact test is the probability thatthese outcomes would have been observed in indepen-dent draws of independently distributed variables. ForBoolean variables that has the exact expression3 profile 2 profile 4 profile 5 profile 6 profile 7 profile 11 profile 17 profile 18 profile 20 profile 23 profile 25 profile 30 profile 31 profile 32 profile 36 profile 37 profile 39 profile 40 profile 41 profile 46 profile 47 profile 43 profile 49 profile 50 FIG. 6. Links within top-2000 PLM scores of samples with profile randomization. The links located in coding regions andterminals have mutations excluding gaps and ’N’s are retained in these circos plots. There are 24 out of 50 samples containlinks. However, all of these are ranked low starting from the original MSA, and none of them show up in Table I in the mainbody of the paper. p ij ( n, { n ab } ) = ( n + n − )! · ( n − + n − − )! · ( n + n − )! · ( n − + n − − )! n ! · n − ! · n − ! · n − − ! · n ! (G5)In the limit of many samples n ab = n · p ij ( a, b ) almost surely, and by Stirling’s formulalog p ij ( n ) ≈ n (cid:0) ( p ij (1 ,
1) + p ij (1 , − p ij (1 ,
1) + p ij (1 , − p ij ( − ,
1) + p ij ( − , − p ij ( − ,
1) + p ij ( − , − p ij (1 ,
1) + p ij ( − , p ij (1 ,
1) + p ij ( − , p ij (1 , −
1) + p ij ( − , − p ij (1 , −
1) + f ij ( − , − − p ij (1 ,
1) log p ij (1 , − p ij (1 , −
1) log p ij (1 , − − p ij ( − ,
1) log p ij ( − , − p ij ( − , −
1) log p ij ( − , − (cid:1) , (G6)which is mutual information, up to a factor − n . In a more general setting, this result follows from Sanov’s4 phylogeny 1 phylogeny 2 phylogeny 3 phylogeny 4 phylogeny 5 phylogeny 6 phylogeny 7 phylogeny 8 phylogeny 9 phylogeny 10 phylogeny 11 phylogeny 12 phylogeny 13 phylogeny 14 phylogeny 15 phylogeny 16 phylogeny 17 phylogeny 18 phylogeny 19 phylogeny 20 phylogeny 21 phylogeny 22 phylogeny 23 phylogeny 24 phylogeny 25 phylogeny 26 phylogeny 27 phylogeny 28 phylogeny 29 phylogeny 30 FIG. 7. Links within top-200 PLM scores of the 1st to the 30th samples with phylogenetic randomization. The links locatedin coding regions are retained in these circos plots. Each phylogenetic sample contains interactions effects with a rank cut-offthreshold of 200. Blue shows links with lengths d ij longer than 3 nucleotide. Red shows links with d ij ≤ nt for links withrank in top-50s. Black shows links with d ij ≤ nt for links with rank in top 51-200. lemma in Information Theory. For the data we considerwe are always in the range of very large n , and Fisher’sexact test therefore does not give additional information. Appendix H: On putative couplings between nsp7(11843..12091) and nsp8 (12092..12685)
As described above, the set of loci remaining afterfiltering depends on the percentage of the a major nu-5 phylogeny 31 phylogeny 32 phylogeny 33 phylogeny 34 phylogeny 35 phylogeny 36 phylogeny 37 phylogeny 38 phylogeny 39 phylogeny 40 phylogeny 41 phylogeny 42 phylogeny 43 phylogeny 44 phylogeny 45 phylogeny 46 phylogeny 47 phylogeny 48 phylogeny 49 phylogeny 50
FIG. 8. Links within top-200 PLM scores of the 31st to 50th samples with phylogenetic randomization.TABLE V. Top-10 significant links found by mutual information (MI) analysis in the coding region for the dataset with cut-offdate 2020-08-08.Rank a Locus 1 - Locus 2 - Mutualprotein protein Information782 28882-N 28883-N 0.5752792 28881-N 28882-N 0.5698794 28881-N 28883-N 0.5689824 3037-nsp3 23403-S 0.5552836 3037-nsp3 14408-nsp12 0.5478840 14408-nsp12 23403-S 0.54551643 1059-nsp2 25563-ORF3a 0.32611775 8782-nsp4 28144-ORF8 0.311818484 17858-nsp13 18060-nsp14 0.170519364 17747-nsp13 17858-nsp13 0.1679 a Rank for top-10 coding links highlighted by mutual information (MI) analysis. Epistatic links with terminals having synonymousmutations or located in the non-coding regions are omitted. cleotide P major used as threshold. As the two viral genesnsp7 and nsp8 are known to interact [85] we looked forepistatic interactions between loci in these two genes.None appear using the threshold P major = 96 .
5% em-ployed in the main body of the paper. All loci locatedin nsp7 and nsp8 are deleted during filtering procedureusing this threshold. To nevertheless consider possible epistasis effects within these two regions, we increasedthe value of P major . As P major grows more loci remainafter filtering, in all regions, which increases the compu-tational burden. To mitigate this effect, we further fil-tered the MSA matrix by considering the mutation typeof each remaining loci. In the following we have filteredout loci where one of the dominant mutation type are6 FIG. 9. Top-200 significant pairwise epistasis from the 2020-08-08 data-set. Upper: the filtering threshold is 90% (for agiven locus, if the percentage of a certain main nucleotide(A, C, G, T, -, N) exceeds the threshold, this locus will bedeleted.); Lower: the threshold value is 95%. Only epistaticlinks with both terminals located in the coding region andno gaps and ’N’s included in mutations are presented here.Colored lines indicate for top 50s, grey for links with ranksin top 50-200s. Red lines show short-distance links ( ≤ gaps (’-’) or unknown nucleotide (’N’).With these alternative filtering criteria, we tested val-ues of P major from 97% to 99.9% and found that theloci in nsp7 and nsp8 show up when P major > . P major = 99 , P major = 99 , P major = 99 . P major = 99 . ↓ FIG. 10. Distribution of Mutual information (MI) betweenpairwise loci for the original data-set with cut-off date 2020-08-08. The score pointed by the red arrow corresponds tothe link with most significant score (1059 to 25563) by PLManalysis. number of inferred links is 167,910. The epistasis analy-sis results provided by PLM procedure are summarizedin Table VI. Some significant links with high ranks areincluded for comparison. Both ranks and PLM scoresshow that the epistasis effects between nsp7 and nsp8 aremuch weaker than the significant ones. The ranks for thelisted links in Table VI are obtained without consideringgaps ’-’ and not recognized notation ’N’. However, theyfit well with the results that including the effects of gapsand ’N’s, at least for the links with the rank of top-50s.In summary, links between loci in genes nsp7 and nsp8only appear using much more permissive filtering crite-ria and with much lower rank than the top-200 listed inTable I in the main body of the paper.
Appendix I: On putative couplings between nsp10(13025..13441) and nsp14 (18040..19620)
As the two viral genes nsp10 and nsp14 are also knownto interact [86, 87] we looked for epistatic interactionsbetween loci in these two genes, applying the same pro-cedure as above. With P major = 99 .
8% and P major =99 . P major = 96 .
5% is used, theseloci are filtered out and do not appear.
Appendix J: On putative couplings involving loci inSpike
D614G in Spike is a well known mutation [88] of SARS-CoV-2, and has become the most prevalent form in the7
TABLE VI. Links between loci in genes nsp7 and nsp8 obtained after the modified filtering procedure applied to the 2020-08-08data set. Loci without gaps ’-’, without not recognized ’N’ and satisfying permissive threshold criterion P major = 99 .
8% and P major = 99 .
9% have been retained and reported here.Locus 1 mutation Locus 2 mutation PLM score Rank PLM score Rank-protein -type -protein -type P major = 99 . P major = 99 . | T-non. 25563-ORF3a G | T-non. 1.9526 3 2.0365 18782-nsp4 C | T-syn. 28144-ORF8 T | C-non. 1.4479 6 1.9739 617747-nsp13 C | T-non. 17858-nsp13 A | G-non. 0.8553 27 1.0522 2217858-nsp13 T | C-non. 18060-nsp14 C | T-syn. 0.8624 26 0.9787 2717747-nsp13 C | T-non. 18060-nsp14 C | T-syn. 0.7780 36 0.8529 3911083-nsp6 G | T-non. 14805-nsp12 C | T-syn. 0.5040 134 0.7109 5811916-nsp7 C | T-non. 12199-nsp8 A | G-syn. 0.0259 19781 0.0113 8650411916-nsp7 C | T-non. 12478-nsp8 G | A-non. 0.0211 25861 0.0097 10960012025-nsp7 C | T-syn. 12550-nsp8 G | A-syn. 0.0325 2093312025-nsp7 C | T-syn. 12199-nsp8 A | G-syn. 0.0264 2949612025-nsp7 C | T-syn. 12534-nsp8 C | T-non. 0.0237 3579712025-nsp7 C | T-syn. 12478-nsp8 G | A-non. 0.0236 3593612025-nsp7 C | T-syn. 12400-nsp8 C | T-syn. 0.0202 4846411916-nsp7 C | T-non. 12400-nsp8 C | T-syn. 0.0119 8350111916-nsp7 C | T-non. 12513-nsp8 C | T-syn. 0.0117 8456211916-nsp7 C | T-non. 12534-nsp8 C | T-non. 0.0110 8910211916-nsp7 C | T-non. 12503-nsp8 T | C-non. 0.0106 9283412025-nsp7 C | T-syn. 12513-nsp8 C | T-non. 0.0099 10374611916-nsp7 C | T-non. 12550-nsp8 G | A-syn. 0.0098 10553412025-nsp7 C | T-syn. 12503-nsp8 T | C-non. 0.0098 106633TABLE VII. Links between loci in genes nsp10 and nsp14 obtained after the modified filtering procedure applied to the 2020-08-08 data set. Loci without gaps ’-’, without not recognized ’N’ and satisfying permissive threshold criterion P major = 99 . P major = 99 .
9% have been retained and reported here. Only links appears in the filtering by both values of threshold arepresented here.Locus 1 mutation Locus 2 mutation PLM score Rank PLM score Rank-protein -type -protein -type P major = 99 . P major = 99 . | T-non. 25563-ORF3a G | T-non. 1.9526 3 2.0365 18782-nsp4 C | T-syn. 28144-ORF8 T | C-non. 1.4479 6 1.9739 617747-nsp13 C | T-non. 17858-nsp13 A | G-non. 0.8553 27 1.0522 2217858-nsp13 T | C-non. 18060-nsp14 C | T-syn. 0.8624 26 0.9787 2717747-nsp13 C | T-non. 18060-nsp14 C | T-syn. 0.7780 36 0.8529 3911083-nsp6 G | T-non. 14805-nsp12 C | T-syn. 0.5040 134 0.7109 5813216-nsp10 T | G-non. 18060-nsp14 C | T-syn. 0.0407 10743 0.0201 4893013216-nsp10 T | G-non. 18788-nsp14 C | T-non. 0.0258 19879 0.0115 8552213216-nsp10 T | G-non. 18877-nsp14 C | T-syn. 0.0222 23695 0.0120 83143 global pandemic COVID19. This mutation is at po-sition of 23403 with respect to the reference genomicsequence (Wuhan-Hu-1). Through PLM procedure onthe whole genomic sequences, we observed two pairwisecouplings 3037-23403 and 14408-23403 in top-50 PLMscores. These two pairwise links however showed fairlyoften in phylogeny randomization tests, and have there-fore been interpreted as effects of shared inheritance(phylogeny). In above we show all PLM links involvingloci in Spike up to rank 5000, all of them significantlybelow top-200. A notable observation as shown in TableVIII is that the locus 23403 appears in all these links.As the epistatic inference is built on retaining the linkswith highest PLM scores that also do not also appear af-ter randomization, none of the entries in Table VIII areretained as predicted epistatic interactions in Table IIIof the main body of the paper.
Appendix K: Potential drugs for proteins in TableIII of the main body of the paper
There are eight proteins listed in Table III in mainbody of the paper. Except ORF3a, all these viral pro-teins have potential drugs listed in [74], sorted by humaninteractors of these proteins, see Table IX. As indicatedin the table, several of these drugs are approved, for dif-ferent purposes listed in [74], while some are still in pre-clinical or clinical trials.
Appendix L: Results from data set until 20200502
As more data accumulates the predictions obtainedfrom DCA may change. In the main body of the paperwe show in Fig. 4 that the leading predictions are stableusing four different cut-off dates: (April 1, April 8, May2, August 8).8
TABLE VIII. Links involving loci in Spike protein obtained after the filtering procedure described in the main context appliedto the 2020-08-08 data set with P major = 96 . P major = 96 . | C-syn. 23403-S G | A-non. 1.0114 1414408-nsp12 T | C-non. 23403-S G | A-non. 0.9917 1723403-S G | A-non. 25563-ORF3a G | T-non. 0.3440 36711083-nsp6 G | T-non. 23403-S G | A-non. 0.3246 42823403-S G | A-non. 28144-ORF8 T | C-non. 0.3240 4308782-nsp4 C | T-syn. 23403-S G | A-non. 0.3022 52214805-nsp12 C | T-syn. 23403-S G | A-non. 0.2787 67220268-nsp15 A | G-syn. 23403-S G | A-non. 0.2496 96423403-S G | A-non. 26144-ORF3a G | T-non. 0.2349 116623403-S G | A-non. 28881-N G | A-non. 0.2108 157123403-S G | A-non. 28882-N G | A-syn. 0.2093 160723403-S G | A-non. 28883-N G | C-non. 0.2083 16191059-nsp2 C | T-non. 23403-S G | A-non. 0.2000 179818060-nsp14 C | T-syn. 23403-S G | A-non. 0.1603 284417858-nsp13 A | G-non. 23403-S G | A-non. 0.1532 307717747-nsp13 C | T-non. 23403-S G | A-non. 0.1392 361818877-nsp14 C | T-syn. 23403-S G | A-non. 0.1186 4523TABLE IX. Potential drugs for interactors of proteins in Table III, as listed in [74]. Approved drugs for any purpose in boldface.Bait Interactor(s) Drug Statusnsp2 FKBP15 Rapamycin b Approved
EIF4E2/H Zotatifin b Clinical trialsORF3a None in [74] - -nsp4 NUPs RAE1 Selinexor b Approved
ORF8 DNMT1 Azacitidine a Approved
LOX CCT 365623 a Pre-clinicalFKBP7/10 Rapamycin b Approved
FKBP7, FKBP10 FK-506 b Approved
PLOD1/2 Minoxidil b Approved nsp14 IMPDH2 Merimepodib a Clinical TrialGLA Migalastat a Approved
IMPDH2 Mycophenolic acid a Approved
IMPDH2 Ribavirin a Approved
IMPDH2 Sanglifehrin A b Pre-clinicalnsp12 RIPK1 Ponatinib a Approved nsp13 PRKACA H-89 a Pre-clinicalMARK3,TBK1 ZINC95559591 a Pre-clinicalCEP250 WDB002 b Clinical Trialnsp6 ATP6AP1 Bafilomycin A1 a Pre-clinicalSIGMAR1 E-52862 a Clinical trialsSIGMAR1 PD-144418 a Pre-clinicalSIGMAR1 RS-PPCC a Pre-clinicalSIGMAR1 PB28 a Pre-clinicalSIGMAR1 Haloperidol a Approved
SLC6A15 Loratadine a Approved
SIGMAR1 Chloroquine b Approved a Entry taken from [74], Supplementary Table 4, ”Literature-derived drugs and reagents that modulate SARS-CoV-2 interactors”. b Entry taken from [74], Supplementary Table 5, ”Expert-identified drugs and reagents that modulate SARS-CoV-2 interactors”.
Here we show as a further robustness test the otherfigures and tables of the main body of the paper, but forthe second largest data set (cut-off date May 2, 2020).Table X presents the top-50 significant links for the20200502 data set, all of them appear in Table I in themain context for the 20200808 data set. Table XI showsthe top 10 significant links provided by the correlation analysis for the 20200502 data, which could be comparedwith Table IV in the main body of the paper. The Fig. 11shows the histogram of PLM scores for the original MSAs(Fig. 11(a)) as well as that for phylogeny (Fig. 11(b)) andprofile (Fig. 11(c)) randomization respectively. This plotcan be compared with the Fig. 2 in the main context.9 -1 PLM scores C o un t s original_q6Top 50s (a) ↓ (cid:38) ↓ ↓ (cid:38) -1 PLM scores C o un t s phylo_3_q6Top 50s (b) ↓ -1 PLM scores C o un t s profile_4_q6Top 50s (c) ↓ FIG. 11. Histograms of PLM scores for (a) 2020-05-02 data-set, (b) a phylogeny randomized sample and (c) a profile ran-domized sample. The blue bars for all scores while the redones for top-50 largest scores. Red arrows in (a) indicate linkslisted in Table III, with the 29th link almost overlapping withthe 22nd. The largest PLM score is pointed to by red arrowsfor random samples in (b) and (c). None of them is locatedinside a coding region, and none of them appear in Table Xand Table III in the main context. TABLE X. Top-50 significant epistatic links between loci for the 2020-05-02 data-set. To be compared to Table I in main bodyof paper.Rank a Locus 1 - b mutation- c Locus 2 - mutation- PLMprotein type protein type score1 8782-nsp4 C | T-syn. 28144-ORF8 T | C-non. 2.06492 1059-nsp2 C | T-non. 25563-ORF3a G | T-non. 1.94803 28882-N G | A-syn. 28883-N G | C-non. 1.91164 28881-N G | A-non. 28882-N G | A-syn. 1.87745 28881-N G | A-non. 28883-N G | C-non. 1.85949 17747-nsp13 C | T-non. 17858-nsp13 A | G-non. 1.479813 11083-nsp6 G | T-non. 14805-nsp12 C | T-syn. 1.387616 3037-nsp3 T | C-syn. 23403-S G | A-non. 1.337417 3037-nsp3 T | C-syn. 14408-nsp12 T | C-non. 1.276620 14408-nsp12 T | C-non. 23403-S G | A-non. 1.210122 17858-nsp13 T | C-non. 18060-nsp14 C | T-syn. 1.197329 17747-nsp13 C | T-non. 18060-nsp14 C | T-syn. 1.1392 a Indices of significant links in the top 50s with both terminals located inside a coding region. Analogous tables for 2020-04-01 and2020-04-08 data-sets are available on [29] b Information on the starting locus: index in the reference sequence, the protein it belongs to. c Information on mutations of the starting locus: the first and second prevalent nucleotide at this locus, mutation type:synonymous(syn.), non-synonymous(non.).
TABLE XI. Top-10 significant links found by correlation anal-ysis in the coding region for the data set till 2020-05-02. Tobe compared to Table 4 in main body of paper.Rank a Locus 1 - Locus 2 - Frobeniusprotein protein Score1 14408-nsp12 23403-S 0.47262 3037-nsp3 23403-S 0.47113 3037-nsp3 14408-nsp12 0.4706492 1059-nsp2 25563-ORF3a 0.2894570 8782-nsp4 28144-ORF8 0.2803776 28882-N 28883-N 0.255779 28881-N 28882-N 0.2543782 28881-N 28883-N 0.25421309 14408-nsp12 25563-ORF3a 0.20861044 23403-S 25563-ORF3a 0.2081 a Rank for top-10 coding links highlighted by correlationanalysis. Links located within coding regions and thecorresponding nucleotide mutation excluding gaps or ‘N’s areconsidered. [1] World Health Organization, “Coronavirus disease (covid-19) pandemic,” (2020), accessed September 28, 2020.[2] K. Moritz, Y. Chia-Hung, G. Bernardo, W. Chieh-Hsi,K. Brennan, P. D. M., du Plessis Louis, F. N. R.,L. Ruoran, H. W. P., B. J. S., L. Maylis, V. Alessan-dro, T. Huaiyu, D. Christopher, P. O. G., and S. S. V.,Science , 493 (2020).[3] H. Salje, C. T. Kiem, N. Lefrancq, N. Courtejoie,P. Bosetti, J. Paireau, A. Andronico, N. Hoz´e, J. Richet,C.-L. Dubost, et al. , Science (2020).[4] Y. Shu and J. McCauley, Eurosurveillance (2017).[5] M. Pachetti, B. Marini, F. Benedetti, F. Giudici,E. Mauro, P. Storici, C. Masciovecchio, S. Angeletti,M. Ciccozzi, R. C. Gallo, D. Zella, and R. Ippodrino, JTransl Med. , 179 (2020).[6] M. M. Lai and D. Cavanagh, Adv Virus Res , 1100(1997).[7] R. L. Graham and R. S. Baric, Journal of Virology , 3134 (2010).[8] J. Gribble, A. J. Pruijssers, M. L. Agostini, J. Anderson-Daniels, J. D. Chappell, X. Lu, L. J. Stevens, A. L.Routh, and M. R. Denison, “The coronavirus proofread-ing exoribonuclease mediates extensive viral recombina-tion,” bioRxiv (2020).[9] X. Li, E. E. Giorgi, M. H. Marichannegowda, B. Foley,C. Xiao, X.-P. Kong, Y. Chen, S. Gnanakaran, B. Kor-ber, and F. Gao, Science Advances (2020).[10] M. Kimura, Genetics , 875 (1965).[11] R. A. Neher and B. I. Shraiman, Proc. Natl. Acad. Sci. , 6866 (2009).[12] R. A. Neher and B. I. Shraiman, Rev. Mod. Phys. ,1283 (2011).[13] C.-Y. Gao, F. Cecconi, A. Vulpiani, H.-J. Zhou, andE. Aurell, Phys. Biol. , 026002 (2019).[14] H.-L. Zeng and E. Aurell, Phys. Rev. E , 052409(2020). [15] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S.Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa,and M. Weigt, PNAS , E1293 (2011).[16] S. Cocco, C. Feinauer, M. Figliuzzi, R. Monasson, andM. Weigt, Reports on Progress in Physics (2018).[17] K. Shekhar, C. F. Ruberman, A. L. Ferguson, J. P. Bar-ton, M. Kardar, and A. K. Chakraborty, Phys. Rev. E , 062705 (2013).[18] J. K. Mann, J. P. Barton, A. L. Ferguson, S. Omarjee,B. D. Walker, A. Chakraborty, and T. Ndung’u, PLOSComputational Biology , 1 (2014).[19] R. R. Cheng, O. Nordesj¨o, R. L. Hayes, H. Levine, S. C.Flores, J. N. Onuchic, and F. Morcos, Molecular Biologyand Evolution , 3054 (2016).[20] J. A. de la Paz, C. M. Nartey, M. Yuvaraj, and F. Mor-cos, PNAS , 5873 (2020).[21] R. Horta, P. Barrat-Charlaix, and M. Weigt, Entropy , 1 (2019).[22] P. Forster, L. Forster, C. Renfrew, and M. Forster,PNAS , 9241 (2020).[23] R. A. Khailany, M. Safdar, and M. Ozaslan, Gene re-ports et al , “A GenomicSurvey of SARS-CoV-2 Reveals Multiple Introductionsinto Northern California without a Predominant Lin-eage,” medRxiv (2020).[26] J. Phelan, W. Deelder, D. Ward, S. Campino, M. L. Hib-berd, and T. G. lark, “Controlling the SARS-CoV-2 out-break, insights from large scale whole genome sequencesgenerated across the world,” biorxiv (2020).[27] P. Sashittal, Y. Luo, J. Peng, and M. El-Kebir, “Char-acterization of SARS-CoV-2 viral diversity within andacross hosts,” biorxiv (2020).[28] X. Tang, C. Wu, X. Li, Y. Song, X. Yao, X. Wu, Y. Duan,H. Zhang, Y. Wang, Z. Qian, J. Cui, and J. Lu, NationalScience Review (2020).[29] H.-L. Zeng, “hlzeng/Filtered MSA SARS CoV 2,”Github (2020), https://github.com/hlzeng/Filtered_MSA_SARS_CoV_2 .[30] K. Katoh, J. Rozewicki, and K. D. Yamada, Briefings inBioinformatics , 1160 (2017), https://mafft.cbrc.jp/alignment/server/ .[31] S. Kuraku, C. M. Zmasek, O. Nishimura, and K. Katoh,Nucleic Acids Research , W22 (2013).[32] M. Ekeberg, T. Hartonen, and E. Aurell, J. Comput.Phys. , 341 (2014).[33] O. A. MacLean, R. J. Orton, J. B. Singer, and D. L.Robertson, Virus Evolution (2020).[34] M. J. Skwark, N. J. Croucher, S. Puranen,C. Chewapreecha, M. Pesonen, Y. Y. Xu, P. Turner,S. R. Harris, S. B. Beres, J. M. Musser, J. Parkhill,S. D. Bentley, E. Aurell, and J. Corander, PLos Genet. , e1006508 (2017).[35] B. Schubert, R. Maddamsetti, J. Nyman, M. R. Farhat,and D. S. Marks, Nat. Microbiol. , 328 (2019).[36] Y. Cui, C. Yang, H. Qiu, H. Wang, R. Yang, andD. Falush, eLife , e54136 (2020).[37] E. Schneidman, M. J. Berry, R. Segev, and W. Bialek,Nature , 1007 (2006).[38] W. P. Russ, D. M. Lowery, P. Mishra, M. B. Yaffe, andR. Ranganathan, Nature , 579 (2005).[39] M. Socolich, S. W. Lockless, W. P. Russ, H. Lee, K. H.Gardner, and R. Ranganathan, Nature , 512 (2005).[40] H. C. Nguyen, R. Zecchina, and J. Berg, Advances inPhysics , 197 (2017).[41] J. Besag, The Statistician , 179 (1975).[42] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, Ann. Stat. , 1287 (2010).[43] E. Aurell and M. Ekeberg, Phys. Rev. Lett. , 090201(2012).[44] M. Ekeberg, C. L¨ovkvist, Y. Lan, M. Weigt, and E. Au-rell, Phys. Rev. E , 012707 (2013).[45] C.-Y. Gao, “gaochenyi/cc-plm,” Github (2018), http://github.com/gaochenyi/CC-PLM .[46] M. I. Krzywinski, J. E. Schein, I. Birol, J. Connors,R. Gascoyne, D. Horsman, S. J. Jones, and M. A. Marra,Genome Res. , 1639 (2009).[47] P. Mandin, F. Repoila, M. Vergassola, T. Geissmann,and P. Cossart, Nucleic Acids Research , 962 (2007).[48] Y. Xu, S. Puranen, J. Corander, and Y. Kabashima,Phys. Rev. E , 062112 (2018).[49] E. R. Horta and M. Weigt, “Phylogenetic correlationshave limited effect on coevolution-based contact predic-tion in proteins,” bioRxiv (2020).[50] F. Wu, S. Zhao, B. Yu, Y.-M. Chen, W. Wang, Z.-G.Song, Y. Hu, Z.-W. Tao, J.-H. Tian, Y.-Y. Pei, et al. ,Nature , 265 (2020).[51] F. K. Yoshimoto, Protein J , 198 (2020).[52] R. A. Hosseini and M. A. Donald, “Implications of sars-cov-2 mutations for genomic rna structure and host mi-crorna targeting,” (2020).[53] W. Rui, C. Jiahui, G. Kaifu, H. Yuta, Y. Changchuan,and W. Guo-Wei, “Characterizing SARS-CoV-2 muta-tions in the United States,” (2020).[54] D. M. Kern, B. Sorum, C. M. Hoel, S. Sridharan, J. P.Remis, D. B. Toso, and S. G. Brohawn, “Cryo-EM struc-ture of the SARS-CoV-2 3a ion channel in lipid nan-odiscs,” bioRxiv (2020).[55] Y.-J. Tan, P.-Y. Tham, D. Z. L. Chan, C.-F. Chou,S. Shen, B. C. Fielding, T. H. P. Tan, S. G. Lim, andW. Hong, J Virol , 1008310087 (2005).[56] W. Lu, B.-J. Zheng, K. Xu, W. Schwarz, L. Du, C. K. L.Wong, J. Chen, S. Duan, V. Deubel, and B. Sun, PNAS , 1254012545 (2006).[57] K.-L. Siu, K.-S. Yuen, C. C. no Rodriguez, Z.-W. Ye, M.-L. Yeung, S.-Y. Fung, S. Yuan, C.-P. Chan, K.-Y. Yuen,L. Enjuanes, and D.-Y. Jin, FASEB J , 88658877(2019).[58] Y. Ren, T. Shu, D. Wu, J. Mu, C. Wang, M. Huang,Y. Han, X.-Y. Zhang, W. Zhou, Y. Qiu, and X. Zhou,Cellular and Molecular Immunology , 881 (2020).[59] I. Elio, M. Georgi, P. Balig, S. Tamara, and T. Sima,Msystems , e00266 (2020).[60] Y. Zhang, J. Zhang, Y. Chen, B. Luo, Y. Yuan,F. Huang, T. Yang, F. Yu, J. Liu, B. Liu, Z. Song,J. Chen, T. Pan, X. Zhang, Y. Li, R. Li, W. Huang,F. Xiao, and H. Zhang, bioRxiv (2020).[61] J.-Y. Li, C.-H. Liao, Q. Wang, Y.-J. Tan, R. Luo, Y. Qiu,and X.-Y. Ge, Virus research , 198074 (2020).[62] M. Daniele and G. F. M., Frontiers in Microbiology ,1800 (2020).[63] J. Chen, B. Malone, E. Llewellyn, M. Grasso, P. M. Shel-ton, P. D. B. Olinares, K. Maruthi, E. T. Eng, H. Vatan-daslar, B. T. Chait, et al. , Cell (2020).[64] D. Benvenuto, S. Angeletti, M. Giovanetti, M. Bianchi,S. Pascarella, R. Cauda, M. Ciccozzi, and A. Cassone,J Infect. , e24e27 (2020).[65] J. M. Smith and J. M. M. Smith, Evolution and the The-ory of Games (Cambridge university press, 1982).[66] M. A. Nowak and K. Sigmund, Science , 793 (2004).[67] J. C. Claussen and A. Traulsen, Physical review letters , 058104 (2008).[68] A. Ferguson, J. Mann, S. Omarjee, T. Ndungu,B. Walker, and A. Chakraborty, Immunity , 606(2013). [69] V. Dahirel, K. Shekhar, F. Pereyra, T. Miura, M. Arty-omov, S. Talsania, T. M. Allen, M. Altfeld, M. Carring-ton, D. J. Irvine, B. D. Walker, and A. K. Chakraborty,PNAS , 11530 (2011).[70] L. V. Tse, R. M. Meganck, R. L. Graham, and R. S.Baric, Frontiers in Microbiology , 658 (2020).[71] F. Amanat and F. Krammer, Immunity , 583 (2020).[72] T. T. Le, Z. Andreadakis, A. Kumar, R. G. Rom´an,S. Tollefsen, M. Saville, and S. Mayhew, Nat Rev DrugDiscov , 305 (2020).[73] T. T. Le, J. P. Cramer, R. Chen, and S. Mayhew, NatRev Drug Discov (2020).[74] D. Gordon, G. Jang, M. Bouhaddou, and et al. , Nature , 026002 (2020).[75] M. Kimura, Evolution , 278 (1956).[76] H.-L. Zeng, E. Mauri, S. C. Vito Dichio, R. Monasson,and E. Aurell, “Inferring epistasis from genomic data byGaussian closure,” arXiv:2006.16735 (2020).[77] M. J. Wainwright and M. I. Jordan, Foundations andTrends in Machine Learning , 1 (2008).[78] Y. Roudi, E. Aurell, and J. A. Hertz, Front. Comput.Neurosci. , 1 (2009).[79] J. S¨oding, Science , 248 (2017).[80] C.-Y. Gao, H.-J. Zhou, and E. Aurell, Phys. Rev. E , 032407 (2018).[81] J. Felsenstein, Inferring Phylogenies , Vol. 2 (MA:Sinauerassociates Sunderland, 2004).[82] N. Saitou and M. Nei, Molecular biology and evolution , 406 (1987).[83] J. Pensar, S. Puranen, B. Arnold, N. MacAlasdair,J. Kuronen, G. Tonkin-Hill, M. Pesonen, Y. Xu,A. Sipola, L. Snchez-Bus, J. A. Lees, C. Chewapreecha,S. D. Bentley, S. R. Harris, J. Parkhill, N. J. Croucher,and J. Corander, Nucleic Acids Research , e112 (2019).[84] Wikipedia Community, “Fisher’s exact test,” Wikipedia(2020).[85] Q. Peng, R. Peng, B. Yuan, J. Zhao, M. Wang, X. Wang,Q. Wang, Y. Sun, Z. Fan, J. Qi, et al. , Cell Reports ,107774 (2020).[86] Y. Ma, L. Wu, N. Shaw, Y. Gao, J. Wang, Y. Sun,Z. Lou, L. Yan, R. Zhang, and Z. Rao, Proceedingsof the National Academy of Sciences , 9436 (2015).[87] M. Romano, A. Ruggiero, F. Squeglia, G. Maga, andR. Berisio, Cells , 1267 (2020).[88] B. Korber, W. M. Fischer, S. Gnanakaran, H. Yoon,J. Theiler, W. Abfalterer, N. Hengartner, E. E. Giorgi,T. Bhattacharya, B. Foley, et al. , Cell182