[PDF] Aligning biological sequences by exploiting residue conservation and coevolution

Abstract

Sequences of nucleotides (for DNA and RNA) or amino acids (for proteins) are central objects in biology. Among the most important computational problems is that of sequence alignment, i.e. arranging sequences from different organisms in such a way to identify similar regions, to detect evolutionary relationships between sequences, and to predict biomolecular structure and function. This is typically addressed through profile models, which capture position-specificities like conservation in sequences, but assume an independent evolution of different positions. Over the last years, it has been well established that coevolution of different amino-acid positions is essential for maintaining three-dimensional structure and function. Modeling approaches based on inverse statistical physics can catch the coevolution signal in sequence ensembles; and they are now widely used in predicting protein structure, protein-protein interactions, and mutational landscapes. Here, we present DCAlign, an efficient alignment algorithm based on an approximate message-passing strategy, which is able to overcome the limitations of profile models, to include coevolution among positions in a general way, and to be therefore universally applicable to protein- and RNA-sequence alignment without the need of using complementary structural information. The potential of DCAlign is carefully explored using well-controlled simulated data, as well as real protein and RNA sequences.

Full PDF

AAligning biological sequences by exploiting residue conservation and coevolution

Anna Paola Muntoni,

1, 2, 3

Andrea Pagnani,

1, 4, 5

Martin Weigt, and Francesco Zamponi Department of Applied Science and Technology (DISAT),Politecnico di Torino,Corso Duca degli Abruzzi 24, I-10129 Torino, Italy Laboratoire de Physique de l’Ecole Normale Sup´erieure, ENS, Universit´e PSL,CNRS, Sorbonne Universit´e, Universit´e de Paris, F-75005 Paris, France Sorbonne Universit´e, CNRS, Institut de Biologie Paris Seine,Biologie Computationnelle et Quantitative LCQB, F-75005 Paris, France Italian Institute for Genomic Medicine, IRCCS Candiolo, SP-142, I-10060 Candiolo (TO) - Italy INFN, Sezione di Torino, Via Giuria 1, I-10125 Torino, Italy

Aligning biological sequences belongs to the most important problems in computational sequenceanalysis; it allows for detecting evolutionary relationships between sequences and for predictingbiomolecular structure and function. Typically this is addressed through proﬁle models, whichcapture position-speciﬁcities like conservation in sequences, but assume an independent evolutionof diﬀerent positions. RNA sequences are an exception where the coevolution of paired bases inthe secondary structure is taken into account. Over the last years, it has been well establishedthat coevolution is essential also in proteins for maintaining three-dimensional structure and func-tion; modeling approaches based on inverse statistical physics can catch the coevolution signal andare now widely used in predicting protein structure, protein-protein interactions, and mutationallandscapes. Here, we present DCAlign, an eﬃcient approach based on an approximate message-passing strategy, which is able to overcome the limitations of proﬁle models, to include generalsecond-order interactions among positions and to be therefore universally applicable to protein- andRNA-sequence alignment. The potential of our algorithm is carefully explored using well-controlledsimulated data, as well as real protein and RNA sequences.

I. INTRODUCTION

In the course of evolution, proteins undergo substan-tial changes in their amino-acid sequences, while keepingtheir three-dimensional fold structure and their biologicalfunction remarkably conserved. In computational biol-ogy, this structural and functional conservation is exten-sively used: when we can establish that two proteins arehomologous, i.e. they share some common evolutionaryancestor, properties known for one protein can be used tocomputationally annotate its homolog. Even at the ﬁneramino-acid scale, two positions detected to be homolo-gous are expected to have the same positioning insidethe protein structure, and share common functionality(e.g. active sites or binding interfaces).Detecting homology is, however, not an easy task.First, homologous proteins may share only 20% or evenless of their amino acids, the others being substituted inevolution, making the detection of similarity rather in-volved. Even worse, proteins may change their length,amino acids may be inserted into a sequence, or deletedfrom it. Just looking to a single sequence, we have noinformation on which positions might be insertions ordeletions, and which positions might be inherited fromancestral proteins, possibly undergoing amino-acid sub-stitutions.To solve this problem, sequence alignments have to beconstructed [1]. The objective of sequence alignmentsis to identify homologous positions, also called matches,along with insertions and deletions, such that the alignedsequences become as similar as possible. In this context,three frequently used, but distinct alignment problems can be identiﬁed: • Pairwise alignments compare two sequences. Un-der some simplifying assumptions, cf. below, thisproblem can be solved eﬃciently using dynamicprogramming (i.e. an iterative method similar totransfer matrices or message passing in statisticalphysics) [2, 3]. Detecting homology by pairwisealignment is limited to rather close homologs. • More distant homology can be detected using multiple-sequence alignments (MSA) of more thantwo proteins [4], which minimize the global se-quence similarities by constructing a rectangularmatrix formed by amino acids and gaps, represent-ing both insertions and deletions. The rows of thismatrix are the individual proteins, the columnsaligned positions. Besides being able to detectmore distant homology, MSA allow for identify-ing conserved positions, i.e. columns which do not(or rarely) change the amino acid. These posi-tions are typically known to be important, eitherfor the functionality (e.g. active sites in proteins)or for the thermodynamic stability of the proteinfold. Although dynamic programming methods canbe generalized from pairwise to multiple-sequencealignments, their running time becomes exponen-tial in the number of sequences to be aligned. Manyheuristic strategies have been proposed followingthe seminal ideas of [5], cf. [6–9], but the construc-tion of accurate MSA of more than about 10 se-quences remains challenging. • For larger alignments, a simple strategy is widely a r X i v : . [ q - b i o . Q M ] M a y used. Instead of constructing a globally optimalsequence alignment, sequences are aligned one byone to a well-constructed seed MSA [10, 11]. As inthe case of MSA, this strategy allows for detect-ing distant homologs and for exploiting amino-acidconservation. If the seed MSA is reasonably large( ≥ sequences) and of high quality, very largeand rather accurate alignments of up to 10 se-quences can be constructed easily. This strategyis currently the best choice for constructing largefamilies of homologous proteins [12], and is also thesubject of our work.Up to one major exception discussed below, almostall sequence-alignment methods are based on the simpli-fying assumption of independent-site evolution . In thissetting, global sequence similarity can be expressed asa sum over similarities of individual columns. In termof sequence statistics, this accounts for assuming thatthe global probability of observing some full-length se-quence can be factorized over site-speciﬁc but indepen-dent single-site probabilities, also known as proﬁle mod-els , cf. [1], which are able to capture amino-acid conserva-tion, but no correlations between positions. A successfulvariant are proﬁle Hidden Markov Models [13], which as-sume independence of matched positions, but take intoaccount that gap stretches are more likely than many in-dividual gaps to reﬂect the tendency of homologous pro-teins to accumulate in the course of evolution large-scalemodular gene rearrangements.A major advantage of proﬁle models is their compu-tational eﬃciency, as they allow for determining opti-mal alignments in polynomial time using dynamic pro-gramming , cf. [1]. They also allow to take beneﬁt fromconserved positions, which serve intuitively as anchor-ing point in aligning new sequences to the seed MSA.Variable positions contribute much less to proﬁle-basedsequence alignment.An important exception are so-called covariance mod-els for functional RNA [14]. RNA sequences are charac-terized by low sequence conservation, making alignmentvia proﬁle models unreliable, but highly conserved sec-ondary structures, due to base pairing inside the single-stranded RNA molecules, and the formation of local he-lices, the so-called stems. Base pairing does not poseconstraints on the individual bases, but on the correctpairing in Watson-Crick pairs A:U and G:C, or wobblepairs G:U, which consequently have to be described bya non-factorisable pair distribution. In the case of RNA,the planar structure of the graph formed by the RNAchains and the base-pairings still enable the applicationof exact but computationally eﬃcient dynamic program-ming [14–17].In the last decades, amino-acid covariation, or coevo-lution, has been established as a statistically importantfeature of protein evolution [18]. Modeling MSA statis-tics via methods related to the so-called Direct-CouplingAnalysis (DCA) [19–22], inspired by inverse statisticalphysics [23–26], has found widespread applications in protein-structure prediction from sequence [27–30], de-tection of protein-protein interactions [19, 31–37], of mu-tational eﬀects in proteins [38–42] or even in data-drivensequence optimization and design [39, 43, 44]. DCA isbased on the assumption of a generalized Potts modelwith disordered pairwise couplings between sites.The resulting situation is somewhat paradoxical: pair-wise Potts models are inferred from MSA, but MSAare constructed using an independent-site assumption.Important structural and functional information is con-tained in the covariation of amino-acids, but it is ne-glected in the alignment procedures used for proteins.Our work aims at overcoming this paradox, by in-cluding the information contained in amino-acid (or nu-cleotide) covariation in aligning sequences to a seed MSA.This idea shows important similarity to that of covariancemodels and RNA alignment, but the lack of planarity ofthe underlying fully connected DCA models makes an ap-plication of dynamic programming impossible. We copewith this problem by proposing an approximate messagepassing strategy based on Belief Propagation [45], fur-ther simpliﬁed in a high-connectivity mean-ﬁeld limit forlong-range couplings [46].The plan of the paper is the following: we ﬁrst for-malize the problem and its statistical-physics descrip-tion. The latter allows us to derive DCAlign, a combinedbelief-propagation / mean-ﬁeld algorithm for aligning asequence to a Potts model constructed by DCA from theseed MSA. The eﬃciency of our algorithm is ﬁrst testedin the case of artiﬁcial data, which allow us to evaluatethe inﬂuence of conservation (i.e. single-site statistics)and covariation (i.e. two-site couplings) in the alignmentprocedure. Extensive tests and positive results are givenfor a number of real protein and RNA families. Technicaldetails of the derivation of the algorithm are provided inthe Supplementary Information (SI). II. SETUP OF THE PROBLEMA. Alignment

The method we are going to describe can be applied toalign diﬀerent types of biological sequences (viz. DNA,RNA, proteins). We discuss here the protein case, butthe extension to the other cases is straightforward and itwill be considered below. Let us consider an amino-acidsequence A = ( A , . . . , A N ) containing a protein domain S = ( S , ..., S L ) of a known family, which we want toidentify. Note that S may contain amino acids and gaps,while the original sequence A is composed exclusively byamino acids. We assume that the protein family is welldescribed by a Direct Coupling Analysis (DCA) model,or Potts Hamiltonian, or simply “energy”, H DCA ( S | J , h ) = − ,L (cid:88) i

Example of an alignment.

On the top, we showa full length sequence A , and on the bottom its alignment S ,in which both gap and insertion events occur. The domainto be aligned is highlighted in blue. We show explicitly threematched states S = A , S = A and S = A and a gapinsertion in S =“–”. In this example we also show a possibleway of skipping some amino-acids in the original sequence,that is to assign three insertions, highlighted in red. being H gap ( µ ) a penalty for adding gaps that de-pends on the number and position of gaps andon the hyper-parameters µ , and H ins ( λ ) being apenalty on insertions, parametrized by the hyper-parameters λ .We ﬁrst consider here the case where N (cid:38) L , i.e. weare trying to align a domain in a longer sequence, or N < L when we are trying to align a fragment. Thecase N (cid:29) L , i.e. when we search for a hit of the DCAmodel in a long sequence, may be computationally hardfor the approach proposed here, because the alignmenttime scales roughly as L N , as discussed below. Cur-rent state-of-the-art alignment methods like BLAST useheuristics to approximately locate possible hits, and per-form accurate alignment search only in these restrictedregions, to speed up search. It might be necessary to dothis before running DCAlign, but this is not the objec-tive of the current work. In general, it will be better if N is not too diﬀerent from L .An example of a full sequence and its alignment is givenin Fig. 1. The sequence on the top is a full sequenceof N = 19 amino-acids, and the highlighted part is thetarget domain. The aligned sequence of length L = 10,reported on the bottom, consists in one gap in position2 and 9 matched amino-acids. B. Gap and insertion penalties

The hyper-parameters µ , λ determine the cost ofadding a gap or an insertion in the aligned sequence.They must be carefully determined to allow for theseevents without aﬀecting the quality of the alignment. Inother words, we would like to reduce, as much as possi-ble, the number of gaps (when the statistics of gaps ofthe seed we use is biased) and to parsimoniously add in-sertions when energetically favorable, avoiding to pick upisolated amino acids in the alignment.To deal with insertions we use a so-called aﬃne penaltyfunction [1] parametrized by λ oi , the cost of adding a ﬁrstinsertion between positions i − i , and λ ei , the costof extending an existing insertion, with λ oi > λ ei . Thisresults in H ins ( λ ) = L (cid:88) i =2 ϕ i (∆ n i ) ,ϕ i (∆ n i ) = (1 − δ ∆ n i , ) (cid:2) λ io + λ ie (∆ n i − (cid:3) , (3)where ∆ n i is the number of insertions between positions i − i . This set of parameters can be learned froma seed alignment through a Maximum Likelihood (ML)approach as reported in Sec. IV B. Finally, we introducetwo types of gap penalties, denoted by µ int and µ ext ,which are associated with an “internal” gap between twomatched states and with an “external” gap (at the begin-ning and at the end of the aligned sequence), respectively.This gives H gap ( µ ) = L (cid:88) i =1 µ i , (4)where µ i = 0 for match states, µ i = µ int for internal gapsand µ i = µ ext for external gaps.An illustration is given by the aligned sequence inFig. 1. Insertions are highlighted in red, and the totalinsertions penalty is then given by λ o + λ e + λ e . A gap,which increases the total energy by µ int , is highlighted ingreen at position 2 of S . C. Statistical physics model

We now want to construct a discrete statistical-physicsmodel which deﬁnes this alignment. For the positions1 ≤ i ≤ L , the model has to encode the position of thegaps and of the match states, with their correspondingsymbol in the sequence ( A , ..., A N ). We therefore intro-duce two variables per site 1 ≤ i ≤ L . The ﬁrst one is aboolean “spin” x i ∈ { , } , which tells us if S i is a gap( x i = 0) or an amino-acid match ( x i = 1). The secondone is a “pointer” n i ∈ { , ..., N + 1 } , which gives, for thecase of match states x i = 1 and 1 ≤ n i ≤ N , the corre-sponding position in the original sequence ( A , ..., A N );note that this allows for insertions if n i +1 − n i >

1. Ifmatched symbols start to appear only from a position i >

1, we then ﬁll the previous positions { j : 1 ≤ j < i } with gaps having pointer n j = 0. Similarly, if the lastmatched state appears in i < L we ﬁll a stretch of gapsin positions { j : i < j ≤ L } having n j = N + 1. Thisencoding allows one to distinguish the “external” gaps atthe boundary of the aligned sequence, whose total num-ber we denote as N extgap , from the “internal” ones, i.e. be-tween two consecutive matched states, whose total num-ber is N intgap . Formally, the number of gaps and insertions are N ins = L − (cid:88) i =1 ( n i +1 − n i − I [ N + 1 > n i +1 > n i > ,N intgap = L (cid:88) i =1 δ x i , I [0 < n i < N + 1] , (5) N extgap = L (cid:88) i =1 δ x i , ( δ n i , + δ n i ,N +1 ) , where I [ E ] is the indicator function of the event E . Intro-ducing the short-hand notation A = “–” (gap), a modelconﬁguration ( x , ..., x L , n , ..., n L ) results in an alignedsequence ( S , ..., S L ) = ( A x · n , ..., A x L · n L ). The auxil-iary variables ( x , n ) must be additionally assigned suchthat the positional constraints illustrated in Fig. 1 aresatisﬁed, i.e. the target sub-sequence must be ordered,as we now describe.First of all, in order to correctly set the pointers inpresence of gaps in the ﬁrst and last positions, it is suf-ﬁcient to set the state of node i = 1 as n = 0 if x = 0 ,N + 1 > n > x = 1 , (6)and the state of node i = L as n L = N + 1 if x L = 0 ,N + 1 > n L > x L = 1 . (7)These properties can be formally expressed by the fol-lowing two single-position constraints χ in ( x , n ) = δ x , δ n , + δ x , (1 − δ n , )(1 − δ n ,N +1 ) ,χ end ( x L , n L ) = δ x L , δ n L ,N +1 + δ x L , (1 − δ n L , )(1 − δ n L ,N +1 ) . (8)Next, we need to locally impose that, for each position1 < i < L , n i = n i − if x i = 0 and n i < N + 1 ,n i > n i − if x i = 1 or n i = N + 1 , (9)i.e. the pointer n i remains constant when x i = 0,and it jumps to any later position in n i − + 1 , ..., N if x i = 1. This jump, besides determining the amino-acid S i to be placed in position i , also allows for identify-ing inserts according to Eq. (5). A pictorial representa-tion of this constraint is shown in Fig. 2. We can for-mally encode these constraints in a “short-range” func-tion χ sr ( x i − , n i − , x i , n i ) that, for each pair of consecu-tive positions ( i − , i ), indicates the feasible/unfeasibleconﬁgurations of the variables ( x i − , n i − , x i , n i ) and the ---- A ni-1 x i-1 = 1n i-1 = n x i = 0n i = n ---- x i-1 = 1n i-1 = n x i = 1n i > n -A ni-1 A ni GCFGYKPKG x GCFGYKPKG ... (a) (b)

FIG. 2.

Short range constraints.

We plot in (a) a feasibleassignment of two consecutive matched states. If in position i − S i − = A n , we can then align the next position i to one of the possible amino-acids S i ∈ { A n +1 , ..., A N } . Asa consequence ( x i − , x i ) = (1 , n i − = n while n i ∈ { n +1 , ..., N } . In (b) we plot a feasible inclusion of a gap in position i . If the previous site i − n i − = n , we then assign( x i , n i ) = (0 , n ) to keep memory of the aligned sequence. Inthe next position, i +1, we can match an amino-acid in furtherpositions (according to the constraint in (a)) or add anothergap with pointer n i +1 = n . associated cost of insertions, as χ sr (0 , n i − , , n i ) = I ( n i = n i − ) ,χ sr (1 , n i − , , n i ) = I ( n i = n i − ∨ n i = N + 1) ,χ sr (0 , n i − , , n i ) = e − ϕ i (∆ n i ) I ( n i − > ×× I (0 ≤ n i − < n i < N + 1) ,χ sr (1 , n i − , , n i ) = e − ϕ i (∆ n i ) ×× I (0 < n i − < n i < N + 1) , (10)where the function ϕ i (∆ n i ) is the contribution of the i − th position to the aﬃne insertion penalty, as given inEq. (3) with ∆ n i = n i − n i − −

1. Note that combin-ing the constraints in Eq. (8) and Eq. (10), positions { j > } ( { j < L } ) can either have a gap with n j = 0( n j = N + 1) or the ﬁrst (last) match at any position n j > n j < N + 1) with no insert penalty.Finally, gap penalties can be encoded in a single-variable weight, χ gap ( x i , n i ) = e − (1 − x i ) µ ( n i ) , (11)with µ ( n ) = (cid:40) µ ext n = 0 ∨ n = N + 1 ,µ int ≤ n ≤ N . (12)The DCA Hamiltonian can be rewritten in terms ofthe auxiliary variables as H DCA ( x , n | J , h ) = − (cid:88) i I ( N + 1 > n i ) . (14)Collecting all these deﬁnitions together, we can associatea Boltzmann weight W ( x , n ) with each possible align-ment ( x , n ) of a given sequence A , which takes into ac-count all energetic contributions for feasible assignmentsonly, W ( x , n ) = 1 Z e −H DCA ( x , n ) χ in ( x , n ) χ end ( x L , n L ) L (cid:89) i =2 χ sr ( x i − , n i − , x i , n i ) L (cid:89) i =1 χ gap ( x i , n i ) , (15)where Z is the partition function. Note that the “hardconstraints” χ sr , which can set the weight to zero, liveonly on the edges of the linear chain 1 , ..., L , while theinteractions J ij are in principle fully connected.Finally, we can map the original minimization problemin Eq. (2) as the statistical physics problem of ﬁnding thebest assignment of the variables ( x ∗ , n ∗ ) that maximizesthe Boltzmann distribution:( x ∗ , n ∗ ) = argmax ( x , n ) W ( x , n ) . (16)Alternatively, we could obtain an optimal alignment( x ∗ , n ∗ ) from an equilibrium sampling of alignments withweight W ( x , n ). Unfortunately, both sampling from W ( x , n ) and identifying the constrained optimal assign-ment are hard and intractable problems. Note that thespace of possible assignments has dimension scaling as( N + 2) L , which grows extremely quickly with N and L .For comparison, the DCA problem is deﬁned in a spacegrowing “only” as q L . However, some approximations in-spired by statistical physics can be exploited for seekingan approximate solution. III. ADVANCED MEAN-FIELDAPPROXIMATION

A straightforward approach to make this problemtractable is to use message-passing approximations ofthe marginal probabilities P i ( x i , n i ) of Eq. (15), suchas Belief Propagation (BP), which are exact for prob-lems deﬁned on graphs without loops. Note that BP isalso exact on linear chains, for which it coincides withthe transfer matrix method (or dynamic programming /forward-backward algorithm [1]). One can think to BPas treating exactly the linear chain 1 , · · · , L , while thelonger-range interactions are approximated by message-passing. In the case of vanishing couplings J ij ( A, B ) ≡ i, j such that | i − j | > A, B , i.e. in thecase of a model with nearest-neighbor interactions only,this formulation is exact; if all couplings vanish, it is quitesimilar to a proﬁle Hidden Markov Model, which has thesame penalties for opening and extending a sequence ofinsertions. However, our interactions are instead typi-cally very dense (all couplings are non-zero, hence theassociated graph is very loopy), but weak. We can thusconsider a further approximation of BP [46], in whichthe linear chain 1 , · · · , L is still treated exactly, while thecontribution of more distant sites is approximated viamean ﬁeld (MF), in a way similar to Thouless-Anderson-Palmer (TAP) equations [48], also known as ApproximateMessage Passing (AMP) equations. We refer to this ap-proach simply as MF in the following. In the rest of thissection we derive the BP and MF equations. A. Transfer matrix equations for the linear chain

Suppose ﬁrst that only nearest-neighbor couplings J i,i +1 ( A, B ) are non-zero. In this case, the problem isexactly solved by the transfer matrix method, which cor-responds to a set of recursive equations for the “forwardmessages” F i ( x i , n i ) = F i → i +1 ( x i , n i ), i.e. the probabil-ity distribution of site i in absence of the link ( i, i + 1),and “backward messages” B i ( x i , n i ) = B i → i − ( x i , n i ),i.e. the probability distribution of site i in absence of thelink ( i − , i ).We give here the transfer matrix equations for the for-ward and backward messages. For compactness we deﬁnethe single-site weight contribution to Eq. (15): W ( x , n ) = χ in ( x , n ) e h ( A x · n ) − (1 − x ) µ ( n ) , W i ( x i , n i ) = e h i ( A xi · ni ) − (1 − x i ) µ ( n i ) , (17) W L ( x L , n L ) = χ end ( x L , n L ) e h L ( A xL · nL ) − (1 − x L ) µ ( n L ) , where the second line is for i = 2 , · · · , L −

1. We thenhave for the forward messages: F ( x , n ) = 1 f W ( x , n ) ,F i ( x i , n i ) = 1 f i W i ( x i , n i ) F i ( x i , n i ) , F i ( x i , n i ) = (cid:88) x i − ,n i − F i − ( x i − , n i − ) ×× e J i − ,i ( A xi − · ni − ,A xi · ni ) χ sr ( x i − , n i − , x i , n i ) , (18)where F i is deﬁned for i = 1 , · · · , L − F i for i = 2 , · · · , L , and the f i are normalization constants de-termined by the requirement that messages are normal- i = 1 i = 2 , · · · , L − i = LP = z W B P i = z i W i F i B i P L = z L W L F L F = f W F i = f i W i F i —— B i = b i W i B i B L = b L W L TABLE I. Schematic summary of the transfer matrix andmean ﬁeld equations, which are complemented by the recur-rence equations for F i and B i given in Eqs. (18) and (19). Formean ﬁeld one should replace W i → C i . ized to one. For the backward messages we have B L ( x L , n L ) = 1 b L W L ( x L , n L ) ,B i ( x i , n i ) = 1 b i W i ( x i , n i ) B i ( x i , n i ) , B i ( x i , n i ) = (cid:88) x i +1 ,n i +1 B i +1 ( x i +1 , n i +1 ) ×× e J i,i +1 ( A xi · ni ,A xi +1 · ni +1 ) χ sr ( x i , n i , x i +1 , n i +1 ) , (19)where B i is deﬁned for i = L, L − , · · · , B i for i = L − , · · · ,

1, and b i are normalization constants.Finally, the marginal probabilities are given by P ( x , n ) = 1 z W ( x , n ) B ( x , n ) ,P i ( x i , n i ) = 1 z i W i ( x i , n i ) F i ( x i , n i ) B i ( x i , n i ) , (20) P L ( x L , n L ) = 1 z L W L ( x L , n L ) F L ( x L , n L ) . These equations are summarized in compact form in Ta-ble I. They can be easily implemented on a computerand solved in a time scaling as

L N . B. Long range interactions

We now discuss the inclusion of long-range interactionsin the transfer matrix scheme. In order to treat correctlythe long-range interaction in BP, it is important to notethat the same “light-cone” condition expressed by theconstraint χ sr in Eq. (10) holds between any pair ( i, j ).However, this condition would be violated by the mes-sages of BP due to their approximate character on loopygraphs. In order to enforce it, we can introduce a newconstraint χ lr ( x i , n i , x j , n j ) = (21) I [ i > j + 1] { δ x i , I [ n i ≥ n j ] + δ x i , I [ n i > n j ] } + I [ i < j − (cid:8) δ x j , I [ n i ≤ n j ] + δ x j , I [ n i < n j ] (cid:9) . Because this constraint is redundant with respect toEq. (10), it can be added without changing the weight: W ( x , n ) = W ( x , n ) × (cid:89) i W i → C i , with C i ( x i , n i ) = W i ( x i , n i ) exp  (cid:88) j / ∈{ i,i ± } (cid:88) x j ,n j χ lr ( x i , n i , x j , n j ) J i,j ( A x i · n i , A x j · n j ) P j ( x j , n j )  . (23)As a result, the mean ﬁeld equations have the same com-plexity as the transfer matrix equations ( LN ) with anadditional factor LN needed to compute each C i , result-ing in an overall complexity L N . C. Assignment

After solution of the MF equations, from the marginalprobabilities { P ( x , n ) , ..., P L ( x L , n L ) } we have to ﬁndthe most probable assignment ( x ∗ , n ∗ ), as deﬁned inEq. (16). The simplest way to do so is to assign toeach position i the most probable state according to itsmarginal, i.e.( x ∗ i , n ∗ i ) = argmax x i ,n i P i ( x i , n i ) , (24)which is possible whenever the obtained assignment sat-isﬁes all the hard constraints. However, in some cases,the set of locally optimal positions do not satisfy theshort-range constraints due to the approximate nature ofthe MF solution. We then perform a maximization step,in which we select the position i ∗ and the local assign-ment ( x ∗ i ∗ , n ∗ i ∗ ) having the largest probability among allthe marginals, i.e.( i ∗ , x ∗ i ∗ , n ∗ i ∗ ) = argmax i,x i ,n i { P ( x , n ) , ..., P L ( x L , n L ) } . (25)We then set the state of site i ∗ in ( x ∗ i ∗ , n ∗ i ∗ ), and we pro-ceed with a ﬁltering step, in which we set to zero themarginal probabilities of the incompatible states of theﬁrst nearest neighbors of i ∗ . In practice, we multiplythe marginals by the short-range constraints computedat ( x ∗ i ∗ , n ∗ i ∗ ), i.e. we consider the new marginals on sites i ∗ ± P i ∗ − ( x i ∗ − , n i ∗ − ) χ sr ( x i ∗ − , n i ∗ − , n ∗ i ∗ , x ∗ i ∗ ) ,P i ∗ +1 ( x i ∗ +1 , n i ∗ +1 ) χ sr ( x ∗ i ∗ , n ∗ i ∗ , x i ∗ +1 , n i ∗ +1 ) . (26)We can now repeat the maximization step in order to ﬁndthe state ( x ∗ , n ∗ ) that maximizes the joint set of proba-bilities for the positions adjacent to the already alignedpart of the sequence (in this case i ∗ − i ∗ +1, because we only ﬁxed i ∗ ),( x ∗ , n ∗ ) = argmax x,n { P i ∗ − ( x, n ) , P i ∗ +1 ( x, n ) } , (27)and we ﬁx this state, in the alignment, in the right posi-tion (either i ∗ + 1 or i ∗ − i ∗ + 1; we nowﬁlter the probability of i ∗ +2 and repeat the choice for thenext ( x ∗ , n ∗ ) considering the set of (modiﬁed) marginals { P i ∗ − ( x, n ) , P i ∗ +2 ( x, n ) } . The procedure is repeated un-til all the L positions are determined.Note that this scheme is somehow greedy, because theassignments are decided step by step, and are constrainedby the choices made in the previous positions. Still, theassignment is guided by the marginal probabilities ob-tained from considering the global energy function andall the hard constraints. Moreover, this assignment pro-cedure is as fast as the “max-marginals” scheme becauseit does not require to re-run the update of the equationsin Sec. III, and thanks to the step-by-step ﬁltering of themarginals it ensures an outcome that is always compati-ble with the constraints. D. Discussion

In this section we presented a set of approximate equa-tions and an assignment procedure that, together, al-low us to solve the alignment problem in polynomialtime. The BP equations have a computational complex-ity proportional to L N , while the MF equations scaleas L N . In both cases, the equations can be solved at“temperature” equal to one, corresponding to a Boltz-mann equilibrium sampling from the weight in Eq. (15),or at zero temperature, corresponding to ﬁnding (approx-imately) the most likely assignment in Eq. (15) (the fullset of equations at zero temperature is reported in theSI). Of course, any intermediate temperature could alsobe considered, but we do not explore other values of tem-perature in this work.In all cases, one can compute the free energy associ-ated with the BP or MF solution, which gives a “score”measuring the quality of the alignment. This score couldbe used, in long sequences with multiple hits, to decidea “best hit”. The expression of the free energy is givenin the SI.The MF equations are derived from the BP equa-tions by assuming that all couplings with | i − j | > || J ij || < K are treatedin mean ﬁeld, while stronger couplings with || J ij || ≥ K are treated with BP, for a given threshold K . The case K = 0 corresponds to pure BP, while the case K → ∞ corresponds to pure MF. One could check whether anoptimal value of K exists. We leave this for future work. IV. LEARNING THE MODEL

We now discuss the learning of the model parameters,namely the couplings and ﬁelds of the DCA Hamiltonian,and the hyper-parameters µ , λ . A. Potts model

Our alignment method is able to cope with diﬀerentcost functions, because the implementation of the updateequations described in Sec. III B is as general as possible.Introducing a 5-state alphabet, the method is also ableto treat RNA alignments.In this work, we tested several types of Maximum En-tropy models for DCA, which diﬀer in the choice of ﬁttedobservables. The usual Potts model, in which all ﬁrst andsecond moments of the seed MSA are ﬁtted, is labeled as potts , while we also consider a “pseudo” Hidden MarkovModel ( phmm ), with Hamiltonian H phmm ( S | J , h ) == − (cid:88) i,i +1 J i,i +1 δ S i , − δ S i +1 , − − (cid:88) i h i ( S i ) . (28)The phmm can be thought of as a proﬁle model playingthe role of the emission probabilities of a Hidden MarkovModel (HMM), plus a pairwise interactions J i,i +1 be-tween neighboring gaps. This interaction is related, inour mapping to a HMM, to the probability of switch-ing, between two consecutive positions, from a “gap”to a “match” state and vice-versa. We also consideredother variations of the Potts model, such as a model inwhich we do not ﬁt the second moment statistics of non-neighboring gaps (i.e. long-range gap-gap couplings areset to zero). The motivation behind this choice is that, ifDCA couplings are interpreted as predictors for the (con-served) three dimensional structure, gapped states do notcarry any information about co-evolution of far-away po-sitions. However, we found that these other variations do not bring additional insight with respect to the potts and phmm models, so we restrict here the presentationto these two choices.All these models are learned on a seed alignment us-ing a standard Boltzmann machine DCA learning algo-rithm [47]. We used a constant learning rate of 5 · − formost protein families, and 10 − for all RNA families andfor the longest protein families we used. Because the seedoften contains very few sequences, we need to introducea small pseudo-count of 10 − to take into account non-observed empirical second moments. The Boltzmann ma-chine performs a Monte Carlo sampling of the model us-ing 1000 independent chains and sampling 50 points foreach chain (in total the statistics is thus computed us-ing 5 · samples). Equilibration and auto-correlationtests are performed to increase or decrease, if needed, theequilibration or the sampling time of the Monte Carlo.It is important to keep in mind that the models in-ferred from the seeds are “non-generative” because of thereduced number of sequences (samples generated fromthese models, due to a strong over-ﬁtting, are extremelyclose to seed sequences), but nonetheless they are accu-rate enough to be used as proper cost functions for ouralignment tool. We also mention that models inferredfrom Pseudo-Likelihood maximization [21, 49], which arealso known to be non-generative [47], can be equivalentlyused for the alignment method described in this work. B. Insertion penalties

We determine the parameters of the aﬃne insertionpenalties using the statistics of the insertions of the seedalignment. Recall that the number of insertions betweenpositions i and i − n = n i − n i − −

1, as illustrated inFig. 3. Motivated by the empirical statistics of insertionsin true seeds, we model the probability of ∆ n as P i (∆ n ) = (cid:40) z , ∆ n = 0 , e − λio − λie (∆ n − z , ∆ n > , (29)where z = 1 + (cid:88) ∆ n> e − λ io − λ ie (∆ n − = 1 + e − λ io − e − λ ie (30)is the normalization constant and λ io , λ ie are the costs as-sociated with the opening and the extension of an inser-tion as in the score function deﬁned in Eq. (3). Becausethe learning of the parameters is done independently foreach position i , for the sake of simplicity we will drop theindex i in the following.We determine the values of λ o , λ e by maximizing thelikelihood L ( { ∆ n } Ma =1 | λ o , λ e ) of the data, i.e. the M sequences of the seed, given the parameters, and addingL2-regularization terms in order to avoid inﬁnite or un-determined parameters. Imposing the zero-gradient con-dition on the likelihood leads to a closed set of non-linear (a) (b) (c) FIG. 3.

Inference of Insertion penalties . (a) Schematic representation of the ∆ n variables: the number of insertionsbetween two consecutive positions in the alignment can be computed from the pointer variables n . (b),(c) Examples of ﬁttingof the empirical probability of ∆ n using our maximum likelihood approach for (b) position 25 and (c) position 92 for RF00162.In (c) the data distribution does not show an exponential proﬁle but our approximation ﬁts well the empirical probability formost of the observed ∆ n . equations for the maximum likelihood estimators, givenin the SI. These equations can be solved, for example, bya gradient ascent scheme in which we iteratively update λ t +1 ← λ t + η ∂ L t ∂λ , (31)for both λ o and λ e , until numerical convergence (moreprecisely, when the absolute value of the gradient is lessthen 10 − ). The learning rate is η = 10 − . Note that theempirical distribution can diﬀer from that of our model:for instance, we often encounter positions where eitherno insertion is present within the seed or the distributionof the positive ∆ n is not exponential. In the ﬁrst case,our maximum likelihood approach cannot be applied asit is: in order to apply it we pretend that the probabilityof observing at least one insertion is equal to a smallparameter (cid:15) so that we can slightly modify P seed (∆ n =0) = 1 − (cid:15) . In our work this parameter has been set to (cid:15) =10 − . In the second case we notice that the distributiongiven by the ﬁt is anyway a nice approximation of thetrue one. Some examples are reported in Fig. 3. C. Gap penalties

The gap state is treated in DCA models as an addi-tional amino acid but, by construction of the MSA, it isactually an ad hoc symbol used to ﬁll the vacant posi-tions between well-aligned amino acids that are close inthe full-length sequence A and should be more distant inthe aligned sequence S . Thus the proper number of gapsfor each candidate alignment is often sequence dependentand not family dependent: the one-point and two-pointsstatistics of gaps computed from the seed may not berepresentative of the full alignment statistics. Yet, thecouplings and the ﬁelds of the DCA models learned fromthe seed tend to place gaps in the positions mostly oc-cupied by gaps in the seed. This may lead to some bias depending on the seed construction: we notice that ifwe create seeds using randomly chosen subsets of Pfam[12] alignments produced by HMMer [11], our alignmentmethod, DCAlign, is likely to produce very gapped se-quences. In these cases gaps appear very often, more of-ten than any other amino acid, indicating that our costfunction encourages the presence of gaps. Real seeds areinstead manually curated and therefore they generallycontain few gaps. Even though the Potts models learnedfrom this kind of seeds are less biased, we anyway needto check whether the issue exists and, if needed, treat it.To do so, our idea is to introduce additional penalties togap states, µ ext and µ int , as we discussed in Sec. II B inthe deﬁnition of the cost function. Notice that the dis-tinction between “internal” and “external” gap penaltiesallows us to diﬀerentiate between gaps that are artiﬁ-cially introduced (as in the case of the internal ones) andgaps that reﬂect the presence of well aligned but shorterdomains or fragments, of eﬀective length L frag < L . Inthis last case some “external” gaps are needed to ﬁll the L − L frag positions at the beginning or at the end of thealigned sequence.Contrarily to the insertions penalties, the gap penal-ties cannot be directly learned from the seed alignmentvia an unsupervised training (as their statistics is alreadyincluded in the Potts model to begin with), but they canbe learned in a supervised way. A straightforward pro-cedure consists in re-aligning the seed sequences usingthe insertions penalties and the DCA models ( potts or phmm ) described in Sec. IV A for several values of µ ext and µ int . The best values of the gap parameters are thosethat minimize the average Hamming distance betweenthe re-aligned seed and the original seed sequences. Weperformed this supervised learning by setting the valuesof µ ext ∈ [0 . , .

00] and µ int ∈ [0 . , . . , . M > ) but it failscompletely when dealing with “small” seeds. In thiscase, whatever the value of ( µ ext , µ int ), the re-alignedsequences are always identical to the original ones, re-sulting in an average Hamming distance equal to zero.Indeed, the energy landscape of models learned fromfew sequences is populated by very isolated and deeplocal minima centered in the seed sequences. Whenthe algorithm tries to re-align an element of the train-ing set, it is able to perfectly minimize the local energyand re-align the sequence with no error whatever theadditional gap penalty. For short seeds, instead of re-aligning the seed, we thus extract 1500 sequences fromthe full set of unaligned sequences (a validation set),which we align by varying the gap penalties, always inthe range [0 . , . H the DCA Hamil-tonian inferred on the seed, and we infer new Hamilto-nians H type,µ ext ,µ int (with type ∈ { phmm, potts, } ) on allthe multiple sequence alignments of the validation set.We then choose the best parameters µ ext and µ int as thosethat minimize the symmetric Kullback-Leibler distancebetween H seed and H type,µ ext ,µ int (the precise deﬁnitionis given in section V). In other words, we select the bestgap penalties as those that produce a validation MSA asstatistically close as possible to the seed alignment.We underline that the values of the penalty parame-ters also depend on the choice of the gauge for the DCAparameters: in fact, the advanced mean-ﬁeld equationsare not gauge invariant and depending on the choice ofthe gauge we can have diﬀerent (optimal) values for thegap penalties. All results shown in this work use thezero-sum gauge for the DCA parameters. V. EXPERIMENTAL SETUPA. Pipeline

The experimental setting we propose here is the sameadopted by state-of-the-art alignment softwares such asHMMer [50], and Infernal [14]. From the seed alignmentwe learn all the parameters that characterize our scorefunction and we apply our alignment tool to all the un-aligned sequences that contain a domain compatible withthe chosen family. More precisely, once a seed is selected(we used the hmmbuild -O function of the package HM-Mer to obtain the aligned seed), we learn the model, the insertions parameters and the gap penalties as describedin Sec. IV. A scheme of our training method is shown inFig. 4.After training, our cost function is fully determined.Unaligned sequences are then taken from the full lengthsequences of the corresponding family in the Pfamdatabase [12] (we do not face here the problem of de-tecting homologous sequences). Note that, like HMMer,we also include the seed sequences in the sequences tobe aligned, in order to obtain a more homogeneous MSAand test the quality of the re-alignment of the seed. Wedo not consider the entire sequences, whose length N isoften much larger than L , but a “neighborhood” of thehit selected by HMMer. In practice we add 20 amino-acids at the beginning and at the end of the hit resultingin a ﬁnal length N = 20 + L + 20. We performed thesame experiment using N = 50 + L + 50 for PF00684 andthe resulting sequences were identical to those obtainedfrom a shorter hit. The method seems to be stable forreasonable values of N , i.e. N ∼ L . For RNA sequences,this pre-processing is not needed, because the full lengthsequences downloaded from Rfam already have a reason-able length. For most families, we have aligned all thefull length sequences (the size of the test sets is speci-ﬁed in Table II) and only in few cases, for particularlylarge families, we uniformly pick at random N seq = 10 sequences to align.We then apply DCAlign using the approximationswe discussed in Sec. III, namely the ﬁnite temperatureand zero-temperature MF method, to each candidate se-quence and we add to our MSA the aligned sub-sequencethat has the lower energy (insertion and gap penalties ex-cluded). Whenever our algorithms do not converge to anassignment of the variables that satisﬁes all the hard con-straints, we apply the “nucleation” procedure explainedin Sec. III C that, by means of the approximated marginalprobabilities, gives rise to feasible alignments. B. Observables

To assess the quality of the MSAs generated byDCAlign (that diﬀer in the score function being used toalign) and to compare them to the state-of-the-art align-ments provided by HMMer (or Infernal), we consider thefollowing observables. • Sequence-based metrics.

When comparing two can-didate MSAs of the same set of sequences (a “ref-erence” and a “target”), it is possible to computeseveral sequence-wise measures such as the follow-ing metrics (normalized by L , the length of the se-quences): – the Hamming distance between the two align-ments of the same sequence in the referenceand target MSAs; – Gap + : the number of match states in thealigned sequence of the reference MSA that1 accgw.....ari..wcgp........... hmmbuild -O Pfam seeds potts phmm

Q5WH09_BACSK/3-86 TGIIEEMGKV.A..SV..SKKGKAMT...LAVM1E8V5_9FIRM/3-86 TGIIEDVGQI.I..SF..SPQGYGYK...IKIR7MIT0_9CLOT/9-92 TGIVEEIGIV.K..SL..EFKANGAKI..VIG ...

Q3B4X5_CHLL7/3-88 TGIIKDVGRV.Q..GLSRQK..GGLRLRVRHVA5VJW8_LACRD/3-87 SGLVAGVGRI.TEINLQGET......... ... ... >Q5WH09_BACSKMFTGIIEEMGKVASVSKKGKAMTLAVNCKTVLADAAIGDSIAVNGADVMPETYEATTIKQLVPGQFVNLERAMAVGSRFGGHIISGHVDGLFEVEADRELVRYIVQKGSIAVDGTSLTVF ... >A5VJW8_LACRDMFSGLVAGVGRITEINLQGETIELAVTPRIDDFLTTVQLGDSIAVFTTTLMPQTFEKTTFKNLEKGSLVNLERSLQVGARLEGHIVTGHV

Boltzmann ext e x t Full length seed (Meff >> 1) ... ... ... D i s t an c e learning (a)(b) FIG. 4.

Scheme of the training process for DCAlign.

In (a) we show the ﬁrst step of the learning. We build the alignedseed of a Pfam family using hmmbuild -O to detect the matched amino-acids (blue line) and the insertions (shown in lowercase letters and “.”). From these data we learn the DCA Hamiltonian and the aﬃne insertion penalties. In (b) we pictoriallydescribe the learning of the gap penalties. Here we take into account the seed itself (not only the aligned part but the entiresequences) and we try to align it using the parameters inferred in step (a) using all the 81 possible combinations of µ ext and µ int , each spanning the interval [0 . , . have been replaced by a gap in the targetMSA; – Gap − : the number of gaps in the aligned se-quence of the reference MSA that have beenreplaced by match states in the target MSA; – Mismatch: the number of amino-acid mis-matches, that is the number of times we havematch states in both sequences, reference andtarget, but to diﬀerent amino acids (positions)in the full sequence A . • Proximity measure.

Consider two diﬀerent MSAsof the same protein or RNA family. We can com-pute, for each candidate sequence S i of the ﬁrstset, the Hamming distance d H with respect to allthe sequences of the second set. We then collectthe minimum attained value, i.e.ˆ d i = min j d H ( S i , S j ) , (32)which gives the distance to the closest sequence inthe other MSA. The distribution of ˆ d i , or some sta-tistical quantity computed from them (such as theaverage or the median value) provides a good mea-sure of “proximity” between the two sets. We willshow below a few examples using the full alignmentof a protein family as a ﬁrst set, and the seed se-quences as the second one. • Symmetric Kullback-Leibler distance . Another con-venient global measure is the symmetric Kullback-Leibler distance between a Boltzmann equilibriummodel learned from the seed alignment (the “seed”model) and another model learned from a candidateMSA (the “test” model). In general the symmetricKullback-Leibler divergence is a measure of “dis-tance” between two probability distributions and itis deﬁned, for arbitrary densities P ( x ) and P ( x )of the variables x , as D symKL ( P , P ) = D KL ( P ||P ) + D KL ( P ||P ) , (33)where D KL is the Kullback-Leibler divergence D KL ( P ||P ) = (cid:88) x P ( x ) log P ( x ) P ( x ) . (34)In our context, the symmetric KL distance can beeﬃciently computed through averages of energy dif-ferences as D symKL ( P seed , P test ) == D KL ( P test ||P seed ) + D KL ( P seed ||P test )= (cid:104)H seed − H test (cid:105) P test + (cid:104)H test − H seed (cid:105) P seed , (35)where P ( . ) is the Boltzmann distribution associatedwith the energy H ( . ) , and the brackets (cid:104) ... (cid:105) P denotethe expectations with respect to P , which can be2 pottspotts pottspotts (a) (b) pottspotts pottspotts FIG. 5.

Comparison between DCAlign and HMMer for synthetic data.

Panels (a) and (b) show the histograms ofthe normalized metrics (Hamming distance, Gap + , Gap − and Mismatches), respectively, in the case of conserved sited only,and of correlated pairwise columns only. Here, the reference is the ground truth and the target is the alignment of DCAlign(blue) or HMMer (red). In (b) HMMer results are not shown because hmmsearch does not ﬁnd any relevant hit. easily estimated using a Monte Carlo sampling (incontrast to the normal D KL , which depends on theintractable normalization constants, i.e. the parti-tion functions, of the densities P ( . ) ). We run thecomparison using two diﬀerent models for H , thatis a Potts model and a proﬁle model. • Contact map . The couplings of DCA models can beused to detect the presence of physical interactionsbetween pairs of sites, which are distant in the one-dimensional chain but in close contact in the three-dimensional structure. A good score that indicatesa direct contact is the (average-product corrected)Frobenius norm of the coupling matrices, deﬁnedas F AP Ci,j = F i,j − (cid:80) m F i,m (cid:80) n F n,j (cid:80) m,n F m,n , (36)where F i,j = (cid:115) (cid:88) A (cid:54) = (cid:48) − (cid:48) ,B (cid:54) = (cid:48) − (cid:48) J i,j ( A, B ) , (37)and the couplings are in the zero-sum gauge. Wethus compare predicted contact maps obtained us-ing the parameters learned from our alignments andthose obtained by HMMer. For this purpose, weuse the PlmDCA method to learn the couplings from the alignments, because it is faster than theBoltzmann machine and it is known to be reli-able for contact prediction [21]. The ground-truthsdenoting the physical interactions in each proteinare obtained running the Pfam interactions pack-age [51]. Two sites are said to be in contact ifthe minimum atomic distance among all the atomsand among all the available protein structures isless than 8 ˚ A . VI. TEST ON SYNTHETIC DATA

Here we describe two experiments on synthetic data,constructed to compare the performances of DCAlignto state-of-the-art methods in extreme settings: onedataset presents conserved but not coevolving sites (i.e.strong variations in amino-acid frequency on each sitebut no correlation between distinct sites), while the otherpresents not conserved but coevolving sites (i.e. uniformfrequency 1 /q of amino-acids on each site, but strongcorrelations between distinct sites). A. Conservation

The ﬁrst MSA is generated from a non-trivial proﬁlemodel, in which the empirical probability of observing3

FIG. 6.

Energies of synthetic sequences.

We show herethe scatter plots of the DCA energy (or Potts Hamiltonian) in(a) for the true synthetic sequences (x-axis) against the onesaligned by DCAlign( H potts ) (y-axis). In (b), same plot usingthe total cost function E . any of the possible amino acids is position-dependentand it is not uniformly distributed among the possiblestates. The generative model used in this case is theproﬁle model of the PF00018 family, which can be easilylearned from the empirical single site frequencies. Fromthis model we generate 5 · “aligned” sequences (the“ground-truth”), to which we randomly add some inser-tions, according to the aﬃne insertion penalty distribu-tions learned from the PF00018 seed (Sec. IV B), and20 uniformly randomly chosen symbols at the beginningand at the end of the aligned sequence. We split thisalignment into a training set of 2 . · sequences, whichwe use as seed alignment to learn the insertion and gappenalties (using the scheme for abundant seeds) and aPotts model. We align the remaining 2 . · sequences,used as test set. For comparison, we build a HiddenMarkov Model using hmmbuild of the HMMer package on the training set and we align the test sequences throughthe hmmsearch tool.We show in Fig. 5(a) the histograms of the (normal-ized) Hamming distances, Gap + , Gap − and Mismatchesof the MSA obtained by DCAlign and HMMer comparedto the ground-truth. We observe that DCAlign is able toﬁnd the correct hits and to align them in a more preciseway if compared to HMMer. In fact, the Hamming dis-tance distribution is shifted to smaller values, suggestingthat the number of errors, per sequence, is smaller thanthat obtained by HMMer. The nature of the mistakesseems to be linked to the presence of mismatches in thecase of HMMer, while DCAlign (less often) equally likelyinserts more or less gaps, or a match to the wrong symbol.While DCAlign is in principle constructed to exploit co-variation, these results show that even in cases in which,by construction, there is no covariation signal, DCAlignis able to perform equally good (or even better) thanstate-of-the-art methods. B. Covariance

The second experiment is instead focused on correlateddata. We ad-hoc construct an alignment whose ﬁrst mo-ment statistics resembles that of a uniform distribution,i.e. the probability of observing any amino-acid, in anycolumn of the seed, is 1 /q . In other words, we con-struct the data in such a way that no conserved sitesare present. At the same time, we force the sequencesto show coevolving (i.e. correlated) sites, such that theempirical probability of observing a pair of amino acidsis diﬀerent from that obtained in the uniform distribu-tion, i.e. f ij ( S, S (cid:48) ) = δ S i ,S δ S j ,S (cid:48) (cid:54) = q for some ( i, j ),where the overline indicates the empirical average. Toconstruct a dataset with this statistics, we use as gener-ative model a Potts model with 4 colors (like the RNAalphabet, without the gap state) having non-zero cou-plings J ij ( S i , S j ) = − δ S i ,S j (i.e. an anti-ferromagneticPotts model) and no ﬁelds. The non-zero couplings areassociated with the links of a random regular graph of 50nodes and degree 5. The presence of the links ensures theappearance of non-trivial second moments while, in or-der to avoid “polarized” sites, we sample the model (i.e.the Boltzmann distribution associated with this Hamil-tonian) at temperature T = β = 0 .

3, which is deep inthe paramagnetic phase of this model [52]. We performthe same training pipeline presented in Sec. VI A exceptfor the learning of the gap penalties: because there areno gap states, we set µ ext = 0 and µ int = 0.In this case, due to the absence of any conservation, hmmsearch does not ﬁnd any eligible hit. In fact, HMMertries to align the sequences via a computationally exactrecursion on a HMM, but it has no information to exploitwhile setting up the HMM from the training set, becauseall amino-acids are equally likely to occur in each col-umn. This represents, of course, the worst-case scenariofor HMM-based methods. On the contrary, the couplings4of the learned Potts model allow DCAlign to align thiskind of sequences. We remark that contrarily to HM-Mer, DCAlign has complete information on the statisticsof the training alignment, up to second order covariances.However, being a heuristic method, it sometimes fails toachieve the (global) minimum of the cost function andconverges to a local minimum, which depends on the ini-tialization of the target marginals. Re-iterating the MFequations using 10 diﬀerent seeds of the random num-ber generator suﬃces to reach the proper minimum atleast once, for the majority of sequences. We remarkthat this issue is present only when the MSA does notshow any conserved site and thus the algorithm has noeasy “anchoring” point, which surely helps lifting the de-generacies in the alignment procedure. For protein andRNA families presented below the algorithm seems stableand only one minimum emerges upon re-initialization ofthe marginal probabilities of the algorithm. We quanti-tatively measure the performance of DCAlign using foursequence-based metrics and the energies associated withthe aligned sequences. For this experiment, we refer tothe output of our algorithm as the aligned sequence thathas the minimum energy among the 10 trial re-iterationsof the algorithm. We report the distance metrics inFig. 5(b): the distribution of the Hamming distancessuggests that DCAlign can almost perfectly align themajority of the target sequences. Indeed, as shown inFig. 6, the energies (the Potts Hamiltonian alone or thefull cost function which includes the gap and insertionpenalties) are identical or very close to the energies ofthe true sequences. Only 0.08% of the aligned sequenceshave a Hamming distance density larger than 0.30 (i.e.15 missed positions over 50). This fraction is so low tobe invisible in the histograms of Fig. 5(b). These ex-treme cases, in which our algorithm converged to a localminimum in every trial we performed, are represented aspurple points in the scatter plots in Fig. 6. VII. TEST ON PROTEIN AND RNA FAMILIESA. Choice of families

We show here the performance of our align-ment method for several RNA and protein fami-lies. In particular we select the families PF00035,PF00677, PF00684, PF00763 from the Pfam database( https:://pfam.xfam.org release 32.0) [12], andRF00059, RF00162, RF00167 and RF01734 fromthe Rfam database ( https:://rfam.xfam.org release14.2) [53]. The number of sequences, length of the mod-els, and gap penalties used in the simulations are reportedin Table II.We restrict our analysis to “short” families, having L of at most 100 positions, in order to avoid a signiﬁ-cant slowing down of the alignment process. The seedof PF00035 contains very few sequences, contrarily toPF00677, PF00684, PF00763, which have been chosen because of their large eﬀective number of seed sequences M eﬀ > hmmbuild ) and weapply hmmalign to the full length RNA sequences. Wechoose precisely these families because of their reason-able length, the abundance of the seed sequences, andthe large number of available crystal structures, whichare useful for the contact map comparison. B. Comparison with state-of-the-art methods

As a ﬁrst comparison, we compute the sequence-basedmetrics presented in Sec. V B comparing our full align-ment to that achieved by HMMer, for protein sequences,or by Infernal, when dealing with RNA families. We showthe results for PF00035 in Fig. 7(a-d) and for RF00167in Fig. 7(e-f), which are representative of the typical sce-nario for protein and RNA sequences. The distribution ofall metrics is mostly concentrated in the ﬁrst bins (the binwidth is set here to 0.01) and decays smoothly at largerdistances. The peak in the ﬁrst bin is more prominentwhen the Hamiltonian used for the alignment is phmm indicating that the sequences aligned by this method arecloser to those obtained by HMMer (or Infernal) thanthe outcomes of DCAlign- potts , as one would expect fromthe similarity between the two alignment strategies (seeSec. IV A).A notably diﬀerent behavior emerges for the sequencesof the PF00677 family, as shown in Fig. 8(a-d). It isclear from Fig. 8(a) that a large fraction of the sequencesaligned by DCAlign diﬀers from those aligned by HM-Mer by about 40% of the symbols when using H potts (thepercentage is reduced to about 30% when using H phmm ).The reason seems to be partially linked to the presence ofmismatches and to a non-negligible fraction of additionalgaps, as indicated by Gap + . We notice that, diﬀerentlyfrom the other families, the seed of PF00677 is composedby several clusters of sequences mostly diﬀering in the gapcomposition: a copious fraction of them have generallyfew gaps while some other show two long and localizedstretches of gaps. The structure of the seed can be bettercharacterized in the principal components space, as de-5 Identiﬁer

L M seed M eﬀ seed PDB N seq µ ext µ int H potts H phmm H potts H phmm PF00035 67 81 81 73 19751 2.50 2.00 0.00 2.00PF00677 87 1878 1518 9 14683 0.00 0.50 2.00 2.00PF00684 67 1512 1349 3 10000 0.00 0.00 2.50 2.00PF00763 116 1389 1355 24 10000 1.50 2.50 1.00 1.50RF00162 108 433 241 25 6026 3.50 3.50 3.00 4.00RF00167 102 133 105 49 2631 0.50 1.50 2.00 4.00RF01734 63 287 287 6 2017 1.00 0.00 2.00 1.50RF00059 105 109 83 24 12558 0.00 0.50 1.50 1.50TABLE II.

Features of the protein and RNA families used in this work.

We show here the length L of the sequencesfor each family, the value of M and M eﬀ [20] for the seed alignment, the number of PDBs used to determine the true contactmaps based of real observations of the domains structure, the number of the sequences, N seq , to be aligned by our methodsand the value of the gap penalties associated with each family and Hamiltonian. (a) (b)(c) (d) (e) (f)(g) (h) FIG. 7.

DCAlign vs state-of-the-art methods for PF00035 and RF00167.

We plot here the histograms of the Hammingdistances, Gap + , Gap − and Mismatches for the protein family PF00035 (a, b, c, d) using as reference the HMMer results andas target the DCAlign results, and for the RNA family RF00167 (e, f, g, h) using as reference the Infernal results and as targetthe DCAlign results. picted in Fig. 8(e), where we plot the projections of theseed sequences in the space of the ﬁrst two principal com-ponents as ﬁlled circles. The colors refer to the densityof sequences in the (discretized) space. To understandthe disagreement between HMMer and our methods wesuperimpose the projections of the sequences responsiblefor the huge peaks in the Gap + histogram in the princi-pal component space of the seed sequences, using browncrosses for the sequences aligned by HMMer and pinkdiamonds for those found by DCAlign- potts (note thatnone of these sequences is a re-alignment of a seed se-quence). Only a small fraction of the HMMer sequences overlap with the central and poorly populated clusterwhile the projections obtained from sequences aligned byDCAlign- potts lie on a well deﬁned and populated cluster.We thus conclude that looking at the gap composition ofsequences is not suﬃcient in this case to understand thediﬀerent behavior of HMMer and DCAlign. A more accu-rate analysis in the principal components space suggeststhat the sequences obtained by HMMer are probably mis-categorized, at variance with DCAlign sequences that arein agreement with the seed structure.In summary, although for most of the families analyzedhere (the distribution of the four metrics for the remain-6 (e) potts FIG. 8.

DCAlign vs HMMer for PF00677 . In (a, b, c, d) we plot the Hamming distance, Gap + , Gap − and Mismatchesusing the sequences aligned by HMMer as reference and those obtained by DCAlign as target. In (e) we plot the projectionsof the seed sequences in the ﬁrst two principal components of the seed space; the color scale denotes the density of the space.The additional sequences (depicted as pink diamonds if aligned by DCAlign- potts or as brown crosses if aligned by HMMer)are responsible for the red peak around 0.2 in panel (b). ing families are shown in the SI) the sequences aligned byDCAlign are very similar to those obtained by HMMeror Infernal, the PF00677 family suggests a diﬀerent sce-nario, in which DCAlign is able to learn some non-trivialseed features and to align the target sequences accordingto them. C. Comparison with the seed

In this section, we compare the statistical properties ofthe MSAs obtained by DCAlign with those of the seed.

1. Kullback-Leibler distances

The statistics of a MSA can be characterized in termsof a statistical (DCA) model. Depending on the complex-ity of the model, a certain set of observables are ﬁttedfrom the MSA. For instance, in a proﬁle model only theﬁrst moments are ﬁtted, while in a Potts model we canalso ﬁt the information about second moments. Thesestatistical models deﬁne a probability measure over thespace of sequences and thus characterize a given pro-tein/RNA family. We consider here the seed sequencesas our ground truth, and we thus consider that a modellearned from the seed is the one that better character-izes the protein/RNA family under investigation. Wethen infer a second model from the full set of aligned se-quences, and we ask how diﬀerent is this model from that learned from the seed. To answer this question we com-pute the symmetric Kullback-Leibler divergence D symKL between the two models, see Sec. V B, which must beintended as a statistical measure of distance between theseed and the set of aligned sequences. In order to fairlycompare DCAlign, HMMer and Infernal, which by con-struction treat diﬀerently the pairwise co-variation of theMSA sites, we learn, from a seed and from the test align-ment, a proﬁle model H Prof and a Potts model H Potts .We show in Fig. 9(a,b) the results for all families andall methods, when the model learned is a Potts model ora proﬁle model, respectively. We notice that the align-ments produced by DCAlign- potts always, for the Pottscase, and very often, for the proﬁle case, minimize D symKL with respect to the seed. Infernal is very eﬀective whendealing with RNA sequences but not as good as DCAlign- potts for the majority of the cases. HMMer always pro-duces the largest distance (except for PF00763 where ba-sically all methods perform equally good), in particularfor RNA families. We mention that alignments producedby hmmalign present aligned sequences that always showlong concatenated gaps at the beginning and at the endof the sequence, diﬀerently to the Rfam full alignment,the seed sequences and the outputs of DCAlign. Thispartially explains the diﬀerence with respect to the otheralignment tools.These results suggest that the more we use additionalinformation within the alignment process (in particular,when learning H potts we employ all positions and amino-acid dependent pairwise energy function), the closer the7 Infernal Infernal

FIG. 9.

Symmetric Kullback-Leibler distances.

Weplot the symmetric KL distance between the MSA and theseed alignment, computed via a Potts model (a) or a proﬁlemodel (b), for all the families and all the alignment methodswe considered. ﬁnal alignment will be to the seed. Surprisingly, this fea-ture is retrieved even when the model learned from thefull set of aligned sequences uses less information thanthe model used to align, e.g. for the proﬁle model usedin Fig. 9(b). Of course, there is a tradeoﬀ, because in-cluding additional statistical properties of the seed in thealignment process requires a larger seed.

2. Proximity measures

We present here a sequence-based comparison be-tween a candidate alignment, i.e. an alignment obtainedby DCAlign- potts , DCAlign- phmm , or HMMer/Infernal,and the seed that will be considered here as the referencealignment. The metrics we use is the proximity measureintroduced in Sec. V B. We show in Fig. 10 the distribu-tion of the minimum distances for a representative subsetof the families, i.e. PF00677 in panel (a), PF00684 in (b),RF00162 in (c) and RF01734 in (d). Results for PF00035,PF00762, RF00059 and RF00167 are shown in the SI.We notice that for the majority of the families (proteinor RNA) the histograms built from DCAlign- potts have alarge peak in the ﬁrst bin (which collects distances from0 to 0.02), suggesting that there exist more sequences inthis alignment which are close to the seed than in anyother alignment. A large peak at small distance is alsoobserved for Infernal when dealing with RNA families, asseen from the blue histograms in Fig. 10(c) and (d). TheInfernal results overlap quite well with those obtained byDCAlign- phmm . The histograms produced by HMMerseem to be shifted to larger Hamming distances, thus re-ﬂecting a smaller similarity to the seed than all the othermethods. Although DCAlign- phmm exploits similar in-formation to that encoded in HMMer, the correspondingalignment surprisingly produces, for most of the studiedfamilies, results that are more similar to those obtainedby DCAlign- potts or Infernal.

Infernal Infernal

FIG. 10.

Distribution of proximity measures.

His-tograms of the minimum distances computed according toEq. (32) for the full set of aligned sequences obtainedby DCAlign- potts , DCAlign- phmm , HMMer, and Infernal,against the seed. Panels (a), (b), (c) and (d) refer to thefamilies PF00677, PF00684, RF00162, RF01734 respectively.

D. Contact prediction

An important test of the quality of a MSA is related tothe interpretability of the DCA parameters learned fromit. As mentioned in Sec. V B, the largest couplings are aproxy for the physical contacts in the folded structure ofthe protein domains. In Table III we report a summaryof the results for three observables associated with thecontact prediction: the position of the ﬁrst false positivein the ranked Frobenius norms, the value of the TruePositive Rate (TPR) at 2 · L and the position at whichthe TPR is less than 0.80 for the ﬁrst time. The boldnumber corresponds to the largest value, and thereforethe best performance, among all the methods.We show in Fig. 11 (a,b,c,d) the Positive PredictiveValue (PPV) curves (left) and the contact maps (right)for the PF00035, PF00684, PF00763 and RF00162 fami-lies respectively (results for PF00677 and the other RNAfamilies are shown in the SI). The PPV curves are con-structed by plotting the fraction of true positives TP asa function of the number of predictions (TP and FP),i.e. PPV=TP/(TP+FP). The true contact maps are ex-tracted from all the available PDBs and plotted as grayﬁlled squares, while the predicted contact maps are con-structed by plotting the Frobenius norms of the DCAcouplings that are larger than an arbitrary threshold,here set to 0.20. For RNA sequences the comparisonbetween the predictions and the ground-truth can be per-formed only using the Frobenius norms associated with8 Identiﬁer First FP, TPR(2L), TPR < H Pottsseed

DCAlign ( H potts ) DCAlign ( H phmm ) HMMer InfernalPF00035 9, 0.478, 13 , 0.754, 98 32, 0.799, , 119 -PF00677 22, 0.730, 109 , 0.759, 147 28, 0.747, 128 31, , -PF00684 20, 0.582, 28 , 0.672, 101 27, ,

23, 0.627, 73 -PF00763 80, 0.703, 159 89, 0.828,

85, 0.836, 254 , , 272 -RF00059 18, 0.369, 29 29, 0.531, 57 29, 0.519,

64 37 , 0.519, 59 33, , RF00162 15, 0.306, 52 17, 0.449,

69 25 , 0.398, 61 22, 0.426, 67 19, , 61RF00167 19, 0.324, 27 27, 0.493, 59 25, 0.556, 53 22, 0.577,

62 28 , , 57RF01734 , 0.300, 16 , 0.380, 16 , , 16 , 0.360, 16 , 0.400, TABLE III.

Summary of the contact map results.

For each protein or RNA family we show here three metrics computedfrom the PPV curve retrieved from a set of Potts models. H seed is a Potts model learned using the seed sequences alone, whilethe others are associated with the complete alignments obtained by DCAlign- potts , DCAlign- phmm , HMMer and Infernal. Thechosen observables give the position of the ﬁrst false positive (First FP), the value of the true positive rate (TPR) computedafter 2 · L predictions and the rank at which the value of the true positive rate is smaller than 0.80 for the ﬁrst time. Aperfect prediction is obtained if all the true positive contacts are associated with the highest value of the Frobenius norm,thus the higher the value of these metrics, the better the prediction of the contact maps. We show in bold numbers the bestperformances, for all metrics and among all the methods. Infernal Infernal (a) (b) (c) (d)

FIG. 11.

Contact predictions.

We show the Positive Predictive Value curves on the top panels and, on the bottom ones,the contact map retrieved by a set of known crystal structures (gray squares) and the Frobenius norms (computed from thefull set of aligned sequences or the seed), for (a) PF00035, (b) PF00684, (c) PF00763 and (d) RF00162. the central part of the aligned sequences, because thereis no available structural information about the sites onthe boundaries. In addition to the predictions obtainedfrom the full set of aligned sequences, we show, for com-parison, the predicted contact map obtained from thePotts model inferred from the seed sequences alone. Aswe can notice from Table III and the plot of the contactmaps, there is no strategy that clearly outperforms theothers (except the poor results of seed which are easilyexplained by its limited number of sequences). For RNA families, Infernal seems to accomplish the best predic-tions in terms of ﬁrst FP and TPR but nonetheless allthe other methods, included HMMer, show comparableresults. In fact, although HMMer has the tendency toassign consecutive gaps in the ﬁrst and last sites of thealigned sequences, these regions are not considered in thecomparison, and the core part of the alignment suﬃcesto obtain similar results, in terms of contact prediction,to the other methods. Although in the other metricspresented above there was no clear diﬀerence between9models learned from large or small seeds, in the contactmaps comparison this seems to be an important issue. In-deed, the amount of sequences in the seed slightly aﬀectsthe quality of the contact map for our methods: for thePF00035 family (whose seed contains only 81 sequences)DCAlign reaches slightly worse performances than HM-Mer. On the contrary, for the PF00763 all methods pro-duce indistinguishable PPV curves and contact maps. Fi-nally, we remark the results for PF00684 in Fig. 11(b)where DCAlign achieves a better contact prediction, asmanifested by the PPV lines. This result, not linked tothe way of encoding the seed statistics within the model,but shared by both H potts and H phmm , could be causedby a better treatment of the insertions with respect toHMMer. VIII. CONCLUSIONS

In this work, we have developed and tested DCAlign,a method which allows us to align biological sequencesto Potts models of a seed alignment. The set of hyper-parameters characterizing the models are inferred by aninverse statistical-physics based method known as DirectCoupling Analysis from a seed to capture both the single-and two-site (or column) statistics. Single site statisticsoften signals residue conservation, i.e. the propensityof some sites to restrict the variation of residue (amino-acid or nucleotide) composition because speciﬁc residuesare functionally and/or structurally important at certainpositions. The two-site statistics is related to residue co-variation, or equivalently coevolution. Indeed, residuesin direct contact in a folded protein must preserve bio-physically compatible properties, leading to a correlatedevolution of pairs of sites.Most standard alignment algorithms are based on theassumption of independent-site evolution, which is statis-tically encoded via the so-called proﬁle models. In thesealignment procedures, strongly conserved residues serveas anchoring points, and a mismatch in these positionssurely induces bad alignment scores (i.e. high energiesusing a physics-like terminology). Variable sites, char-acterized by high entropy values in the seed MSA, doprovide little information for aligning a new sequence tothe seed.However, often residue pairs show a strong degree ofcovariation as reﬂected in two-site statistics and as a con-sequence, this important collective information must betaken into account. Up to now, the only example in whichthis information is exploited in the alignment procedure is the case of RNA sequences where the possible basepairing (Watson-Crick or wobble pairs) is encoded in thecovariance models used to align.DCAlign takes advantage of both conservation and co-variation information contained in the seed alignment.It is able to detect within a candidate sequence the mostcompatible domain among all the possible sub-sequences,as that maximizing a score. The latter can be under-stood as a probability measure of the domain accordingto a Boltzmann distribution carefully built from the DCAmodel and gap/insertion penalties learned from the seed.Using synthetic data at ﬁrst, we were able to show thatthe algorithm is able to work under both extreme con-ditions, when all information is contained in conserva-tion but none in covariation, or vice versa. This rendersDCAlign universally applicable, in contrast to more spe-cialized alignment algorithms like HMMer (using proﬁleHMM) or Infernal (using covariance models based on sec-ondary RNA structure).This universal applicability is well conﬁrmed in thecase of real data; we tested both protein and RNA se-quence data, aligning large numbers of sequences to theseed MSA provided by the Pfam and Rfam databases.We ﬁnd that in most cases, our algorithm performs com-parably well to the specialized methods, while e.g. proﬁleHMM applied to RNA perform less well. Interestingly,in one of the studied protein families, we ﬁnd a largegroup of proteins, which are aligned diﬀerently by HM-Mer and DCAlign. The sequences aligned by DCAlignshow a better coherence with the sequence-space struc-ture of the seed MSA than those aligned by HMMer,suggesting that the alignment proposed by DCAlign isactually to be preferred in this case.

ACKNOWLEDGMENTS

The authors thank Sean Eddy, Alessandra Carbone,Francois Coste and Hugo Talibart for interesting discus-sions. APM thanks Edoardo Sarti for interesting dis-cussions and his assistance with the

Pfam interactions code. APM, AP, and MW acknowledge funding by theEU H2020 research and innovation programme MSCA-RISE-2016 under grant agreement No. 734439 INFER-NET. APM and FZ acknowledge funding from the Si-mons Foundation ( [1] R. Durbin, S. Eddy, A. Krogh, and G. Mitchison,

Biolog-ical Sequence Analysis: Probabilistic Models of Proteinsand Nucleic Acids (Cambridge University Press, 1998). [2] S. B. Needleman and C. D. Wunsch, Journal of MolecularBiology , 443 (1970). [3] T. F. Smith and M. S. Waterman, Journal of MolecularBiology , 195 (1981).[4] R. C. Edgar and S. Batzoglou, Current opinion in struc-tural biology , 368 (2006).[5] D.-F. Feng and R. F. Doolittle, Journal of molecular evo-lution , 351 (1987).[6] J. D. Thompson, D. G. Higgins, and T. J. Gibson, Nucleicacids research , 4673 (1994).[7] C. Notredame, D. G. Higgins, and J. Heringa, Journal ofmolecular biology , 205 (2000).[8] K. Katoh, K. Misawa, K.-i. Kuma, and T. Miyata, Nu-cleic acids research , 3059 (2002).[9] R. C. Edgar, Nucleic acids research , 1792 (2004).[10] S. F. Altschul, T. L. Madden, A. A. Sch¨aﬀer, J. Zhang,Z. Zhang, W. Miller, and D. J. Lipman, Nucleic acidsresearch , 3389 (1997).[11] S. R. Eddy, PLoS computational biology (2011).[12] S. El-Gebali, J. Mistry, A. Bateman, S. R. Eddy, A. Lu-ciani, S. C. Potter, M. Qureshi, L. J. Richardson, G. A.Salazar, A. Smart, E. L. Sonnhammer, L. Hirsh, L. Pal-adin, D. Piovesan, S. C. Tosatto, and R. D. Finn, NucleicAcids Research , D427 (2018).[13] S. R. Eddy, Bioinformatics (Oxford, England) , 755(1998).[14] E. P. Nawrocki and S. R. Eddy, Bioinformatics , 2933(2013).[15] R. Nussinov and A. B. Jacobson, Proceedings of the Na-tional Academy of Sciences , 6309 (1980).[16] M. Zuker and D. Sankoﬀ, Bulletin of mathematical biol-ogy , 591 (1984).[17] W. Fontana, D. A. M. Konings, P. F. Stadler, andP. Schuster, Biopolymers , 1389 (1993).[18] D. De Juan, F. Pazos, and A. Valencia, Nature ReviewsGenetics , 249 (2013).[19] M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, andT. Hwa, Proceedings of the National Academy of Sciences , 67 (2009).[20] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S.Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa,and M. Weigt, Proceedings of the National Academy ofSciences , E1293 (2011).[21] M. Ekeberg, C. L¨ovkvist, Y. Lan, M. Weigt, and E. Au-rell, Phys. Rev. E , 012707 (2013).[22] S. Cocco, C. Feinauer, M. Figliuzzi, R. Monasson, andM. Weigt, Reports on Progress in Physics , 032601(2018).[23] Y. Roudi, J. Tyrcha, and J. Hertz, Phys. Rev. E ,051915 (2009).[24] V. Sessak and R. Monasson, Journal of Physics A: Math-ematical and Theoretical , 055001 (2009).[25] A. Decelle and F. Ricci-Tersenghi, Phys. Rev. Lett. ,070603 (2014).[26] H. C. Nguyen, R. Zecchina, and J. Berg, Advances inPhysics , 197 (2017).[27] D. S. Marks, L. J. Colwell, R. Sheridan, T. A. Hopf,A. Pagnani, R. Zecchina, and C. Sander, PLOS ONE ,1 (2011).[28] J. I. Sulkowska, F. Morcos, M. Weigt, T. Hwa, and J. N.Onuchic, Proceedings of the National Academy of Sci-ences , 10340 (2012).[29] T. A. Hopf, L. J. Colwell, R. Sheridan, B. Rost,C. Sander, and D. S. Marks, Cell , 1607 (2012).[30] S. Ovchinnikov, H. Park, N. Varghese, P.-S. Huang, G. A.Pavlopoulos, D. E. Kim, H. Kamisetty, N. C. Kyrpides, and D. Baker, Science , 294 (2017).[31] A. Procaccini, B. Lunt, H. Szurmant, T. Hwa, andM. Weigt, PloS one (2011).[32] C. Baldassi, M. Zamparo, C. Feinauer, A. Procaccini,R. Zecchina, M. Weigt, and A. Pagnani, PLOS ONE ,1 (2014).[33] C. Feinauer, H. Szurmant, M. Weigt, and A. Pagnani,PLOS ONE , 1 (2016).[34] A.-F. Bitbol, R. S. Dwyer, L. J. Colwell, and N. S.Wingreen, Proceedings of the National Academy of Sci-ences , 12180 (2016).[35] T. Gueudr´e, C. Baldassi, M. Zamparo, M. Weigt, andA. Pagnani, Proceedings of the National Academy of Sci-ences , 12186 (2016).[36] Q. Cong, I. Anishchenko, S. Ovchinnikov, and D. Baker,Science , 185 (2019).[37] G. Croce, T. Gueudr´e, M. V. R. Cuevas, V. Keidel,M. Figliuzzi, H. Szurmant, and M. Weigt, PLoS com-putational biology (2019).[38] R. S. Dwyer, D. P. Ricci, L. J. Colwell, T. J. Silhavy, andN. S. Wingreen, Genetics , 443 (2013).[39] Cheng, Ryan R. and Morcos, Faruck and Levine, Her-bert and Onuchic, Jose N., Proceedings of the NationalAcademy of Sciences , E563 (2014).[40] M. Figliuzzi, H. Jacquier, A. Schug, O. Tenaillon, andM. Weigt, Molecular Biology and Evolution , 268(2015).[41] Cheng, R. R. and Nordesj¨o, O. and Hayes, R. L. andLevine, H. and Flores, S. C. and Onuchic, J. N. andMorcos, F., Molecular Biology and Evolution , 3054(2016).[42] T. A. Hopf, J. B. Ingraham, F. J. Poelwijk, C. P. Sch¨arfe,M. Springer, C. Sander, and D. S. Marks, Nature biotech-nology , 128 (2017).[43] J. M. Reimer, M. Eivaskhani, I. Harb, A. Guarn´e,M. Weigt, and T. M. Schmeing, Science (2019).[44] W. P. Russ, M. Figliuzzi, C. Stocker, P. Barrat-Charlaix,M. Socolich, P. Kast, D. Hilvert, R. Monasson, S. Cocco,M. Weigt, and R. Ranganathan, Science (in press)(2020).[45] M. M´ezard and A. Montanari, Information, physics, andcomputation (Oxford University Press, 2009).[46] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborov´a,Phys. Rev. E , 066106 (2011).[47] M. Figliuzzi, P. Barrat-Charlaix, and M. Weigt, Molecu-lar Biology and Evolution , 1018 (2018).[48] D. J. Thouless, P. W. Anderson, and R. G. Palmer, Philo-sophical Magazine , 593 (1977).[49] M. Ekeberg, T. Hartonen, and E. Aurell, Journal of Com-putational Physics , 341 (2014).[50] R. D. Finn, J. Clements, and S. R. Eddy, Nucleic AcidsResearch , W29 (2011).[51] Pfam domain-domain interaction benchmark , https://github.com/infernet-h2020/pfam_interactions (2020 – accessed February 16, 2020).[52] F. Krzakala and L. Zdeborov´a, EPL (Europhysics Let-ters) , 57005 (2008).[53] I. Kalvari, J. Argasinska, N. Quinones-Olvera, E. P.Nawrocki, E. Rivas, S. R. Eddy, A. Bateman, R. D.Finn, and A. I. Petrov, Nucleic Acids Research , D335(2017).[54] E. De Leonardis, B. Lutz, S. Ratz, S. Cocco, R. Monas-son, A. Schug, and M. Weigt, Nucleic acids research ,10444 (2015). SUPPORTING INFORMATION1. Belief propagation equation

In order to write the BP equations, it is convenient to introduce weights W i,j associated to pair interactions in theBoltzmann weight Eq. (15). For i < j we deﬁne W i,j ( x i , n i , x j , n j ) = (cid:40) e J i,i +1 ( A xini ,A xi +1 ni +1 ) χ sr ( x i , n i , x i +1 , n i +1 ) , for j = i + 1 ,e J i,j ( A xini ,A xjnj ) χ lr ( x i , n i , x j , n j ) , for j > i + 1 , (38)and for i > j we just set W i,j ( x i , n i , x j , n j ) = W j,i ( x j , n j , x i , n i ). The total Boltzmann weight then becomes W ( x , n ) = 1 Z L (cid:89) i =1 W i ( x i , n i ) × ,L (cid:89) i χ lr are omitted, then W ij = 1 for | i − j | > χ lr are redundant and can be omitted in an exact treatment. However, theapproximate BP equations depend on the χ lr and give diﬀerent results if these terms are omitted. In other words,the BP approximation does not “commute” with the insertion/removal of the χ lr constraints.

2. From belief propagation to mean ﬁeld

We can obtain the mean ﬁeld equations by considering a weak interaction (or large connectivity) limit of theBP equations. This limit consists in making two approximations for pairs of sites j (cid:54) = { i + 1 , i − } that are notnearest-neighbor in the linear chain:1. We assume that J i,j ( A, B ) is suﬃciently small to approximate e J i,j ( A,B ) (cid:39) J i,j ( A, B ). This is likely thecase for far away sites that are not in contact in the three-dimensional protein structure. Moreover, when theHamiltonian is written in zero-sum gauge, all couplings tend to be small (often the largest ones are ∼ J i,j .2. We approximate the messages m i → j ( x i , n i ) with the marginal density P i ( x i , n i ), which is a reasonable choicewhen i and j and far enough that the inﬂuence of j on i is negligible.We emphasize that while both approximations are exact in a mean ﬁeld setting, in which one takes the thermodynamiclimit L → ∞ with couplings vanishing with L , they do not hold exactly for our setting in which both L and thecouplings are ﬁnite. Moreover, an important remark is that this approximation does not preserve the gauge invarianceof the Hamiltonian.To see the eﬀect of these approximations, consider, for instance, the marginal density of a node 1 < i < L in the belief propagation framework. Let us call the nearest-neighbor messages F i ( x i , n i ) = m i → i +1 ( x i , n i ) and B i ( x i , n i ) = m i → i − ( x i , n i ). The contributions coming from the right and from the left along the linear chains arethen F i ( x i , n i ) = (cid:88) x i − ,n i − W i − ,i ( x i − , n i − , x i , n i ) F i − ( x i − , n i − ) , B i ( x i , n i ) = (cid:88) x i +1 ,n i +1 W i,i +1 ( x i , n i , x i +1 , n i +1 ) B i +1 ( x i +1 , n i +1 ) , (42)2and we have P i ( x i , n i ) = 1 z i F i ( x i , n i ) B i ( x i , n i ) W i ( x i , n i ) (cid:89) k (cid:54) = { i − ,i,i +1 } (cid:88) x k ,n k : χ lr ( x i ,n i ,x k ,n k )=1 e J i,k ( A xini ,A xknk ) m k → i ( x k , n k ) . (43)Applying the weak interaction approximation to the contribution of distant sites, and introducing A k → i ( x i , n i ) = (cid:88) x k ,n k χ lr ( x i , n i , x k , n k ) P k ( x k , n k ) , A i ( x i , n i ) = (cid:89) k (cid:54) = { i − ,i,i +1 } A k → i ( x i , n i ) , (44)we obtain P i ( x i , n i ) ∼∼ z i F i ( x i , n i ) B i ( x i , n i ) W i ( x i , n i ) (cid:89) k (cid:54) = { i − ,i,i +1 }  A k → i ( x i , n i ) + (cid:88) x k ,n k : χ lr ( x i ,n i ,x k ,n k )=1 J i,k ( A x i n i , A x k n k ) P k ( x k , n k )  ∼ z i A i ( x i , n i ) F i ( x i , n i ) B i ( x i , n i ) W i ( x i , n i ) e (cid:80) k (cid:54) = { i − ,i,i +1 } (cid:80) xk,nk χ lr ( xi,ni,xk,nk ) Ji,k ( Axini,Axknk ) Pk ( xk,nk ) (cid:80) xk,nk χ lr ( xi,ni,xk,nk ) Pk ( xk,nk ) . (45)Introducing the modiﬁed weight C i ( x i , n i ) = A i ( x i , n i ) W i ( x i , n i ) e (cid:80) k/ ∈{ i,i ± } (cid:80) xk,nk χ lr( xi,ni,xk,nk ) Ji,k ( Axini,Axknk ) Pk ( xk,nk ) (cid:80) xk,nk χ lr( xi,ni,xk,nk ) Pk ( xk,nk ) , (46)we obtain that P i ( x i , n i ) = 1 z i F i ( x i , n i ) B i ( x i , n i ) C i ( x i , n i ) , (47)which amounts to use the transfer matrix expression with the replacement W i → C i . A very similar procedure canbe applied to treat the long range part in the forward and backward messages, F i and B i , with the same result: theequations are identical to the transfer matrix equations with W i → C i . This procedure thus provides a set of closedequations for the forward and backward messages and the marginal probabilities.For simplicity, we also make an additional approximation, namely that for each pair of distant sites i, k and foreach “relevant” choice (in a sense that will be more precise below) of x i , n i , we have A k → i ( x i , n i ) = (cid:88) x k ,n k χ lr ( x i , n i , x k , n k ) P k ( x k , n k ) ∼ ⇒ A i ( x i , n i ) ∼ . (48)In words, this amounts to assume that the long-range constraints we artiﬁcially introduce do not play a big role on thenormalization of marginals, i.e. that most of the mass of P k is concentrated on compatible assignments of x k , n k . Notethat for very unlikely values of x i , n i (e.g. assigning x i = 0 , n i = N + 1 at the very beginning of the sequence), theapproximation in Eq. (48) is probably not correct. However, the probability P i for such values is already suppressedby the short-range terms, making the error irrelevant. Under this approximation, the modiﬁed weights reduce to C i ( x i , n i ) = W i ( x i , n i ) e (cid:80) k/ ∈{ i,i ± } (cid:80) xk,nk χ lr ( x i ,n i ,x k ,n k ) J i,k ( A xini ,A xknk ) P k ( x k ,n k ) . (49)The advantage of this additional approximation is that when the long-range couplings J i,j →

0, the mean ﬁeldequations reduce exactly to the transfer matrix equations, as it should be.

3. Mean ﬁeld free energy

We give here the expression of the free energy associated with the Boltzmann weight in Eq. (15). This free energycan be used, for example, as a score to compare the quality of the alignment of diﬀerent sub-regions of the same verylong sequence. The free energy associated with Belief Propagation is given by F = − T (cid:88) i log z i + T (cid:88) (cid:104) ij (cid:105) log z ij , (50)3where the second sum is over distinct pairs (cid:104) ij (cid:105) . Here, the site term z i is the denominator in Eq. (41), while the linkterm is given by z ij = (cid:88) x i ,n i ,x j ,n j m i → j ( x i , n i ) m j → i ( x j , n j ) W i,j ( x i , n i , x j , n j ) . (51)In the weak interaction approximation, we can write the site term z i as simply the denominator of the single-sitemarginals, as deﬁned in Eq. (47), i.e. z i = (cid:88) x i ,n i C i ( x i , n i ) F i ( x i , n i ) B i ( x i , n i ) . (52)(with small diﬀerences on the boundaries). For the link term z ij we should distinguish between nearest neighboringsites, and far away sites. For the ﬁrst case we get z i − ,i = (cid:88) x i − ,n i − ,x i ,n i χ sr ( x i − , n i − , x i , n i ) e J i,i − ( A xi − ni − ,A xini ) F i − ( x i − , n i − ) B i ( x i , n i ) . (53)For the second case, j (cid:54) = i ±

1, we get, z ij = (cid:88) x i ,n i ,x j ,n j χ lr ( x i , n i , x j , n j ) e J ij ( A xini ,A xjnj ) m i → j ( x i , n i ) m j → i ( x j , n j ) (cid:39) (cid:88) x i ,n i ,x j ,n j χ lr ( x i , n i , x j , n j ) e J ij ( A xini ,A xjnj ) P i ( x i , n i ) P j ( x j , n j ) (cid:39) exp  (cid:88) x i ,n i ,x j ,n j χ lr ( x i , n i , x j , n j ) J ij ( A x i n i , A x j n j ) P i ( x i , n i ) P j ( x j , n j )  , (54)where we applied the mean ﬁeld approximation of identifying messages with marginals (ﬁrst to second line), andconsidering the interactions J ij as being small (second to third line). Note that in the second step we assumed that (cid:80) x i ,n i ,x j ,n j χ lr ( x i , n i , x j , n j ) P i ( x i , n i ) P j ( x j , n j ) = 1, i.e. that the marginals respect the long-range constraints.

4. Zero temperature limit

We discuss here brieﬂy how to take the zero temperature limit of the mean ﬁeld equations. In order to introducea temperature T = 1 /β (cid:54) = 1 we need to rescale all the parameters in the cost function, as J ij → βJ ij , h i → βh i , µ → β µ , λ → β λ .In order to take the T → P i ( x i , n i ) = e βπ i ( x i ,n i ) , F i ( x i , n i ) = e βφ i ( x i ,n i ) , B i ( x i , n i ) = e βψ i ( x i ,n i ) , C i ( x i , n i ) = e β c i ( x i ,n i ) , F i ( x i , n i ) = e β f i ( x i ,n i ) , B i ( x i , n i ) = e β b i ( x i ,n i ) . (55)We also deﬁne ( x ∗ i , n ∗ i ) the maximum of π ( x i , n i ). Note that in the ﬁrst line the messages are normalized, so π ( x ∗ i , n ∗ i ) = 0, and a similar relation for the other messages. In the second line instead, messages are not normalized.The zero temperature mean ﬁeld equations are then obtained by taking the limit β → ∞ , in which the sums over x i , n i are dominated by the maximum of the integrand. One should only take into account that the hard constraints χ in and χ end set to zero some elements of P , F , P L and B L , which translates in −∞ elements for π , φ , p L and ψ L . We obtain (with minor modiﬁcations at the boundaries):c i ( x i , n i ) = h i ( A x i · n i ) − µ ( n i )(1 − x i ) + (cid:88) j / ∈{ i,i ± } χ lr ( x i , n i , x ∗ j , n ∗ j ) J ij ( A x i · n i , A x ∗ j · n ∗ j ) , f i ( x i , n i ) = max x i − ,n i − : χ sr ( x i − ,n i − ,x i ,n i ) > [ φ i − ( x i − , n i − ) + J i − ,i ( A x i − · n i − , A x i · n i ) − ϕ i (∆ n i ) I ( n i − > I ( n i < N + 1)] , b i ( x i , n i ) = max x i +1 ,n i +1 : χ sr ( x i ,n i ,x i +1 ,n i +1 ) > [ ψ i +1 ( x i +1 , n i +1 ) + J i,i +1 ( A x i · n i , A x i +1 · n i +1 ) − ϕ i +1 (∆ n i +1 ) I ( n i > I ( n i +1 < N + 1)] , (56)4together with π i ( x i , n i ) = c i ( x i , n i ) + f i ( x i , n i ) + b i ( x i , n i ) − max x i ,n i [c i ( x i , n i ) + f i ( x i , n i ) + b i ( x i , n i )] ,φ i ( x i , n i ) = c i ( x i , n i ) + f i ( x i , n i ) − max x i ,n i [c i ( x i , n i ) + f i ( x i , n i )] ,ψ i ( x i , n i ) = c i ( x i , n i ) + b i ( x i , n i ) − max x i ,n i [c i ( x i , n i ) + b i ( x i , n i )] . (57)

5. Maximum likelihood equations for insertions

We determine the values of the insertion penalties λ o , λ e by maximizing the likelihood of the data, i.e. the M sequences of the seed, given the parameters: L (cid:16) { ∆ n } Ma =1 | λ o , λ e (cid:17) = log (cid:89) a P i (∆ n a | λ o , λ e ) − λ o − λ e = (cid:88) a log P i (∆ n a | λ o , λ e ) − λ o − λ e = (cid:88) a (cid:110) log (cid:104) δ ∆ n a , + (1 − δ ∆ n a , ) e − λ o − λ e (∆ n a − (cid:105) − log z (cid:111) − λ o − λ e , (58)where the regularization terms are used to avoid inﬁnite of undetermined parameters. From the likelihood we obtainan explicit expression of the gradient, ∂ L ∂λ o = (cid:88) a (cid:26) − (1 − δ ∆ n a , ) e − λ o − λ e (∆ n a − δ ∆ n a , + (1 − δ ∆ n a , ) e − λ o − λ e (∆ n a − − z ∂z∂λ o (cid:27) − λ o = (cid:88) a (cid:40) − (1 − δ ∆ n a , ) e − λ o − λ e (∆ n a − δ ∆ n a , + (1 − δ ∆ n a , ) e − λ o − λ e (∆ n a − + e − λ o (cid:0) − e − λ e (cid:1) − e − λ o (1 − e − λ e ) − (cid:41) − λ o = M (cid:34) e − λ o (cid:0) − e − λ e (cid:1) − e − λ o (1 − e − λ e ) − − (cid:2) − P seed (0) (cid:3) − M λ o (cid:35) ,∂ L ∂λ e = (cid:88) a (cid:26) − (1 − δ ∆ n a , ) (∆ n a − e − λ o − λ e (∆ n a − δ ∆ n a , + (1 − δ ∆ n a , ) e − λ o − λ e (∆ n a − − z ∂z∂λ e (cid:27) − λ e = (cid:88) a (cid:40) − (1 − δ ∆ n a , ) (∆ n a − e − λ o − λ e (∆ n a − δ ∆ n a , + (1 − δ ∆ n a , ) e − λ o − λ e (∆ n a − + e − λ o − λ e (cid:0) − e − λ e (cid:1) − e − λ o (1 − e − λ e ) − (cid:41) − λ e = M (cid:34) e − λ o − λ e (cid:0) − e − λ e (cid:1) − e − λ o (1 − e − λ e ) − − (cid:104) ∆ n (cid:105) P seed (∆ n ) + (cid:2) − P seed (0) (cid:3) − M λ e (cid:35) . (59)The maximum likelihood estimators can then be obtained by gradient descent, as detailed in the main text.5

6. Sequence-based distances plot (a) (b)(c) (d)(e)

FIG. 12.

Distances distribution.

We plot here the histograms of the Hamming distances, Gap + , Gap − and Mismatches forthe protein families PF00684 (a), PF00763 (b) using as reference the HMMer results and as target the DCAlign results, andfor RF00059 (c), RF00167 (d) and RF01734 (e) using as reference the Infernal results and as target the DCAlign results.

7. Proximity measures plot

InfernalInfernal (a) (b)(c) (d)

FIG. 13.

Proximity measures.

Histograms of the minimum distances computed according to Eq. (32) for the full set ofaligned sequences obtained by DCAlign- potts , DCAlign- phmm , HMMer, and Infernal, against the seed. Panels (a), (b), (c)and (d) refer to the families PF00035, PF00763, RF00059, RF00167 respectively.

8. Contact maps (a)(a

Infernal Infernal (a) (b) (c) (d)

InfernalInfernalInfernal Infernal

FIG. 14.