Counting, generating and sampling tree alignments
Cedric Chauve, Julien Courtiel, and Yann Ponty
Department of Mathematics, Simon Fraser University; Pacific Institute for the Mathematical Sciences; CNRS-LIX, Ecole Polytechnique
Abstract.
Pairwise ordered tree alignments are combinatorial objects that appear in RNA secondary structure comparison. However, the usual representation of tree alignments as supertrees is ambiguous, i.e. two distinct supertrees may induce identical sets of matches between identical pairs of trees. This ambiguity is uninformative, and detrimental to any probabilistic analysis.

In this work, we consider tree alignments up to equivalence. Our first result is a precise asymptotic enumeration of tree alignments, obtained from a context-free grammar by means of basic analytic combinatorics. Our second result focuses on alignments between two given ordered trees $S$ and $T$. By refining our grammar to align specific trees, we obtain a decomposition scheme for the space of alignments, and use it to design an efficient dynamic programming algorithm for sampling alignments under the Gibbs-Boltzmann probability distribution. This generalizes existing tree alignment algorithms, and opens the door to a probabilistic analysis of the space of suboptimal RNA secondary structure alignments.

Tree alignments are the natural analog of sequence alignments, and were introduced by Jiang, Wang and Zhang [9] to model and quantify the similarity between two (ordered) trees. (In this work, unless explicitly specified, all trees are rooted and ordered.) Initially proposed as an alternative to tree-edit distance, the tree alignment model has proven more robust: it allows for the inclusion of complex local operations [2], and generalizes to multiple input trees [8]. Consequently, tree alignment has been used in a wide array of applications, especially in RNA bioinformatics [7], where alignments of RNA secondary structures can be encoded by tree alignments. The minimal cost tree alignment between two trees of sizes $n_1$ and $n_2$, under the classic insertion/deletion/(mis)match operations, can be computed using dynamic programming (DP). The current best algorithms have worst-case time and space complexities in $O(n_1 n_2 (n_1 + n_2)^2)$ and $O(n_1 n_2 (n_1 + n_2))$ respectively [9], and average-case time and space complexities (on uniformly drawn instances) in $O(n_1 n_2)$ [6].

In the context of sequence alignments, the enumeration of alignments has been the object of much interest in Computational Biology [4,12,1]. Alignments between two sequences over an alphabet $\Sigma$ can be encoded as sequences over an extended alphabet $\Sigma_a$, representing insertions, deletions and (mis)matches (e.g. $\Sigma = \{a, b\}$, $\Sigma_a = \{(a,-), (-,b), (a,b), (a,a), (b,a), (b,b)\}$). Many sequences over $\Sigma_a$ are equivalent if one considers only the (mis)matches of the alignments, i.e. they align sequences of the same lengths and induce the same sets of matched positions (e.g. $(a,-),(-,b)$ and $(-,b),(a,-)$). It is a natural problem to enumerate distinct sequence alignments for two sequences of cumulated length $n$ [14, pp. 188]. Beyond purely theoretical considerations, the decompositions introduced for enumerating distinct sequence alignments were adapted into DP algorithms, e.g. for probabilistic alignment based on expectation maximization [3], or to compute Gibbs-Boltzmann measures of reliability [13].

In the present work, we consider similar questions on tree alignments. We are first interested in counting distinct tree alignments, i.e. enumerating, up to equivalence, ordered trees whose vertices are labeled in $\Sigma_a$ (called supertrees from now on). For trees, the notion of equivalence of alignments generalizes that of sequence alignments, i.e. two alignments are equivalent when they align the same pairs of trees, and induce the same sets of (mis)matched positions. Unfortunately, in contrast with the case of sequence alignments, existing DP algorithms for computing an optimal tree alignment [9,2,11] cannot be easily adapted into enumeration schemes for tree alignments up to equivalence. This additional difficulty is due to the existence of ambiguities of different natures.

Our main contribution is a grammar for (distinct) tree alignments, which provably generates a single representative for each equivalence class. We use the symbolic method [5] to obtain the generating function of tree alignments, from which asymptotic equivalents for various statistics of interest can easily be derived, such as the average number of alignments over trees of total size $n$. Finally, and perhaps more importantly from an applied point of view, the grammar can be transformed into an unambiguous and complete DP algorithm for aligning two input trees. The resulting algorithm has the same asymptotic worst-case and average-case complexities, up to reasonable constants, as the current best (ambiguous) algorithm [9,2]. The main interest of such an algorithm is that it immediately opens the way to new applications for the tree alignment model, including a critical assessment of the reliability of optimal alignments, obtained either by counting co-optimal alignments, or by sampling suboptimal alignments according to a Gibbs-Boltzmann distribution (see [10] for an example of this approach for the RNA folding problem).

In Section 2 we introduce the main definitions about trees, supertrees and tree alignments. In Section 3, we provide a grammar that generates all tree alignments. In Section 4.1 we analyze this grammar from an enumerative point of view and give precise results on the number of alignments of fixed size. Finally, in Section 4.2 we show how to transform the tree alignment grammar into a dynamic programming algorithm to sample tree alignments between two specified trees.

Definitions
Trees and supertrees.
Let $\Sigma$ be an alphabet. A tree $T$ on $\Sigma$ is a rooted plane tree whose vertices are labeled by elements of $\Sigma$. We denote by $V_T$ the set of vertices of $T$. We remove a non-root vertex $v$ from a tree $T$ by contracting the edge between $v$ and its parent $u$, which keeps its label. Removing the root $r$ of a tree consists in creating a forest composed of the subtrees rooted at the children of $r$. We denote the operation of removing a vertex $v$ from $T$ by $T - v$.

We denote by $\Sigma_a$ the alphabet defined by $\Sigma_a = (\Sigma \cup \{-\})^2 - \{(-,-)\}$. An element $(x, y) \in \Sigma_a$ is an insertion (resp. deletion, match) if $y = -$ (resp. $x = -$, $(x, y) \in \Sigma^2$). A supertree $A$ is a tree on $\Sigma_a$; a vertex of $A$ is an insertion (resp. deletion, match) if its label is an insertion (resp. deletion, match). The size of a supertree $A$ is the number of its insertions and deletions, plus twice the number of its matches. A superforest is an ordered sequence of supertrees.

Given a supertree $A$, we define two forests $\pi_1(A)$ and $\pi_2(A)$ as follows: $\pi_1(A)$ (resp. $\pi_2(A)$) is obtained by (1) iteratively removing all deletions (resp. insertions) of $A$, in an arbitrary order, and (2) replacing the label $(x, y)$ of each remaining vertex by $x$ (resp. $y$). We refer to Fig. 1 for an illustration. We extend the notations $\pi_1$ and $\pi_2$ to vertices: for a non-deletion (resp. non-insertion) vertex $v$ of $A$, we denote by $\pi_1(v)$ (resp. $\pi_2(v)$) the corresponding vertex in $\pi_1(A)$ (resp. $\pi_2(A)$). A vertex $x$ of $\pi_1(A)$ such that $\pi_1^{-1}(x)$ is an insertion (resp. match) is said to be inserted (resp. matched) in $A$. Similarly, a vertex $y$ of $\pi_2(A)$ such that $\pi_2^{-1}(y)$ is a deletion (resp. match) is said to be deleted (resp. matched) in $A$.

Tree alignments.
As the forests $\pi_1(A)$ and $\pi_2(A)$ are embedded into the supertree $A$, the latter implicitly defines an alignment between $\pi_1(A)$ and $\pi_2(A)$, i.e. a set of correspondences between vertices of $\pi_1(A)$ and $\pi_2(A)$ that is consistent with the structure of both forests [9]. We refer to Fig. 1 for an illustration.

Fig. 1. A supertree $A$ with alphabet $\Sigma = \{A, C, G, U\}$, and the associated trees $S = \pi_1(A)$ and $T = \pi_2(A)$. The alignment of $S$ and $T$ defined by $A$ is composed of two pairs of matched vertices, $(A, A)$ and $(U, A)$, indicated by dashed arrows.

We now turn to the central notion of equivalent alignments, i.e. alignments of identical pairs of trees that contain exactly the same set of matched vertices. Given a supertree $A$, representing an alignment between two trees $S = \pi_1(A)$ and $T = \pi_2(A)$, the set of matches of $A$ is formed by the elements $(x, y)$ of $V_S \times V_T$ such that $\pi_1^{-1}(x) = \pi_2^{-1}(y)$ (i.e. there exists a vertex $v$ of $A$ such that $\pi_1(v) = x$ and $\pi_2(v) = y$). Two supertrees $A_1$ and $A_2$ are equivalent if $\pi_1(A_1) = \pi_1(A_2)$, $\pi_2(A_1) = \pi_2(A_2)$, and the sets of matches of $A_1$ and $A_2$ are identical (see Fig. 2 for an illustration).

Fig. 2.
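Two non-equivalent supertrees, representing two different tree alignments. However, the supertree $A$ from Fig. 1 is equivalent to one of them.

A tree alignment is then defined as an equivalence class over supertrees with respect to the above-defined equivalence relation, for which $\pi_1(A)$ and $\pi_2(A)$ are trees. The notion of forest alignment is similarly defined when $\pi_1(A)$ and $\pi_2(A)$ are not restricted to trees. Given a set $\mathcal{S}$ of tree (resp. forest) alignments, a set $\mathcal{T}$ of supertrees (resp. superforests) is said to be representative of $\mathcal{S}$ if it contains exactly one supertree (resp. superforest) for each alignment (i.e. each equivalence class of supertrees or superforests) in $\mathcal{S}$. Tree alignments will now be the focus of our work.

In this section, we describe a context-free grammar for a set $\mathcal{A}$ of supertrees that is representative of the set of all tree alignments.

We first define some basic operations on supertrees and superforests:
– The (ordered) concatenation of two (super)forests $A$ and $B$ is denoted by $A \circ B$. It creates a new superforest beginning with the supertrees of $A$, and ending with the supertrees of $B$.
– Given two disjoint sets $\mathcal{T}_1$ and $\mathcal{T}_2$ of supertrees or superforests, we denote by $\mathcal{T}_1 \oplus \mathcal{T}_2$ their (disjoint) union.
– For any superforest $A$ and $a, b \in \Sigma$, $\mathrm{InsRoot}(A, a)$ (resp. $\mathrm{DelRoot}(A, b)$, $\mathrm{MatchRoot}(A, a, b)$) denotes the supertree whose root is the vertex $(a, -)$ (resp. $(-, b)$, $(a, b)$) and whose children are the supertrees in $A$, in the same order as in $A$.
– We naturally extend these operators to a set $\mathcal{T}$ of supertrees or superforests: $\mathrm{InsRoot}(\mathcal{T}) = \bigoplus_{A \in \mathcal{T},\, a \in \Sigma} \mathrm{InsRoot}(A, a)$, $\mathrm{DelRoot}(\mathcal{T}) = \bigoplus_{A \in \mathcal{T},\, a \in \Sigma} \mathrm{DelRoot}(A, a)$, $\mathrm{MatchRoot}(\mathcal{T}) = \bigoplus_{A \in \mathcal{T},\, (a,b) \in \Sigma^2} \mathrm{MatchRoot}(A, a, b)$.

Our grammar is described in Fig. 3, and illustrated in Fig. 4.

\[
\begin{aligned}
\mathcal{A} &= \mathcal{V}_{\emptyset} \oplus \mathcal{T}_I \oplus \mathcal{T}_D \oplus \mathrm{InsRoot}(\mathcal{F}_I \circ \mathcal{T}_D) && (1)\\
\mathcal{T}_I &= \mathrm{InsRoot}(\mathcal{F}_I), \qquad \mathcal{F}_I = \{\text{empty superforest}\} \oplus \mathrm{InsRoot}(\mathcal{F}_I) \circ \mathcal{F}_I && (2)\\
\mathcal{T}_D &= \mathrm{DelRoot}(\mathcal{F}_D), \qquad \mathcal{F}_D = \{\text{empty superforest}\} \oplus \mathrm{DelRoot}(\mathcal{F}_D) \circ \mathcal{F}_D && (3)\\
\mathcal{V}_{\emptyset} &= \mathcal{V}_{\uparrow} \oplus \mathrm{InsRoot}(\mathcal{VH}) && (4)\\
\mathcal{V}_{\uparrow} &= \mathrm{MatchRoot}\left(\mathcal{H}_{I|D,\emptyset,\emptyset}\right) \oplus \mathrm{DelRoot}\left(\mathcal{F}_D \circ \mathcal{V}_{\uparrow} \circ \mathcal{F}_D\right) && (5)\\
\mathcal{VH} &= \mathcal{T}_I \circ \mathcal{VH} \oplus \mathcal{V}_{\emptyset} \circ \mathcal{F}_I \oplus \mathrm{DelRoot}\left(\mathcal{H}_{I|D,\leftrightarrow,\emptyset}\right) \circ \mathcal{F}_I && (6)
\end{aligned}
\]

For every $\nu, M, M'$ with $\nu \in \{I|D, D\}$ and $M, M' \in \{\emptyset, \leftrightarrow, \rightarrow\}$:

\[
\mathcal{H}_{\nu,M,M'} = \bigoplus \left\{
\begin{array}{ll}
\{\text{empty superforest}\} & \text{if } (M, M') = (\emptyset, \emptyset)\\
\mathcal{T}_I \circ \mathcal{H}_{\nu,M,M'} & \text{if } \nu \neq D \text{ and } M \neq\, \leftrightarrow\\
\mathcal{T}_D \circ \mathcal{H}_{D,M,M'} & \text{if } M' \neq\, \leftrightarrow\\
\mathcal{V}_{\emptyset} \circ \mathcal{H}^{1,1}_{M,M'} &\\
\mathrm{InsRoot}\left(\mathcal{H}_{I|D,\emptyset,\leftrightarrow}\right) \circ \mathcal{H}^{1,+}_{M,M'} &\\
\mathrm{DelRoot}\left(\mathcal{H}_{D,\leftrightarrow,\emptyset}\right) \circ \mathcal{H}^{+,1}_{M,M'} &
\end{array}
\right. \qquad (7)
\]

For every $M, M' \in \{\emptyset, \leftrightarrow, \rightarrow\}$ and $i, j \in \{1, +\}$:

\[
\mathcal{H}^{i,j}_{M,M'} = \mathcal{H}_{I|D,\alpha(M),\alpha(M')} \oplus
\begin{cases}
\mathcal{F}_I & \text{if } M = \emptyset \text{ and } M' =\, \rightarrow\\
\mathcal{F}_I & \text{if } M = \emptyset,\ M' =\, \leftrightarrow \text{ and } j = +\\
\mathcal{F}_D & \text{if } M =\, \rightarrow \text{ and } M' = \emptyset\\
\mathcal{F}_D & \text{if } M =\, \leftrightarrow,\ M' = \emptyset \text{ and } i = +\\
\emptyset & \text{otherwise}
\end{cases} \qquad (8)
\]

where $\alpha(\emptyset) = \emptyset$ and $\alpha(\leftrightarrow) = \alpha(\rightarrow) =\, \rightarrow$.

Fig. 3. A context-free grammar for $\mathcal{A}$, a representative set of all tree alignments.

Theorem 1.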
The set of supertrees $\mathcal{A}$ generated by the grammar (1)-(8) is representative of the set of all tree alignments; i.e. $\mathcal{A}$ contains exactly one supertree for each equivalence class of supertrees.

Fig. 4. A schematic illustration of the grammar for tree alignments.

The key ingredient to prove Theorem 1 stems from the following (semantic) properties of the classes of supertrees and forests that appear in the grammar:

1. Supertrees in $\mathcal{T}_I$ (resp. $\mathcal{T}_D$) contain only insertion (resp. deletion) vertices.
2. $\mathcal{F}_I$ (resp. $\mathcal{F}_D$) is the set of superforests formed by supertrees of $\mathcal{T}_I$ (resp. $\mathcal{T}_D$).
3. For $\mu \in \{\emptyset, \uparrow\}$, $\mathcal{V}_\mu$ is representative of the set of alignments $A$ with at least one match, such that, if $\mu = \uparrow$, then the root of $\pi_1(A)$ is matched.
4. $\mathcal{VH}$ is representative of the set of forest alignments $A$ with at least one match, such that $\pi_2(A)$ is a tree.
5. For $\nu \in \{I|D, D\}$ and $(M, M') \in \{\emptyset, \leftrightarrow, \rightarrow\}^2$, $\mathcal{H}_{\nu,M,M'}$ is representative of the set of superforests $A$ such that
 – if $\pi_1(A) \neq \emptyset$ and $\nu = D$, then the first tree of $\pi_1(A)$ is matched in $A$;
 – if $M = \rightarrow$, then the last tree of $\pi_1(A)$ is matched in $A$ (so $\pi_1(A) \neq \emptyset$);
 – if $M' = \rightarrow$, then the last tree of $\pi_2(A)$ is matched in $A$ (so $\pi_2(A) \neq \emptyset$);
 – if $M = \leftrightarrow$, then the first and last trees in $\pi_1(A)$ are matched in $A$ (so $\pi_1(A)$ has at least two trees);
 – if $M' = \leftrightarrow$, then the first and last trees in $\pi_2(A)$ are matched in $A$ (so $\pi_2(A)$ has at least two trees).
6. For $i, j \in \{1, +\}$, $\mathcal{H}^{i,j}_{M,M'}$ is representative of superforests $A'$ such that
 – there exists a superforest $A$ such that $A \circ A' \in \mathcal{H}_{D,M,M'}$;
 – if $i = 1$ (resp. $+$), $\pi_1(A)$ is a tree (resp. a forest with at least two trees);
 – if $j = 1$ (resp. $+$), $\pi_2(A)$ is a tree (resp. a forest with at least two trees).

These properties can be verified recursively through a tedious analysis of the grammar, and imply quite straightforwardly that $\mathcal{A}$ contains exactly one supertree per equivalence class of supertrees.

Remark 1. For sequence alignments, a grammar generating a representative set of sequence alignments can easily be adapted from the grammar generating all sequences over $\Sigma_a$, e.g. by preventing any deletion from immediately preceding an insertion.
In the case of trees, the two-dimensional nature of the objects seems to forbid such a simple characterization, and seems to intrinsically mandate intricate combinatorial constructs/grammars. Note, however, that our grammar, while complex, remains amenable to efficient computations (Section 4).

For the sake of simplicity, we restrict our attention to $|\Sigma| = 1$, i.e. the alphabet is restricted to a single letter. The general case follows easily, and will be described in an extended version of the paper.

For a family $\mathcal{F}$ of superforests, we define a bivariate ordinary generating function
\[
F(t, z) = \sum_{n \geq 0,\, k \geq 0} f_{n,k}\, t^n z^k
\]
where $f_{n,k}$ is the number of superforests in $\mathcal{F}$ of size $n$ with $k$ matches.

Using the symbolic method [5], one classically translates the specification described by Eqs. (1)-(8) into a system of functional equations relating the generating functions of the sets of supertrees and forests. To that purpose, classes of objects are replaced by their generating functions, disjoint unions (resp. concatenations) of two sets of supertrees are replaced by additions (resp. multiplications) of their generating functions, the addition of a root translates into a multiplication by a monomial $t^2 z$ (resp. $t$) if the root represents a match (resp. insertion/deletion), and empty superforests and empty sets translate into $1$ and $0$ respectively. The grammar is context-free, so the resulting system is algebraic and can be solved to yield the following characterization result.

Theorem 2.
The generating functions $T(t, z)$ and $F(t, z)$ of tree and forest alignments, whose size and number of matches are marked by $t$ and $z$ respectively, satisfy

T ( t, z ) = ( t + t − t z + t √ − t ) F ( t, z ) , (9)

( tzC ( t ) − t C ( t ) +2 t ) F ( t, z ) +( t C ( t ) − tC ( t ) − F ( t, z )+ C ( t ) = 0 , (10)

where $C(t) = (1 - \sqrt{1 - 4t})/(2t)$ is the generating function of Catalan numbers.

Solving the quadratic equation (10) leads to an explicit formula for $F$ (and hence $T$), details of which are omitted due to space constraints. Nonetheless, these explicit expressions can be used to compute an asymptotic estimate using a transfer theorem [5, Cor. VI.1 p. 392].

Theorem 3. The number of tree alignments of size $n$ is asymptotically equivalent to κ × n − / × n , where κ = √ − √ / (24 √ π ) .

Corollary 1. The average number of tree alignments for a random pair of trees of cumulated size $n$ is κ′ × . n , where κ′ = √ − √ / .

Similar techniques can be used to characterize the distribution of the number of matches in a random tree alignment. A direct application of [5, Theorem IX.12 p. 676] indeed gives the following.

Proposition 2. Let $m_n$ be the random variable that counts the number of matches in a uniformly drawn random tree alignment. The variable $m_n$ follows a Normal law of mean E ( m n ) ∼ n/ and variance V ( m n ) ∼ n/ .

We now consider two fixed trees $S$ and $T$, and consider the task of sampling a tree alignment $A$ such that $\pi_1(A) = S$ and $\pi_2(A) = T$, with respect to the Gibbs-Boltzmann probability distribution. This can be used to assess the stability of a prediction. We refer the interested reader to our introduction for further motivation and possible applications.

Preliminaries.
Let $\mathcal{T}_{S,T}$ be the set of all supertrees $A$ such that $\pi_1(A) = S$ and $\pi_2(A) = T$, and $\mathcal{A}_{S,T}$ be a representative set of $\mathcal{T}_{S,T}$. In other words, $\mathcal{A}_{S,T}$ can be interpreted as the set of all alignments between $S$ and $T$. For any supertree $A \in \mathcal{T}_{S,T}$, we define its edit score $s(A)$ as the sum of the numbers of insertions, deletions, and matches $(x, y)$ such that $x \neq y$. For a given positive constant $k\theta$, the partition function $Z_{S,T}$ of $\mathcal{A}_{S,T}$ and the Gibbs-Boltzmann probability $\Pr(A)$ of an alignment $A \in \mathcal{A}_{S,T}$ are defined as
\[
Z_{S,T} = \sum_{A \in \mathcal{A}_{S,T}} e^{-s(A)/k\theta}, \qquad \Pr(A) = \frac{e^{-s(A)/k\theta}}{Z_{S,T}}.
\]
When $k\theta$ tends to $0$, this distribution tends to the uniform distribution over supertrees of minimum edit score, while, when $k\theta$ tends to $+\infty$, it tends toward the uniform distribution over $\mathcal{A}_{S,T}$.

We consider the following problem: given two trees $S$ and $T$, and a positive constant $k\theta$, design a sampling algorithm for alignments between $S$ and $T$ under the Gibbs-Boltzmann probability distribution. This problem generalizes the classic combinatorial optimization problem of computing a tree alignment between $S$ and $T$ having minimum edit score. (The present results can be trivially extended to any edit scoring system that is a positive linear combination of the numbers of insertions, deletions and matches.)

To address this problem, we rely on dynamic programming, following the approach described, among others, in [10] for RNA folding. We begin by adapting the grammar introduced in Section 3 into a grammar for $\mathcal{A}_{S,T}$, then detail how this grammar leads to an efficient sampling algorithm.
A grammar for $\mathcal{A}_{S,T}$. In order to guarantee that each supertree $A$ indeed aligns the two input trees $S$ and $T$ (namely $\pi_1(A) = S$ and $\pi_2(A) = T$), we need to restrict which rules of the grammar can be used, depending on which trees and forests are currently being generated. To that purpose, we introduce, for each set $\mathcal{S}$ in the previous grammar, an indexed version $\mathcal{S}[u, v]$ which denotes the restriction of $\mathcal{S}$ to alignments between two forests $u$ and $v$, taken in $S$ and $T$ respectively.

Slightly abusing previous notations, we denote by $a(u)$ the tree whose root is a vertex $a$ and whose (forest of) children is $u$. Finally, for every tree/forest $X$, $\mathrm{Ins}(X)$ (resp. $\mathrm{Del}(X)$) represents the supertree/superforest obtained from $X$ by inserting (resp. deleting) each of its elements. If $X$ is empty, $\mathrm{Ins}(X)$ and $\mathrm{Del}(X)$ denote the empty superforest. The grammar for $\mathcal{A}_{S,T}$ is described in Fig. 5.
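The notations $a(u)$, $\mathrm{Ins}(X)$ and $\mathrm{Del}(X)$ can be made concrete with a minimal sketch. It assumes a hypothetical encoding in which a tree is a pair `(label, children)`, a forest is a list of trees, and the gap symbol is the string `"-"`; the projection reads the first (resp. second) component of each label and contracts the vertices carrying a gap on that side. The helper names are illustrative and not taken from the paper.

```python
GAP = "-"  # assumed encoding of the gap symbol of the extended alphabet

def ins_forest(forest):
    """Ins(X): every vertex a of X becomes the insertion (a, -)."""
    return [((a, GAP), ins_forest(children)) for a, children in forest]

def del_forest(forest):
    """Del(X): every vertex b of X becomes the deletion (-, b)."""
    return [((GAP, b), del_forest(children)) for b, children in forest]

def project(superforest, side):
    """Projection on component `side` (0 for pi_1, 1 for pi_2)."""
    out = []
    for label, children in superforest:
        kept = project(children, side)
        if label[side] == GAP:
            out.extend(kept)  # removed vertex: its children take its place
        else:
            out.append((label[side], kept))
    return out

def match_free_alignment(s, t):
    """Supertree with no match: root of S inserted, then Ins(X_S) and Del(T)."""
    root, children = s  # S = r_S(X_S)
    return ((root, GAP), ins_forest(children) + del_forest([t]))

S = ("A", [("C", []), ("G", [])])
T = ("U", [("A", [])])
A = match_free_alignment(S, T)
print(project([A], 0) == [S], project([A], 1) == [T])
```

The function `match_free_alignment` mirrors the second term of rule (11) in Fig. 5, and the final line checks that both projections recover the input trees.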
Theorem 4. Let $S$ and $T$ be non-empty trees. The set of supertrees $\mathcal{A}_{S,T}$ generated by grammar (11)-(18) is representative of $\mathcal{T}_{S,T}$, i.e. of the tree alignments between $S$ and $T$.

\[
\begin{aligned}
\mathcal{A}_{S,T} &= \mathcal{V}_{\emptyset}[S, T] \oplus \mathrm{InsRoot}\left(\mathrm{Ins}(X_S) \circ \mathrm{Del}(T),\, r_S\right), \quad \text{where } S \equiv r_S(X_S) && (11)\\
\mathcal{V}_{\emptyset}[a(u), b(v)] &= \mathcal{V}_{\uparrow}[a(u), b(v)] \oplus \mathrm{InsRoot}\left(\mathcal{VH}[u, b(v)],\, a\right) && (12)\\
\mathcal{V}_{\uparrow}[a(u), b(v)] &= \mathrm{MatchRoot}\left(\mathcal{H}_{I|D,\emptyset,\emptyset}[u, v],\, a,\, b\right) \oplus \bigoplus_{Y \circ c(w) \circ Y' = v} \mathrm{DelRoot}\left(\mathrm{Del}(Y) \circ \mathcal{V}_{\uparrow}[a(u), c(w)] \circ \mathrm{Del}(Y'),\, b\right) && (13)\\
\mathcal{VH}[\emptyset, b(v)] &= \emptyset && (14)\\
\mathcal{VH}[a(u) \circ X, b(v)] &= \mathrm{Ins}(a(u)) \circ \mathcal{VH}[X, b(v)] \oplus \bigoplus_{\substack{X' \circ X'' = a(u) \circ X\\ |X'| \geq 2}} \mathrm{DelRoot}\left(\mathcal{H}_{I|D,\leftrightarrow,\emptyset}[X', v],\, b\right) \circ \mathrm{Ins}(X'') \oplus \mathcal{V}_{\emptyset}[a(u), b(v)] \circ \mathrm{Ins}(X) && (15)
\end{aligned}
\]

For every $\nu, M, M'$ with $\nu \in \{I|D, D\}$ and $M, M' \in \{\emptyset, \leftrightarrow, \rightarrow\}$:

\[
\mathcal{H}_{\nu,M,M'}[X, \emptyset] = \begin{cases} \mathrm{Ins}(X) & \text{if } (M, M') = (\emptyset, \emptyset),\\ \emptyset & \text{otherwise,} \end{cases} \qquad (16)
\]
\[
\mathcal{H}_{\nu,M,M'}[\emptyset, Y] = \begin{cases} \mathrm{Del}(Y) & \text{if } (M, M') = (\emptyset, \emptyset),\\ \emptyset & \text{otherwise,} \end{cases} \qquad (17)
\]
\[
\mathcal{H}_{\nu,M,M'}[a(u) \circ X, b(v) \circ Y] = \bigoplus \left\{
\begin{array}{ll}
\mathrm{Ins}(a(u)) \circ \mathcal{H}_{\nu,M,M'}[X, b(v) \circ Y] & \text{if } \nu \neq D \text{ and } M \neq\, \leftrightarrow,\\
\mathrm{Del}(b(v)) \circ \mathcal{H}_{D,M,M'}[a(u) \circ X, Y] & \text{if } M' \neq\, \leftrightarrow,\\
\mathcal{V}_{\emptyset}[a(u), b(v)] \circ \mathcal{H}_{I|D,\alpha(M,X),\alpha(M',Y)}[X, Y] &\\
\bigoplus_{\substack{Y' \circ Y'' = b(v) \circ Y\\ |Y'| \geq 2}} \mathrm{InsRoot}\left(\mathcal{H}_{I|D,\emptyset,\leftrightarrow}[u, Y'],\, a\right) \circ \mathcal{H}_{I|D,\alpha(M,X),\alpha(M',Y'')}[X, Y''] &\\
\bigoplus_{\substack{X' \circ X'' = a(u) \circ X\\ |X'| \geq 2}} \mathrm{DelRoot}\left(\mathcal{H}_{D,\leftrightarrow,\emptyset}[X', v],\, b\right) \circ \mathcal{H}_{I|D,\alpha(M,X''),\alpha(M',Y)}[X'', Y] &
\end{array}
\right. \qquad (18)
\]

where $\alpha(\emptyset, X) = \emptyset$ and $\alpha(\leftrightarrow, X) = \alpha(\rightarrow, X) = \emptyset$ if $X = \emptyset$, and $\rightarrow$ otherwise.

Fig. 5. A grammar for $\mathcal{A}_{S,T}$, a representative set of all tree alignments between two fixed trees $S$ and $T$.

Applications to dynamic programming. The grammar defined by Equations (11)-(18) is a decomposition scheme for the alignments between $S$ and $T$. It can easily be transformed into an algorithm for computing the partition function $Z_{S,T}$. Indeed, $Z_{S,T}$ is simply a weighted sum over all possible supertrees of $\mathcal{A}_{S,T}$, which is a set generated by the grammar. Now consider the image of the grammar as a set of numerical equations, obtained by syntactically replacing:
– The operators $(\oplus, \circ)$ with $(\sum, \times)$ respectively;
– The empty set $\emptyset$ with $0$;
– Inserted/deleted trees/forests $\mathrm{Ins}(X)$ and $\mathrm{Del}(X)$ with $e^{-|X|/k\theta}$;
– Match events $\mathrm{MatchRoot}(V, a, a)$ with $V$, for all $a \in \Sigma$ and any expression $V$;
– Insertion events $\mathrm{InsRoot}(V, a)$, deletion events $\mathrm{DelRoot}(V, a)$, and mismatch events $\mathrm{MatchRoot}(V, a, b)$ with $e^{-1/k\theta} \times V$, for all $a \neq b \in \Sigma$ and any $V$.

Theorem 4 immediately implies that the resulting set of equations is a dynamic programming scheme that computes $Z_{S,T}$ instead of $\mathcal{A}_{S,T}$.

Moreover, each non-terminal term of the modified grammar now contains the partition function of the set of supertrees associated with this non-terminal term in the set-theoretic grammar, e.g. a term $\mathcal{VH}[a(u) \circ X, b(v)]$. This information can then be used to define an algorithm to sample supertrees from $\mathcal{A}_{S,T}$ under the Gibbs-Boltzmann distribution, following the recursive method for random generation [15].

To do so, it suffices to reinterpret the grammar defined by Equations (11)-(18) as a branching process: each $\oplus$ operator is replaced by a branching operator that, instead of joining sets of supertrees into a larger set of supertrees, chooses one of the sets according to the weight of its partition function. For instance, assume we have a grammar rule $U = V \oplus W$: the sampling algorithm will select one of the sets $V, W$, with $V$ being chosen with probability $Z_V / (Z_V + Z_W)$, and $W$ with probability $Z_W / (Z_V + Z_W)$, provided that $Z_V$ and $Z_W$ have been previously computed. Recursive calls then result in a supertree, which is provably generated under the Gibbs-Boltzmann distribution.

Theorem 5. Let $S$ and $T$ be two trees of respective sizes $n_S$ and $n_T$. The above-defined branching process adapted from grammar (11)-(18) defines an algorithm that samples a supertree from $\mathcal{A}_{S,T}$ under the Gibbs-Boltzmann distribution. The worst-case time and space complexities of the algorithm are in $O(n_S n_T (n_S + n_T)^2)$, while the average-case time and space complexities are in $O(n_S n_T)$.

The correctness of the algorithm immediately follows from Theorem 4. Its complexities are identical to those of [9,6] since the structure of the DP scheme essentially remains the same; only the number of DP tables is increased (by a constant factor). This implies that our algorithm, while solving a much more general problem, retains the same asymptotic complexity (up to constants) as the current tree alignment algorithms, which are limited to computing a single optimal tree alignment.
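The two-phase scheme (partition functions first, then weighted branching) can be sketched on the simpler case of sequence alignments mentioned in Remark 1. This is only an analogue under stated assumptions, not the tree algorithm of Fig. 5: the canonical form forbids a deletion immediately before an insertion, the scoring is the unit edit score used above, and the name `make_sampler` is hypothetical.

```python
import math
import random
from functools import lru_cache

def make_sampler(s, t, k_theta):
    """Return (Z, sample) for canonical alignments of the strings s and t."""
    w = math.exp(-1.0 / k_theta)  # weight of one insertion, deletion or mismatch
    m, n = len(s), len(t)

    @lru_cache(maxsize=None)
    def z(i, j, after_del):
        """Partition function of the canonical alignments of s[i:] and t[j:]."""
        if i == m and j == n:
            return 1.0
        return sum(weight for weight, _ in options(i, j, after_del))

    def options(i, j, after_del):
        """Weighted alternatives for the next column (one per branch of a rule)."""
        opts = []
        if i < m and not after_del:  # insertion (s[i], -), never right after a deletion
            opts.append((w * z(i + 1, j, False), ("ins", i + 1, j, False)))
        if j < n:                    # deletion (-, t[j])
            opts.append((w * z(i, j + 1, True), ("del", i, j + 1, True)))
        if i < m and j < n:          # match (weight 1) or mismatch (weight w)
            cost = 1.0 if s[i] == t[j] else w
            opts.append((cost * z(i + 1, j + 1, False), ("match", i + 1, j + 1, False)))
        return opts

    def sample(rng):
        """Stochastic backtrack: pick each branch proportionally to its weight."""
        i, j, after_del, columns = 0, 0, False, []
        while not (i == m and j == n):
            opts = options(i, j, after_del)
            moves = [move for _, move in opts]
            weights = [weight for weight, _ in opts]
            kind, ni, nj, nd = rng.choices(moves, weights=weights)[0]
            columns.append((s[i] if kind != "del" else "-",
                            t[j] if kind != "ins" else "-"))
            i, j, after_del = ni, nj, nd
        return columns

    return z(0, 0, False), sample

z_total, sample = make_sampler("ACG", "AU", k_theta=1.0)
print(z_total, sample(random.Random(0)))
```

Memoizing `z` plays the role of the DP tables, and `sample` replaces each disjoint union by a weighted random choice, as in the branching process described above. With a very large `k_theta` all weights approach 1, so `z_total` approaches the number of distinct alignments of the two sequences.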
Following a classical line of research in string algorithms, we introduced the notion of equivalence for tree alignments, and described a context-free grammar for a representative set of all possible alignments. We also showed how this grammar can be used to derive asymptotic properties of alignments, and to design an efficient dynamic programming sampling algorithm for alignments between two given trees.

From an applied point of view, our results make it possible to sample optimal, as well as suboptimal, tree alignments for a pair of given trees under the Gibbs-Boltzmann distribution; following the program outlined in [10], we are currently using this algorithm to revisit the alignment of RNA structures.

Our proposed grammar for tree alignments is more complex than the grammars used to generate a representative set of sequence alignments, although the dynamic programming schemes for computing optimal sequence and tree alignments are very similar. This is due to the fact that it is particularly hard to characterize a representative set of tree alignments (see Remark 1). It thus remains an open problem to design a representative set of tree alignments that would be amenable to enumeration using a simpler grammar. However, it is important to remark that, despite its apparent complexity, our grammar leads to algorithms with an asymptotic complexity of the same order as existing optimization algorithms.

From a theoretical point of view, we believe that tree alignments as defined in this work form an interesting combinatorial family whose properties deserve to be explored in depth. More generally, it would be interesting to characterize the conditions under which an instance-agnostic grammar, enumerating a search space, could be adapted into a decomposition for a specific instance. Such a theory, at the confluence of enumerative combinatorics and algorithmic design, could provide another principled way to design dynamic programming algorithms.

References
1. Andrade, H., Area, I., Nieto, J.J., Torres, A.: The number of reduced alignments between two DNA sequences. BMC Bioinformatics 15, 94 (2014), http://dx.doi.org/10.1186/1471-2105-15-94
2. Blin, G., Denise, A., Dulucq, S., Herrbach, C., Touzet, H.: Alignments of RNA structures. IEEE/ACM Trans. Comput. Biology Bioinform. 7(2), 309–322 (2010), http://doi.acm.org/10.1145/1791396.1791409
3. Do, C., Gross, S., Batzoglou, S.: Contralign: Discriminative training for protein sequence alignment. In: Apostolico, A., Guerra, C., Istrail, S., Pevzner, P., Waterman, M. (eds.) Research in Computational Molecular Biology, Lecture Notes in Computer Science, vol. 3909, pp. 160–174. Springer Berlin Heidelberg (2006)
4. Dress, A., Morgenstern, B., Stoye, J.: The number of standard and of effective multiple alignments. Applied Mathematics Letters 11(4), 43–49 (1998)
5. Flajolet, P., Sedgewick, R.: Analytic combinatorics. Cambridge University Press, Cambridge (2009)
6. Herrbach, C., Denise, A., Dulucq, S.: Average complexity of the Jiang-Wang-Zhang pairwise tree alignment algorithm and of a RNA secondary structure alignment algorithm. Theor. Comput. Sci. 411(26-28), 2423–2432 (2010), http://dx.doi.org/10.1016/j.tcs.2010.01.014
7. Höchsmann, M., Töller, T., Giegerich, R., Kurtz, S.: Local similarity in RNA secondary structures. Proc IEEE Comput Soc Bioinform Conf 2, 159–168 (2003)
8. Höchsmann, M., Voss, B., Giegerich, R.: Pure multiple RNA secondary structure alignments: A progressive profile approach. IEEE/ACM Trans. Comput. Biol. Bioinformatics 1(1), 53–62 (Jan 2004), http://dx.doi.org/10.1109/TCBB.2004.11
9. Jiang, T., Wang, L., Zhang, K.: Alignment of trees - an alternative to tree edit. Theor. Comput. Sci. 143(1), 137–148 (1995), http://dx.doi.org/10.1016/0304-3975(95)80029-9
10. Ponty, Y., Saule, C.: A combinatorial framework for designing (pseudoknotted) RNA algorithms. In: Przytycka, T.M., Sagot, M. (eds.) Algorithms in Bioinformatics - 11th International Workshop, WABI 2011, Saarbrücken, Germany, September 5-7, 2011. Proceedings. Lecture Notes in Computer Science, vol. 6833, pp. 250–269. Springer (2011), http://dx.doi.org/10.1007/978-3-642-23038-7_22
11. Schirmer, S., Giegerich, R.: Forest alignment with affine gaps and anchors, applied in RNA structure comparison. Theor. Comput. Sci. 483, 51–67 (2013), http://dx.doi.org/10.1016/j.tcs.2012.07.040
12. Torres, A., Cabada, A., Nieto, J.J.: An exact formula for the number of alignments between two DNA sequences. DNA Seq 14(6), 427–430 (Dec 2003)
13. Vingron, M., Argos, P.: Determination of reliable regions in protein sequence alignments. Protein Engineering 3(7), 565–569 (1990), http://peds.oxfordjournals.org/content/3/7/565.abstract