Approximate Search for Known Gene Clusters in New Genomes Using PQ-Trees
AApproximate Search for Known Gene Clusters inNew Genomes Using PQ-Trees
Galia R. Zimerman
Ben Gurion University of the Negev, [email protected]
Dina Svetlitsky
Ben Gurion University of the Negev, [email protected]
Meirav Zehavi
Ben Gurion University of the Negev, Israel [email protected] Michal Ziv-Ukelson
Ben Gurion University of the Negev, Israel [email protected] Abstract
We define a new problem in comparative genomics, denoted
PQ-Tree Search , that takes asinput a PQ-tree T representing the known gene orders of a gene cluster of interest, a gene-to-genesubstitution scoring function h , integer parameters d T and d S , and a new genome S . The objectiveis to identify in S approximate new instances of the gene cluster that could vary from the knowngene orders by genome rearrangements that are constrained by T , by gene substitutions that aregoverned by h , and by gene deletions and insertions that are bounded from above by d T and d S ,respectively. We prove that the PQ-Tree Search problem is NP -hard and propose a parameterizedalgorithm that solves the optimization variant of PQ-Tree Search in O ∗ (2 γ ) time, where γ is themaximum degree of a node in T and O ∗ is used to hide factors polynomial in the input size.The algorithm is implemented as a search tool, denoted PQFinder, and applied to search forinstances of chromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes.We report on 29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangementsare guided by the corresponding PQ-tree. One of these results, coding for a heavy metal effluxpump, is further analysed to exemplify how PQFinder can be harnessed to reveal interesting newstructural variants of known gene clusters. Availability
The code for the tool as well as all the data needed to reconstruct the results arepublicly available on GitHub ( github.com/GaliaZim/PQFinder ). Applied computing → Bioinformatics
Keywords and phrases
PQ-Tree, Gene Cluster, Efflux Pump
Supplement Material github.com/GaliaZim/PQFinder
Funding
The research of G.Z. was partially supported by the Planning and Budgeting Committee ofthe Council for Higher Education in Israel. The research of G.Z. and M.Z. was partially supportedby the Israel Science Foundation (grant no. 1176/18). The research of G.Z., D.S. and M.Z.U. waspartially supported by the Israel Science (grant no. 939/18).
Acknowledgements
Many thanks to Lev Gourevitach for his excellent implementation of a PQ-treebuilder. Corresponding authors. a r X i v : . [ q - b i o . GN ] J u l Approximate Search for Known Gene Clusters in New Genomes Using PQ-Trees
Recent advances in pyrosequencing techniques, combined with global efforts to study in-fectious diseases, yield huge and rapidly-growing databases of microbial genomes [38, 42].This big new data statistically empowers genomic-context based approaches to functionalanalysis: the biological principle underlying such analysis is that groups of genes that appeartogether consistently across many genomes often code for proteins that interact with oneanother, suggesting a common functional association. Thus, if the functional association andannotation of the clustered genes is already known in one (or more) of the genomes, thisinformation can be used to infer functional characterization of homologous genes that areclustered together in another genome.Groups of genes that are co-locally conserved across many genomes are denoted geneclusters . The locations of the group of genes comprising a gene cluster in the distinct genomesare denoted instances . Gene clusters in prokaryotic genomes often correspond to (one orseveral) operons; those are neighbouring genes that constitute a single unit of transcriptionand translation. However, the order of the genes in the distinct instances of a gene clustermay not be the same.The discovery (i.e. data-mining) of conserved gene clusters in a given set of genomes is awell studied problem [8, 21, 44]. However, with the rapid sequencing of prokaryotic genomesa new problem is inspired: Namely, given an already known gene cluster that was discoveredand studied in one genomic dataset, to identify all the instances of the gene cluster in a givennew genomic sequence.One exemplary application for this problem is the search for chromosomal gene clusters inplasmids. Plasmids are circular genetic elements that are harbored by prokaryotic cells wherethey replicate independently from the chromosome. They can be transferred horizontallyand vertically, and are considered a major driving force in prokaryotic evolution, providingmutation supply and constructing new operons with novel functions [28], for exampleantibiotic resistance [20]. This motivates biologists to search for chromosomal gene clustersin plasmids, and to study structural variations between the instances of the found geneclusters across the two distinct replicons. However, in addition to the fact that plasmidsevolve independently from chromosomes and in a more rapid pace [14], their sequencing,assembly and annotation involves a more noisy process [29].To accommodate all this, the proposed search approach should be an approximate one,sensitive enough to tolerate some amount of genome rearrangements: transpositions andinversions, missing and intruding genes, and classification of genes with similar function todistinct orthology groups due to sequence divergence or convergent evolution. Yet, for thesake of specificity and search efficiency, we consider confining the allowed variations by twotypes of biological knowledge: (1) bounding the allowed rearrangement events considered bythe search, based on some grammatical model trained specifically from the known gene ordersof the gene cluster, and (2) governing the gene-to-gene substitutions considered by the searchby combining sequence homology with functional-annotation based semantic similarity. (1) Bounding the allowed rearrangement events.
The PQ-tree [9] is a combina-torial data structure classically used to represent gene clusters [6]. A PQ-tree of a genecluster describes its hierarchical inner structure and the relations between instances of thecluster succinctly, aids in filtering meaningful from apparently meaningless clusters, and alsogives a natural and meaningful way of visualizing complex clusters. A PQ-tree is a rootedtree with three types of nodes:
P-nodes , Q-nodes and leaves. The children of a P-node canappear in any order, while the children of a Q-node must appear in either left-to-right or . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 3
Metabolism (1) (2)(3)
Transport (4)(5) E C D F G H I J K L MR E D C N G H I J L MC D E F G H I J K L M
F D C E G H I J K L M
E D C M L K J I H G F
Figure 1
A gene cluster containing most of the genes of the
PhnCDEFGHIJKLMNOP operon[25] and the corresponding PQ-tree. The
Phn operon encodes proteins that utilize phosphonateas a nutritional source of phosphorus in prokaryotes. The genes
PhnCDE encode a phosphonatetransporter, the genes
PhnGHIJKLM encode proteins responsible for the conversion of phosphonatesto phosphate, and the gene
PhnF encodes a regulator. (1)-(3).
The three distinct gene ordersfound among 47 chromosomal instances of the
P hn gene cluster. (4).
A PQ-tree representing the
P hn gene cluster, constructed from its three known gene orders shown in 1-3. (5).
An example ofa
P hn gene cluster instance identified by the PQ-tree shown in (4), and the one-to-one mappingbetween the leaves of the PQ-tree and the genes comprising the instance. The instance genes arerearranged differently from the gene orders shown in 1-3 and yet can be derived from the PQ-tree.In this mapping, gene F is substituted by gene R , gene N is an intruding gene (i.e., deleted fromthe instance string), and gene I is a missing gene (i.e., deleted from the PQ-tree). right-to-left order. (In the special case when a node has exactly two children, it does notmatter whether it is labeled as a P-node or a Q-node.) Booth and Lueker [9], who introducedthis data structure, were interested in representing a set of permutations over a set U , i.e.every member of U appears exactly once as a label of a leaf in the PQ-tree. We, on the otherhand, allow each member of U to appear as a label of a leaf in the tree any non-negativenumber of times. Therefore, we will henceforth use the term string rather than permutation when describing the gene orders derived from a given PQ-tree.An example of a PQ-tree is given in Figure 1. It represents a P hn gene cluster thatencodes proteins that utilize phosphonate as a nutritional source of phosphorus in prokaryotes[25]. The biological assumptions underlying the representation of gene clusters as PQ-treesis that operons evolve via progressive merging of sub-operons, where the most basic units inthis recursive operon assembly are colinearly conserved sub-operons [17]. In the case wherean operon is assembled from sub-operons that are colinearly dependent, the conserved geneorder could correspond, e.g., to the order in which the transcripts of these genes interact inthe metabolic pathway in which they are functionally associated [43]. Thus, transpositionevents shuffling the order of the genes within this sub-operon could reduce its fitness. Onthe other hand, inversion events, in which the genes participating in this sub-operon remaincolinearly ordered are accepted. This case is represented in the PQ-tree by a Q-node (markedwith a rectangle). In the case where an operon is assembled from sub-operons that are notcolinearly co-dependent, convergent evolution could yield various orders of the assembledcomponents [17]. This case is represented in the PQ-tree by a P-node (marked with a circle).Learning the internal topology properties of a gene cluster from its corresponding gene ordersand constructing a query PQ-tree accordingly, could empower the search to confine theallowed rearrangement operations so that colinear dependencies among genes and betweensub-operons are preserved. (2) Governing the gene-to-gene substitutions.
A prerequisite for gene clusterdiscovery is to determine how genes relate to each other across all the genomes in the dataset.In our experiment, genes are represented by their membership in Clusters of Orthologous
Approximate Search for Known Gene Clusters in New Genomes Using PQ-Trees
Groups (COGs) [37], where the sequence similarity of two genes belonging to the same COGserves as a proxy for homology. Despite low sequence similarity, genes belonging to twodifferent COGs could have a similar function, which would be reflected in the functionaldescription of the respective COGs. Using methods from natural language processing [31],we compute for each pair of functional descriptions a score reflecting their semantic similarity.Combining sequence and functional similarity could increase the sensitivity of the search andpromote the discovery of systems with related functions.
Our Contribution and Roadmap.
In this paper we define a new problem in comparativegenomics, denoted
PQ-Tree Search (in Section 2) , that takes as input a PQ-tree T (the query) representing the known gene orders of a gene cluster of interest, a gene-to-genesubstitution scoring function h , integer parameters d T and d S , and a new genome S (thetarget). The objective is to identify in S a new approximate instance of the gene cluster thatcould vary from the known gene orders by genome rearrangements that are constrained by T , by gene substitutions that are governed by h , and by gene deletions and insertions thatare bounded from above by d T and d S , respectively. We prove that PQ-Tree Search is NP -hard (Theorem 9 in Appendix A) .We define an optimization variant of PQ-Tree Search and propose an algorithm (inSection 3) that solves it in O ( nγd T d S ( m p · γ + m q )) time, where n is the length of S , m p and m q denote the number of P-nodes and Q-nodes in T , respectively, and γ denotesthe maximum degree of a node in T . In the same time and space complexities, we can alsoreport all approximate instances of T in S and not only the optimal one.The algorithm is implemented as a search tool, denoted PQFinder. The code for thetool as well as all the data needed to reconstruct the results are publicly available onGitHub ( github.com/GaliaZim/PQFinder ). The tool is applied to search for instances ofchromosomal gene clusters in plasmids, within a dataset of 1,487 prokaryotic genomes. Inour preliminary results (given in Section 5) , we report on 29 chromosomal gene clustersthat are rearranged in plasmids, where the rearrangements are guided by the correspondingPQ-tree. One of these results, coding for a heavy metal efflux pump, is further analysed toexemplify how PQFinder can be harnessed to reveal interesting new structural variants ofknown gene clusters. Previous Related Works.
Permutations on strings representing gene clusters have beenstudied earlier by [5, 15, 22, 32, 39]. PQ-trees were previously applied in physical mapping[2, 10], as well as to other comparative genomics problems [3, 7, 24].In Landau et al. [24] an algorithm was proposed for representation and detection of geneclusters in multiple genomes, using PQ-trees: the proposed algorithm computes a PQ-tree of k permutations of length n in O ( kn ) time, and it is proven that the computed PQ-tree is theone with a minimum number of possible rearrangements of its nodes while still representingall k permutations. In the same paper, the authors also present a general scheme to handlegene multiplicity and missing genes in permutations. For every character that appears a times in each of the k strings, the time complexity for the construction of the PQ-tree,according to the scheme in that paper, is multiplied by an O (( a !) k ) factor.Additional applications of PQ-trees to genomics were studied in [1, 4, 30], where PQ-treeswere considered to represent and reconstruct ancestral genomes.However, as far as we know, searching for approximate instances of a gene cluster that isrepresented as a PQ-tree, in a given new string, is a new computational problem. . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 5 Let Π be an NP-hard problem. In the framework of Parameterized Complexity, each instanceof Π is associated with a parameter k , and the goal is to confine the combinatorial explosionin the running time of an algorithm for Π to depend only on k . Formally, Π is fixed-parametertractable ( FPT ) if any instance ( I, k ) of Π is solvable in time f ( k ) · | I | O (1) , where f is anarbitrary computable function of k . Nowadays, Parameterized Complexity supplies a richtoolkit to design or refute the existence of FPT algorithms [11, 12, 16].
PQ-Tree: Representing the Pattern.
The possible reordering of the children nodes in aPQ-tree may create many equivalent PQ-trees. Booth and Lueker [9] defined two PQ-trees
T, T as equivalent (denoted T ≡ T ) if one tree can be obtained by legally reordering thenodes of the other; namely, randomly permuting the children of a P-node, and reversingthe children of a Q-node. To allow for deletions in the PQ-trees, a generalization of theirdefinition is given in Definition 1 below. Here, smoothing is a recursive process in which if bydeleting leaves from a tree, T , some internal node x of T is left without children, then x isalso deleted, but its deletion is not counted (i.e. only leaf deletions are counted). (cid:73) Definition 1 (Quasi-Equivalence Between PQ-Trees) . For any two PQ-trees, T and T , thePQ-tree T is quasi-equivalent to T with a limit d , denoted T (cid:23) d T , if T can be obtainedfrom T by (a) randomly permuting the children of some of the P-nodes of T , (b) reversingthe children of some of the Q-nodes of T , and (c) deleting up to d leaves from T and applyingthe corresponding smoothing. (The order of the operations does not matter.) Figure S5 shows two equivalent PQ-trees (Figure S5a, Figure S5b) that are each quasi-equivalent with d = 1 to the third PQ-tree (Figure S5c). The frontier of a PQ-tree T ,denoted F ( T ), is the sequence of labels on the leaves of T read from left to right. Forexample, the frontier of the PQ-tree in Figure 1 is CDEF M LKJIHG . It is interestingto consider the set of frontiers of all the equivalent PQ-trees, defined in [9] as consistentfrontiers and denoted by C ( T ) = { F ( T ) : T ≡ T } . Intuitively, C ( T ) is the set of all leaflabel sequences defined by the PQ-tree structure and obtained by legally reordering its nodes.Here, we generalize the consistent frontiers definition to allow a bounded number of deletionsfrom T , using quasi-equivalence. (cid:73) Definition 2 ( d -Bounded Quasi-Consistent Frontiers) . C d ( T ) = { F ( T ) : T (cid:23) d T } . clearly C ( T ) = C ( T ), and so in a setting where d = 0 the latter notation is used. For anode x of a PQ-tree T , the subtree of T rooted in x is denoted by T ( x ), the set of leaves in T ( x ) is denoted by leaves ( x ), and the span of x (denoted span ( x )) is defined as | leaves ( x ) | . PQ-Tree Search and Related Terminology.
An instance of the
PQ-Tree Search problemis a tuple (
T, S, h, d T , d S ), where T is a PQ-tree with m leaves, m p P-nodes, m q Q-nodesand every leaf x in T has a label label ( x ) ∈ Σ T ; S = σ . . . σ n ∈ Σ nS is a string of length n representing the input genome; d T ∈ N specifies the number of allowed deletions from T ; d S ∈ N specifies the number of allowed deletions from S ; and h is a boolean substitutionfunction , describing the possible substitutions between the leaf labels of T and the charactersof the given string, S . Formally, h is a function that receives a pair ( σ t , σ s ), where σ t ∈ Σ T is one of the labels on the leaves of T , and σ s ∈ Σ S is one of the characters of the givenstring, S , and returns T rue if σ t can be replaced with σ s , and F alse , otherwise. Consideringthe biological problem at hand, Σ T and Σ S are both sets of genes. For 1 ≤ i ≤ j ≤ n , Approximate Search for Known Gene Clusters in New Genomes Using PQ-Trees S = S [ i : j ] = σ i ...σ j is a substring of S beginning at index i and ending at index j . Thesubstring S is a prefix of S if S = S [1 : j ] and it is a suffix of S if S = S [ i : n ]. In addition,we denote σ i , the i th character of S , by S [ i ].The objective of PQ-Tree Search is to find a one-to-one mapping M between theleaves of T and the characters of a substring S of S , that comprises a set of pairs eachhaving one of three forms: the substitution form, ( x, σ s ( ‘ )), where x is a leaf in T , σ s ∈ Σ S , h ( label ( x ) , σ s ) = T rue and ‘ ∈ { , . . . ,n } is the index of the occurrence of σ s in S that ismapped to the leaf x ; the character deletion form, ( ε, σ s ( ‘ )), which marks the deletion ofthe character σ s ∈ Σ S from the index ‘ of S ; the leaf deletion form, ( x, ε ), which marks thedeletion of x , a leaf node of T .To account for the number of deletions of characters of S and leaves of T in M , thenumber of pairs in M of the form ( ε, σ ) are marked by del S ( M ) and the number of pairsin M of the form ( x, ε ) are marked by del T ( M ). Applying the substitutions defined in M to S resulting in the string S M is the process in which for every ( x, σ s ( ‘ )) ∈ M , thecharacter σ s at index ‘ of S is deleted if x = ε , and otherwise substituted by x . Thisprocess is demonstrated in Figure S6b. We say that S is derived from T under M with d T deletions from the tree and d S deletions from the string, if d T = del T ( M ), d S = del S ( M )and S M ∈ C d T ( T ). Thus, by definition, there is a PQ-tree T such that F ( T ) = S M and T (cid:23) d T T . Note that the deletions of the nodes in T to obtain the nodes in T are determinedby M . The conversion of T to T as defined by the derivation is illustrated in Figure S6a.The set of permutations and node deletions performed to obtain T from T together withthe substitutions and deletions from S specified by M is named the derivation µ of T to S .We also say that M yields the derivation µ .For a derivation µ of T to S = S [ s : e ], we give the following terms and notations(illustrated in Figure S6). The root of T is the node that µ derives or the root of the derivation and it is denoted by µ.v . For abbreviation, we say that µ is a derivation of µ.v . The substring S is the string that µ derives . We name s and e the start and end points of the derivationand denote them by µ.s and µ.e , respectively. The one-to-one mapping that yields µ isdenoted by µ.o . The number of deletions from the tree is denoted by µ.del T . The numberof deletions from the string is denoted by µ.del S . In addition, if x is a leaf node in T and( x, σ s ( ‘ )) ∈ µ.o , then x is mapped to S [ ‘ ] under µ . The character S [ ‘ ] is said to be deletedunder µ if ( ε, σ s ( ‘ )) ∈ µ.o . If x ∈ T ( µ.v ) is a leaf for which ( x, ε ) ∈ µ.o , then x is deletedunder µ . For an internal node of T , x , if every leaf in T ( x ) is deleted under µ , then x is deleted under µ , and otherwise x is kept under µ .We define two versions of the PQ-Tree Search problem: a decision version (Definition 3)and an optimisation version (Definition 4). (cid:73)
Definition 3 (Decision PQ-Tree Search) . Given a string S of length n , a PQ-tree T with m leaves, deletion limits d T , d S ∈ N , and a boolean substitution function h between Σ S and Σ T , decide if there is a one-to-one mapping M that yields a derivation of T to a substring S of S with up to d T and up to d S deletions from T and S , respectively. To define an optimization version of the
PQ-Tree Search problem it is necessary tohave a score for every possible substitution between the characters in Σ T and the charactersin Σ S . Hence, for this problem variant assume that h is a substitution scoring function , thatis, h ( σ t , σ s ) for σ t ∈ Σ T , σ s ∈ Σ S is the score for substituting σ s by σ t in the derivation, andif σ t cannot be substituted by σ s , h ( σ t , σ s ) = −∞ . In addition, we need a cost function,denoted by δ , for the deletion of a character of S and for the deletion of a leaf of T accordingto the label of the leaf. The score of a derivation µ , denoted by µ.score , is the sum of scoresof all operations (deletions from the tree, deletions from the string and substitutions) in µ . . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 7 Now, instead of deciding whether there is a one-to-one mapping that yields a derivation of T to a substring of S , we can search for the one-to-one mapping that yields the best derivation(if there exists such a derivation), i.e. a one-to-one mapping for which µ.score is the highest. (cid:73) Definition 4 (Optimization PQ-Tree Search) . Given a string of length n , S , a PQ-tree with m leaves, T , deletion limits d T , d S ∈ N , a substitution scoring function between Σ S and Σ T , h , and a deletion cost function, δ , return the one-to-one mapping, M , that yields the highestscoring derivation of T to a substring S of S with up to d T deletions from T and up to d S deletions from S (if such a mapping exists). In this section we develop a dynamic programming (DP) algorithm to solve the optimizationvariant of
PQ-Tree Search (Definition 4). Our algorithm receives as input an instanceof
PQ-Tree Search ( T, S, h, d T , d S ), where h is a substitution scoring function as definedin Section 2. Our default assumption is that deletions are not penalized, and therefore δ isnot given as input. The case where deletions are penalized is described in Appendix G. Theoutput of the algorithm is a one-to-one mapping, M , that yields the best (highest scoring)derivation of T to a substring of S with up to d T deletions from T and up to d S deletionsfrom the substring, and the score of that derivation. With a minor modification, the outputcan be extended to include a one-to-one mapping for every substring of S and the derivationsthat they yield. Brief Overview.
On a high level, our algorithm consists of three components: the mainalgorithm, and two other algorithms that are used as procedures by the main algorithm.Apart from an initialization phase, the crux of the main algorithm is a loop that traversesthe given PQ-tree, T . For each internal node x , it calls one of the two other algorithms:P-mapping (given in Section 3.3) and Q-mapping (given in Appendix F). These algorithmsfind and return the best derivations from the subtree of T rooted in x , T ( x ), to substrings of S , based on the type of x (P-node or Q-node). Then, the scores of the derivations are storedin the DP table.We now give a brief informal description of the main ideas behind our P-mapping andQ-mapping algorithms. Our P-mapping algorithm is inspired by an algorithm described byBevern et al. [40] to solve the Job Interval Selection problem. Our problem differs fromtheirs mainly in its control of deletions. Intuitively, in the P-mapping algorithm we considerthe task at hand as a packing problem, where every child of x is a set of intervals, eachcorresponding to a different substring. The objective is to pack non-overlapping intervalssuch that for every child of x at most one interval is packed. Then, the algorithm greedilyselects a child x of x and decides either to pack one of its intervals (and which one) orto pack none (in which case x is deleted). Our Q-mapping algorithm is similar to theP-mapping algorithm, but simpler. It can be considered as an interval packing algorithm aswell, however, this algorithm packs the children of x in a specific order.In the following sections, we describe the main algorithm, the P-mapping algorithm, andafterwards analyse the time complexity. The Q-mapping algorithm, which is also used as aprocedure in the main algorithm, is described in Appendix F. Approximate Search for Known Gene Clusters in New Genomes Using PQ-Trees
We now delve into more technical details. The algorithm (whose pseudocode is given inAlgorithm 2 in Appendix H) constructs a 4-dimensional DP table A of size m × n × d T + 1 × d S + 1. The purpose of an entry of the DP table, A [ j, i, k T , k S ], is to hold the highest score ofa derivation of the subtree T ( x j ) to a substring S of S starting at index i with k T deletionsfrom T ( x j ) and k S deletions from S . If no such derivation exists, A [ j, i, k T , k S ] = −∞ .Addressing A with some of its indices given as dots, e.g. A [ j, i, · , · ], refers to the subtable of A that is comprised of all entries of A whose first two indices are j and i . Some entries ofthe DP table define illegal derivations, namely, derivations for which the number of deletionsare inconsistent with the start index, i , the derived node and S . These entries are called invalid entries and their value is defined as −∞ throughout the algorithm. A more detaileddescription of the invalid entries is given in Appendix F.The main algorithm first initializes the entries of A that are meant to hold scores ofderivations of the leaves of T to every possible substring of S using the following rule. Forevery 0 ≤ k S ≤ d S and every x j ∈ leaves ( root ), do: A [ j, i, , k S ] = 0 A [ j, i, , k S ] = max i = i,...,i + k S h ( j, S [ i ])Afterwards, all other entries of A are filled as follows. Go over the internal nodes of T inpostorder. For every internal node, x , go in ascending order over every index, i , that can bea start index for the substring of S derived from T ( x ) (the possible values of i are explainedin the next paragraph). For every x and i , use the algorithm for Q-mapping or P-mappingaccording to the type of x . Both algorithms receive the same input: a substring S of S , thenode x , its children x , . . . , x γ , the collection of possible derivations of the children (denotedby D ), which have already been computed and stored in A (as will be explained ahead) andthe deletion arguments d T , d S . Intuitively, the substring S is the longest substring of S starting at index i that can be derived from T ( x ) given d T and d S . After being called, bothalgorithms return a set of derivations of T ( x ) to a prefix of S = S [ i : e ] and their scores.The set holds the highest scoring derivation for every E ( x j , i, d T , ≤ e ≤ E ( x j , i, , d S ) andfor every legal deletion combination 0 ≤ k T ≤ d T , 0 ≤ k S ≤ d S .We now explain the possible values of i and the definition of S more formally. To thisend, note that given the node x and some numbers of deletions k T and k S , the lengthof the derived substring is L ( x, k T , k S ) . = span ( x ) − k T + k S (see Appendix B). Thus, onthe one hand, a substring of maximum length is obtained when there are no deletionsfrom the tree and d S deletions from the string. Hence, S = S [ i : E ( x, i, , d S )] where E ( x, i, k T , k S ) is the function for the calculation of the end point of a derivation, defined as E ( x, i, k T , k S ) . = i − L ( x, k T , k S ). On the other hand, a shortest substring is obtainedwhen there are d T deletions from the tree and none from the string. Then, the lengthof the substring is L ( x, d T ,
0) = span ( x ) − d T . Hence, the index i runs between 1 and n − ( span ( x ) − d T ) + 1.We now turn to address the aforementioned input collection D in more detail. Formally,it contains the best scoring derivations of every child x j of x to every substring of S with upto d T and d S deletions from the tree and string, respectively. It is produced from the entries A [ j, i , k T , k S ] (where each entry gives one derivation) for all k T and k S , and all i between i and the end index of S , i.e. i ≤ i ≤ E ( x j , i, , d S ). For the efficiency of the Q-mapping andP-mapping algorithms, the derivations in D are arranged in descending order with respect totheir end point ( µ.e ). This does not increase the time complexity of the algorithm, as thisordering is received by previous calls to the Q-mapping and P-mapping algorithms. . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 9 In the final stage of the main algorithm, when the DP table is full, the score of a bestderivation is the maximum of {A [ m , i, k T , k S ] : k T ≤ d T , k S ≤ d S , 1 ≤ i ≤ n − ( span ( root ) − k T ) + 1 } (remember that x m is the root of T ). We remark that by tracing back through A the one-to-one mapping that yielded this derivation can be found. Before describing the P-mapping algorithm, we set up some terminology, which is usefulboth for the P-mapping algorithm and the Q-mapping algorithm (in Appendix F).We first define the notion of a partial derivation. In the Q-mapping and P-mappingalgorithms, the derivation of the input node, x , is built by considering subsets U of its children.With respect to such a subset U , a derivation µ of x is built as if x had only the children in U , and is called a partial derivation . Formally, µ is a partial derivation of a node x if µ.v = x and there is a subset of children U ⊆ children ( x ) such that the two following conditions aretrue. First, for every u ∈ U all the leaves in T ( u ) are neither mapped nor deleted under µ - that is, there is no mapping pair ( ‘, y ) ∈ µ.o such that ‘ ∈ leaves ( u ). Second, for every v ∈ children ( x ) \ U the leaves in T ( v ) are either mapped or deleted under µ . For every u ∈ U ,we say that u is ignored under µ . Notice that any derivation is a partial derivation, where theset of ignored nodes ( U above) is empty. Since all derivations that are computed in a singlecall to the P-mapping or Q-mapping algorithms have the same start point i , it can be omitted(for brevity) from the end point function: thus, we denote E I ( x, k T , k S ) . = L ( x, k T , k S ).Then, for a set U of nodes, we define L ( U, k T , k S ) . = P x ∈ U span ( x ) + k S − k T and accordingly E I ( U, k T , k S ) . = L ( U, k T , k S ).We now define certain collections of derivations with common properties (such as havingthe same numbers of deletions and end point). (cid:73) Definition 5.
The collection of all the derivations of every node u ∈ U to suffixes of S [1 : E I ( U, k T , k S )] with exactly k T deletions from the tree and exactly k S deletions fromthe string is denoted by D ( U, k T , k S ) . (cid:73) Definition 6.
The collection of all the best derivations from the nodes in U to suffixesof S [1 : E I ( U, k T , k S )] with up to k T deletions from the tree and up to k S deletions fromthe string is denoted by D ≤ ( U, k T , k S ) . Specifically, for every node u ∈ U , k T ≤ k T and k S ≤ k S , the set D ≤ ( U, k T , k S ) holds only one highest scoring derivation of u to a suffix of S [1 : E I ( U, k T , k S )] with k T and k S deletions from the tree and string, respectively. It is important to distinguish between these two definitions. First, the derivations in D ( U, k T , k S ) have exactly k T and k S deletions, while the derivations in D ≤ ( U, k T , k S ) have up to k T and k S deletions. Second, in D ( U, k T , k S ) there can be several derivations that differonly in their score and in the one-to-one mapping that yields them, while in D ≤ ( U, k T , k S ),there is only one derivation for every node u ∈ U and deletion combination pair ( k T , k S ).Note that the end points of all of the derivations are equal.Definition 5 is used for describing the content of an entry of the DP table, where thefocus is on the collection of all the derivations of x to S with exactly k T and k S deletions, D ( { x } , k T , k S ). For simplicity, the abbreviation D ( u, k T , k S ) = D ( { u } , k T , k S ) is used. In D ≤ ( U, k T , k S ) can be defined using Definition 5: D ≤ ( U, k T , k S ) = [ u ∈ U [ k T ≤ k T [ k S ≤ k S max µ ∈D ( U,k T ,k S )s . t .µ.del T = k T µ.del S = k S µ.v = u µ.score . every step of the P-mapping and Q-mapping algorithms, a different set of derivations ofthe children of x is examined, thus, Definition 6 is used for U ⊆ children ( x ). In addition,the set of derivations D that is received as input to the algorithms can be described usingDefinition 6 as can be seen in Equation (1) below. In this equation, the union is over all U ⊆ children ( x ) because in this way the derivations of all the children of x with every possibleend point are obtained (in contrast to having only U = children ( x ), which results in thederivations of all the children of x with the end point E I ( children ( x ) , k T , k S )). D = [ U ⊆ children ( x ) [ k T ≤ d T [ k S ≤ d S D ≤ ( U, k T , k S ) (1)In the P-mapping algorithm for C ⊆ children ( x ), the notation x ( C ) is used to indicatethat the node x is considered as if its only children are the nodes in C . Consequentially,the span of x ( C ) is defined as span ( x ( C ) ) . = P c ∈ C span ( c ), and the set D ( x ( C ) , k T , k S ) (inDefinition 5 where U = { x ( C ) } ) now refers to a set of partial derivations. Recall that the input consists of an internal P-node x , a string S , limits on the number ofdeletions from the tree T and the string S , d T and d S , respectively, and a set of derivations D (see Equation (1)). The output is S k T ≤ d T S k S ≤ d S arg max µ ∈D ( x,k T ,k S ) µ.score , which isthe collection of the best scoring derivations of x to every possible prefix of S having up to d T and d S deletions from the tree and string, respectively. Thus, there are O ( d T d S ) derivationsin the output. The pseudocode of our algorithm is given in Algorithm 3 in Appendix H.The algorithm constructs a 3-dimensional DP table P , which has an entry for every0 ≤ k T ≤ d T , 0 ≤ k S ≤ d S and subset C ⊆ children ( x ). The purpose of an entry P [ C, k T , k S ]is to hold the best score of a partial derivation in D ( x ( C ) , k T , k S ), i.e. a partial derivationrooted in x ( C ) to a prefix of S with exactly k T deletions from the tree and k S deletions fromthe string. The children of x that are not in C are ignored (as defined in Section 3.2) underthe partial derivation stored by the DP table entry P [ C, k T , k S ], thus they are neither deletednor counted in the number of deletions from the tree, k T . (They will be accounted for in thecomputation of other entries of P .) Similarly to the main algorithm, some of the entries of P are invalid, and their value is defined as −∞ (for more information see Appendix F). Forlack of space, the description of the initialization of P is deferred to Appendix C.After the initialization, the remaining entries of P are calculated using the recursion rulein Equation (2) below. The order of computation is ascending with respect to the size of thesubsets C of the children of x , and for a given C ⊆ children ( x ), the order is ascending withrespect to the number of deletions from both tree and string. P [ C, k T , k S ] = max P [ C, k T , k S − µ ∈D ≤ ( C,k T ,k S ) P [ C \ { µ.v } , k T − µ.del T , k S − µ.del S ] + µ.score (2)Intuitively, every entry P [ C, k T , k S ] defines some index i of S that is the end point of everypartial derivation in D ( x ( C ) , k T , k S ). Thus, S [ i ] must be a part of any partial derivation µ ∈ D ( x ( C ) , k T , k S ), so, either S [ i ] is deleted under µ or it is mapped under µ . The formeroption is captured by the first case of the recursion rule. If S [ i ] is mapped under µ , thendue to the hierarchical structure of T ( x ), it must be mapped under some derivation µ of oneof the children of x that are in C . Thus we receive the second case of the recursion rule. Weremark that the case of a node deletion is captured by the initialization (further explanationcan be found in Appendix D). . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 11 Once the entire DP table is filled, a derivation of maximum score for every end point anddeletion number combination can be found in P [ children ( x ) , · , · ]. For the output derivationsto be ordered with respect to their end point, they need to be extracted by traversing P [ children ( x ) , · , · ] in the order described in Appendix F and exemplified in Table S1.The time complexity analysis of the algorithm can be found in Appendix H.5, and theproof of correctness can be found in Appendix H.2. In this section we compare the time complexity of the main algorithm (in Section 3.1) to thenaïve solution for
PQ-Tree Search . We note that the proof of correctness of the algorithmcan be found in Appendix H.1. The proof of Lemma 7 below is given in Appendix H.4. (cid:73)
Lemma 7.
The algorithm in Section 3.1 runs in O ( nγd T d S ( m p γ + m q )) time and O ( d T d S ( mn + 2 γ )) space, where γ is the maximum degree of a node in T . Thus, it is proven that
PQ-Tree Search has an
FPT solution with the parameter γ (Theorem 8). (cid:73) Theorem 8.
PQ-Tree Search with parameter γ is FPT . Particularly, it has an
FPT algorithm that runs in O ∗ (2 γ ) time . The naïve solution for
PQ-Tree Search and its time complexity analysis are givenin Appendix E. There we show that it solves
PQ-Tree Search in O (2 m q ( γ !) m p nm ( d T + d S ) d T d S ) time. We conclude that the time complexity of our algorithm is substantiallybetter, exemplified by considering two complementary cases. One, when there are onlyP-nodes in T (i.e. m = m p ), the naïve algorithm is super-exponential in γ , and even worse,exponential in m , while ours is exponential only in γ , and hence polynomial for any γ that isconstant (or even logarithmic in the input size). Second, when there are only Q-nodes in T (i.e. m = m q ), the naïve algorithm is exponential while ours is polynomial. Dataset and Gene Cluster Generation. ,
487 fully sequenced prokaryotic strains withCOG ID annotations were downloaded from GenBank (NCBI; ver 10/2012). Among thesestrains, 471 genomes included a total of 933 plasmids.The gene clusters were generated using the tool CSBFinder-S [36]. CSBFinder-S wasapplied to all the genomes in the dataset after removing their plasmids, using parameters q = 1 (a colinear gene cluster is required to appear in at least one genome) and k = 0 (noinsertions are allowed in a colinear gene cluster), resulting in 595,708 colinear gene clusters.Next, ignoring strand and gene order information, colinear gene clusters that contain theexact same COGs were united to form the generalized set of gene clusters. The resultinggene clusters were then filtered to 26,270 gene clusters that appear in more than 30 genomes. Generation of PQ-Trees.
The generation of PQ-trees was performed using a program[19] that implements the algorithm described in [24] for the construction of a PQ-tree froma list of strings comprised from the same set of characters. In the case where a characterappeared more than once in a training string, the PQ-tree with the minimum consistent The notation O* is used to hide factors polynomial in the input size. frontier size was chosen. The generated PQ-trees varied in size and complexity. The lengthof their frontier ranged between 4 and 31, and the size of their consistent frontier rangedbetween 4 and 362 , Implementation and Performance.
PQFinder is implemented in Java 1.8. The runswere performed on an Intel Xeon X5680 machine with 192 GB RAM. The time it took to runall plasmid genomes against one PQ-tree ranged between 5 .
85 seconds (for a PQ-tree with aconsistent frontier of size 4) and 181 . , Substitution Scoring Function.
The substitution scoring function reflects the distancebetween each pair of COGs, that is computed based on sentences describing the functionalannotation of the COGs (e.g., "ABC-type sugar transport system, ATPase component").The "Bag of Words model" was employed, where the functional description of each COGis represented by a sparse vector that is normalized to have a unit Euclidean norm. First,each COG description was tokenized and the occurrences of tokens in each description wascounted and normalized using tf–idf term weighting. Then, the cosine similarity betweeneach two vectors was computed, resulting in similarity scores ranging between 0 and 1. Thesentences describing COGs are short, therefore each word largely influences the score, evenafter the tf–idf term weighting. Therefore, words that do not describe protein functions thatwere found in the top 30 most common words in the description of all COGs were used asstop-words. Two COGs with the same COG IDs were set to have a score of 1.1, and thesubstitution score between a gene with no COG annotation to any other COG was set to be-0.1. Two COGs with a zero score were penalized to have a score of -0.2 and the deletion of aCOG from the query or the target string was set to have a score of zero.
Enrichment Analysis.
For each of the four variants in Figure 2.C, a hypergeometric testwas performed to measure the enrichment of the corresponding variant in one of the classesin which it appears. A total of 10 p-values were computed and adjusted using the Bonferronicorrection; two p-values were found significant (<0 . Specificity Score.
We define a specificity score for a PQ-tree T of a gene cluster namedS-score. Let ˜ T be the least specific PQ-tree that could have been generated for the genesof the gene cluster based on which T was constructed. Namely, a PQ-tree that allows allpermutations of said genes, has height 1, is rooted in a P-node whose children (being theleaves of the tree) are the leaves of T . Thus, the S-score of T is | C ( ˜ T ) || C ( T ) | . For a gene cluster ofpermutations (i.e. there are no duplications), the computation of | C ( T ) | is as described inEquation (3), where the set of P-nodes in T is denoted by T.p . | C ( T ) | = 2 m q · Y x ∈ T.p | children ( x ) | ! (3)For a gene cluster that has duplications, the set C ( T ) is generated to learn its size. Let a ( ‘, T ) denote the number of appearances of the label ‘ in the leaves of T and let labels ( T )denote the set of all labels of the leaves of T . So, the formula for | C ( ˜ T ) | is as in Equation (4).Clearly, for T with no duplications | C ( ˜ T ) | = | F ( T ) | !. | C ( ˜ T ) | = | F ( T ) | ! Q ‘ ∈ labels ( T ) a ( ‘, T )! (4) . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 13 The labeling of each internal node of a PQ-tree as P or Q, is learned during the constructionof the tree, based on some interrogation of the gene orders from which the PQ-tree is trained[24]. As a result, the set of strings that can be derived from a PQ-tree T , consists of twoparts: (1) all the strings representing the known gene orders from which T was constructed,and (2) additional strings, denoted tree-guided rearrangements , that do not appear in theset of gene orders constructing T , but can be obtained via rearrangement operations thatare constrained by T . Thus, the tree-guided rearrangements conserve the internal topologyproperties of the gene cluster, as learned from the corresponding gene orders during theconstruction of T , such that colinear dependencies among genes and between sub-operonsare preserved in the inferred gene orders.In this section, we used the PQ-trees constructed from chromosomal gene clusters, toexamine whether tree-guided rearrangements can be found in plasmids. The objective was todiscover gene orders in plasmids that abide abide by a PQ-tree representing a chromosomalgene cluster, and differ from all the gene orders participating in the PQ-tree’s construction.PQ-trees that are constructed from gene clusters that have only one gene order or gene clusterswith less than four COGs cannot generate gene orders that differ from the ones participatingin their construction. Therefore, only 779 out of 26,270 chromosomal gene clusters were usedfor the construction of query PQ-trees (the generation of the chromosomal gene clusters isdetailed in Section 4). Using our tool PQFinder that implements the algorithm proposed forsolving the PQ-Tree Search problem, the query PQ-trees were run as queries against allplasmid genomes. This benchmark was run conservatively without allowing substitutions ordeletions from the PQ-tree or from the target string. 380 of the query gene clusters werefound in at least one plasmid. The instances of these gene clusters in plasmids are provided inthe Supplementary Materials as a session file that can be viewed using the tool CSBFinder-S[36].Tree-guided rearrangements were found among instances of 29 gene clusters. The PQ-treescorresponding to these gene clusters were sorted by a decreasing S-score, where higher scoresare given to a more specific tree (details in Section 4). In this setting, the higher the S-score,the smaller the number of possible gene orders that can be derived from the respectivePQ-tree. Interestingly, 21 out of these 29 gene clusters code for transporters, namely 20importers (ABC-type transport systems) and one exporter (efflux pump). The 10 top rankingresults are presented in Table 1.We selected the third top-ranking PQ-tree in Table 1 for further analysis. This PQ-treewas constructed from 7 gene orders of a gene cluster that encodes a heavy metal effluxpump. This gene cluster was found in the chromosomes of 79 genomes (represented by the 7distinct gene orders mentioned above) and in the plasmids of 7 genomes. The tree-guidedrearrangement instance was found in the strain
Cupriavidus metallidurans CH34 , isolatedfrom an environment polluted with high concentrations of several heavy metals. This straincontains two large plasmids that confer resistance to a large number of heavy metals such aszinc, cadmium, copper, cobalt, lead, mercury, nickel and chromium. We hypothesize that therearrangement event could have been caused by a heavy metal stress [41]. In the followingsection we will focus on this PQ-tree to further study its different variants in plasmids.
PQ-tree S-score Functional Category1 [[0683 [[0411 0410] [0559 4177]]] 0583] 22.5 5 (2) Amino acid transport2 (1609 [1653 1175 0395] 3839) 10.0 10 (2) Carbohydrate transport3 [[1538 [3696 0845]] [0642 0745]] 7.5 7 (1) Heavy metal efflux4 [[2115 1070] [4213 [1129 4214]]] 7.5 1 (1) Carbohydrate transport5 [1960 [[2011 1135] [2141 1464]]] 7.5 3 (1) Amino acid transport6 [[0596 0599] [[3485 3485] 0015]] 7.5 9 (1) Metabolism7 [[[1129 1172 1172] 1879] 3254] 7.5 6 (1) Carbohydrate transport8 (1609 1869 [[1129 1172] 1879] 0524) 7.5 1 (1) Carbohydrate transport9 (0683 [0559 4177] [0411 0410] 0318) 7.5 1 (1) Amino acid transport10 (3839 0673 [[0395 1175] 1653]) 5.0 10 (1) Carbohydrate transport
Table 1
Ten top ranked PQ-trees for which tree-guided rearrangements were found in plasmids. Square brackets represent a Q-node; round brackets represent a P-node. Numbers indicate therespective COG IDs. This column indicates the number of genomes harboring plasmid instancesof the respective PQ-tree. The number in brackets indicates the number of genomes harboringa tree-guided gene rearrangement of the corresponding gene cluster. The full table can be foundin Table S2.
The heavy metal efflux pump examined in the previous section (corresponding to the thirdtop-ranking PQ-tree in Table 1), was used as a PQFinder query and re-ran against all theplasmids in our dataset in order to discover approximate instances of this gene cluster, possiblyencoding remotely related variations of the efflux pump it encodes. This time, in order toincrease sensitivity, a semantic substitution scoring function (described in Section 4) wasused, and the parameters were set to d T = 1 (up to one deletion from the tree, representingmissing genes) and d S = 3 (up to three deletions from the plasmid, representing intrudinggenes). An instance of a gene cluster is accepted if it was derived from the correspondingPQ-tree with a score that is higher than 0.75 of the highest possible score attainable by thequery. The plasmid instances detected by PQFinder are displayed in Figure S7.Heavy metal efflux pumps are involved in the resistance of bacteria to a wide rangeof toxic metal ions [27] and they belong to the resistance-nodulation-cell division (RND)family. In Gram-negative bacteria, RND pumps exist in a tripartite form, comprised froman outer-membrane protein (OMP), an inner membrane protein (IMP), and a periplasmicmembrane fusion protein (MFP) that connects the other two proteins. In some cases, thegenes of the RND pump are flanked with two regulatory genes that encode the factors of atwo-component regulatory system comprising a sensor/histidine kinase (HK) and responseregulator (RR) (Figure 2.B). This regulatory system responds to the presence of a substrate,and consequently enhances the expression of the efflux pump genes.The PQ-tree of this gene cluster (Figure 2.A) shows that the COGs encoding the IMPand MFP proteins always appear as an adjacent pair, the OMP COG is always adjacent tothis IMP-MFP pair, and the HK and RR COGs appear as a pair downstream or upstreamto the other COGs. COG3696, which encodes the IMP protein, is annotated as a heavymetal efflux pump protein, while the other COGs are common to all RND efflux pumps.Therefore, it is very likely that the respective gene cluster corresponds to a heavy metalRND pump. The absence of an additional periplasmic protein likely indicates that this genecluster encodes a Czc-like efflux pump that exports divalent metals such as the cobalt, zincand cadmium exporter in Cupriavidus metallidurans [27] (Figure 2.C(1)). . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 15
PQFinder discovered instances of this gene cluster in the plasmids of 12 genomes (Figures2.C(1) and 2.D), and it is significantly enriched in the β -proteobacteria class (hypergeometricp-value= 1 . × − , Bonferroni corrected p-value = 1 . × − ). In addition, three othervariants of RND pumps were found as instances of the query gene cluster (Figure 2.C(2-4)).The plasmids of three genomes contained instances that were missing the COG correspondingto the OMP gene CzcC (Figure 2.C(2)). This could be caused by a low quality sequencingor assembly of these plasmids. An alternative possible explanation is that a Czc-like effluxpump can still be functional without CzcC; a previous study showed that the deletion ofCzcC resulted in the loss of cadmium and cobalt resistance, but most of the zinc resistancewas retained [27].Some instances identified by the query, found in the plasmids of six genomes, seem toencode a different heavy metal efflux pump (Figure 2.C(3)). This variant includes all COGsfrom the query, in addition to an intruding COG that encodes a periplasmic protein (CusF).This protein is a predicted copper usher that facilitates access of periplasmic copper towardsthe heavy metal efflux pump. Indeed, the genomic region of Cus-like efflux pumps that exportmonovalent metals, such as the silver and copper exporter in Escherichia coli , include thisperiplasmic protein, in contrast to the Czc-like efflux pump [27]. This variant was found in theplasmids of six bacterial genomes belonging to the class γ -proteobacteria (Figure 2.D). Thisgene cluster is significantly enriched in the γ -proteobacteria class (hypergeometric p-value=2 . × − , Bonferroni corrected p-value = 2 . × − ). Surprisingly, all of these strains,except for one, are annotated as human or animal pathogens. Interestingly, previous studiessuggest that the host immune system exploits excess copper to poison invading pathogens[18], which can explain why these pathogens evolved copper efflux pumps.Another variant of the pump, appearing in five genomes (Figures 2.C(4) and 2.D), resultedfrom a substitution of the query IMP gene (COG3696) by a different IMP gene (COG0841)belonging to the multidrug efflux pump AcrAB/TolC. The AcrAB-TolC system, mainlystudied in Escherichia coli , transports a diverse array of compounds with little chemicalsimilarity [13]. AcrAB/TolC is an example of an intrinsic non-specific efflux pump, which iswidespread in the chromosomes of Gram-negative bacteria, and likely evolved as a generalresponse to environmental toxins [35]. In this case, the query gene cluster and the identifiedvariant share all COGs, except for the COGs encoding the IMP genes. The differing COGsare responsible for substrate recognition, which naturally differs between the two pumps, asone pump exports heavy metal while the other exports multiple drugs. When consideringthe functional annotation of these two COGs, we see that the query metal efflux pumpCOG encoding the IMP gene is annotated as "Cu/Ag efflux pump CusA", while in themultidrug efflux pump the COG encoding the IMP gene is annotated as "Multidrug effluxpump subunit AcrB". Thus, in spite of the difference in substrate specificity, the semanticsimilarity measure employed by PQFinder was able to reflect their functional similarity andallowed the substitution between them, while conferring to the structure of the PQ-tree.
In this paper, we defined a new problem in comparative genomics, denoted
PQ-TreeSearch . The objective of
PQ-Tree Search is to identify approximate new instancesof a gene cluster in a new genome S . In our model, the gene cluster is represented by aPQ-tree T , and the approximate instances can vary from the known gene orders by genomerearrangements that are constrained by T , by gene substitutions that are governed by agene-to-gene substitution scoring function h , and by gene deletions and insertions that are Efflux Pump
IMPMFP
OMP HK RegulationRR A. CzcC CzcB CzcA CzcR CzcS
CzcB CzcA CzcR CzcS
CusS CusR CusC CusF CusB CusA
AcrB AcrA TolC OmpR EnvZ 𝛼 𝛽 𝛾 𝛿 𝐴. (1)(2)(3) (4) Outer Membrane
Inner Membrane OMPMFP
SubstrateHK RR Periplasm
Cytoplasm OMP
MFP
IMP
MFP
IMP B. C. D. Figure 2 A. A PQ-tree of a heavy metal RND efflux pump, corresponding to the third top scoringresult in Table 1. B. An illustration of an RND efflux pump consisting of an outer-membrane protein(OMP), an inner membrane protein (IMP), and a periplasmic membrane fusion protein (MFP) thatconnects the other two proteins. In addition, a two-component regulatory system consisting of asensor/histidine kinase (HK) and response regulator (RR) enhances the transcription of the effluxpump genes. C. Representatives of the three different RND efflux pumps found in plasmids. (1)
ACzc-like heavy metal efflux pump, (2)
A Czc-like heavy metal efflux pump with a missing OMPgene, (3)
A Cus-like heavy metal efflux pump, (4)
An Acr-like multidrug efflux pump. Additionaldetails can be found in the text. D. The presence-absence map of the three types of efflux pumpsfound in the plasmids of different genomes. The rows correspond to the rows in (C), the columnscorrespond to the genomes in which instances were found, organized according to their taxonomicclasses. A black cell indicates that the corresponding efflux pump is present in the plasmids of thegenome. The labels below the map indicate the classes α, β, γ, δ -Proteobacteria and Acidobacteriia. bounded from above by integer parameters d T and d S , respectively.We proved that the PQ-Tree Search problem is NP -hard and proposed a parameterizedalgorithm that solves it in O ∗ (2 γ ) time, where γ is the maximum degree of a node in T and O ∗ is used to hide factors polynomial in the input size.The proposed algorithm was implemented as a publicly available tool and harnessed tosearch for tree-guided rearrangements of chromosomal gene clusters in plasmids. We identified29 chromosomal gene clusters that are rearranged in plasmids, where the rearrangements areguided by the corresponding PQ-tree. One of those gene clusters, coding for a heavy metalefflux pump, was further analysed to characterize its approximate instances in plasmids.An interesting variant of the analysed gene cluster, found among its approximate instances,corresponds to a copper efflux pump. It was found mainly in pathogenic bacteria, and likelyconstitutes a bacterial defense mechanism against the host immune response. These resultsexemplify how our tool can be harnessed to find meaningful variations of known biologicalsystems that are conserved as gene clusters, suggesting that PQ-Tree Search can befurther utilized in the domain of comparative functional analysis.One of the downsides to using PQ-trees to represent gene clusters is that very rare geneorders taken into account in the tree construction could greatly increase the number ofallowed rearrangements and thus substantially lower the specificity of the PQ-tree. Thus, . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 17 a natural continuation of our research would be to increase the specificity of the model byconsidering a stochastic variation of
PQ-Tree Search . Namely, defining a PQ-tree in whichthe internal nodes hold the probability of each rearrangement, and adjusting the algorithmfor
PQ-Tree Search accordingly. In addition, future extensions of this work could also aimto increase the sensitivity of the model by taking into account gene duplications, gene-mergeand gene-split events, which are typical events in gene cluster evolution.
References Zaky Adam, Monique Turmel, Claude Lemieux, and David Sankoff. Common intervals andsymmetric difference in a model-free phylogenomics, with an application to streptophyteevolution.
Journal of Computational Biology , 14(4):436–445, 2007. Farid Alizadeh, Richard M Karp, Deborah K Weisser, and Geoffrey Zweig. Physical mappingof chromosomes using unique probes.
Journal of Computational Biology , 2(2):159–184, 1995. Severine Bérard, Anne Bergeron, Cedric Chauve, and Christophe Paul. Perfect sorting byreversals is not always difficult.
IEEE/ACM Transactions on Computational Biology andBioinformatics , 4(1):4–16, 2007. Anne Bergeron, Mathieu Blanchette, Annie Chateau, and Cedric Chauve. Reconstructingancestral gene orders using conserved intervals. In
International Workshop on Algorithms inBioinformatics , pages 14–25. Springer, 2004. Anne Bergeron, Sylvie Corteel, and Mathieu Raffinot. The algorithmic of gene teams. In
International Workshop on Algorithms in Bioinformatics , pages 464–476. Springer, 2002. Anne Bergeron, Yannick Gingras, and Cedric Chauve. Formal models of gene clusters.
Bioinformatics Algorithms: Techniques and Applications , 8:177–202, 2008. Anne Bergeron, Julia Mixtacki, and Jens Stoye. Reversal distance without hurdles andfortresses. In
Annual Symposium on Combinatorial Pattern Matching , pages 388–399. Springer,2004. Sebastian Böcker, Katharina Jahn, Julia Mixtacki, and Jens Stoye. Computation of mediangene clusters.
Journal of Computational Biology , 16(8):1085–1099, 2009. Kellogg S Booth and George S Lueker. Testing for the consecutive ones property, intervalgraphs, and graph planarity using pq-tree algorithms.
Journal of Computer and SystemSciences , 13(3):335–379, 1976. Thomas Christof, Michael Jünger, John Kececioglu, Petra Mutzel, and Gerhard Reinelt. Abranch-and-cut approach to physical mapping of chromosomes by unique end-probes.
Journalof Computational Biology , 4(4):433–447, 1997. Marek Cygan, Fedor V. Fomin, Lukasz Kowalik, Daniel Lokshtanov, Dániel Marx,Marcin Pilipczuk, Michal Pilipczuk, and Saket Saurabh.
Parameterized Algorithms .Springer, 2015. URL: http://dx.doi.org/10.1007/978-3-319-21275-3 , doi:10.1007/978-3-319-21275-3 . Rodney G. Downey and Michael R. Fellows.
Fundamentals of Parameterized Complex-ity . Texts in Computer Science. Springer, 2013. URL: http://dx.doi.org/10.1007/978-1-4471-5559-1 , doi:10.1007/978-1-4471-5559-1 . Dijun Du, Zhao Wang, Nathan R James, Jarrod E Voss, Ewa Klimont, Thelma Ohene-Agyei,Henrietta Venter, Wah Chiu, and Ben F Luisi. Structure of the AcrAB–TolC multidrug effluxpump.
Nature , 509(7501):512–515, 2014. William G Eberhard. Evolution in bacterial plasmids and levels of selection.
The QuarterlyReview of Biology , 65(1):3–22, 1990. Revital Eres, Gad M Landau, and Laxmi Parida. A combinatorial approach to automaticdiscovery of cluster-patterns. In
International Workshop on Algorithms in Bioinformatics ,pages 139–150. Springer, 2003. Fedor V Fomin, Daniel Lokshtanov, Saket Saurabh, and Meirav Zehavi.
Kernelization: Theoryof Parameterized Preprocessing . Cambridge University Press, 2019. Marco Fondi, Giovanni Emiliani, and Renato Fani. Origin and evolution of operons andmetabolic pathways.
Research in Microbiology , 160(7):502–512, 2009. Yue Fu, Feng-Ming James Chang, and David P Giedroc. Copper transport and trafficking atthe host–bacterial pathogen interface.
Accounts of Chemical Research , 47(12):3605–3613, 2014. Lev Gourevitach. A program for pq-tree construction. github.com/levgou/pqtrees . Susu He, Michael Chandler, Alessandro M Varani, Alison B Hickman, John P Dekker, andFred Dyda. Mechanisms of evolution in high-consequence drug resistance plasmids. mBio ,7(6):e01987–16, 2016. Xin He and Michael H Goldwasser. Identifying conserved gene clusters in the presence ofhomology families.
Journal of Computational Biology , 12(6):638–656, 2005. Steffen Heber and Jens Stoye. Algorithms for finding gene clusters. In
International Workshopon Algorithms in Bioinformatics , pages 252–263. Springer, 2001. J Mark Keil. On the complexity of scheduling tasks with discrete starting times.
OperationsResearch Letters , 12(5):293–295, 1992. Gad M Landau, Laxmi Parida, and Oren Weimann. Gene proximity analysis across wholegenomes via pq trees.
Journal of Computational Biology , 12(10):1289–1306, 2005. William W Metcalf and Barry L Wanner. Evidence for a fourteen-gene, phnC to phnP locusfor phosphonate metabolism in escherichia coli.
Gene , 129(1):27–32, 1993. Kazuo Nakajima and S Louis Hakimi. Complexity results for scheduling tasks with discretestarting times.
Journal of Algorithms , 3(4):344–361, 1982. Dietrich H Nies. Efflux-mediated heavy metal resistance in prokaryotes.
FEMS MicrobiologyReviews , 27(2-3):313–339, 2003. Vic Norris and Annabelle Merieau. Plasmids as scribbling pads for operon formation andpropagation.
Research in Microbiology , 164(7):779–787, 2013. Alex Orlek, Nicole Stoesser, Muna F Anjum, Michel Doumith, Matthew J Ellington, Tim Peto,Derrick Crook, Neil Woodford, A Sarah Walker, Hang Phan, et al. Plasmid classification in anera of whole-genome sequencing: application in studies of antibiotic resistance epidemiology.
Frontiers in Microbiology , 8:182, 2017. Laxmi Parida. Using pq structures for genomic rearrangement phylogeny.
Journal of Compu-tational Biology , 13(10):1685–1700, 2006. Gerard Salton, Anita Wong, and Chung-Shu Yang. A vector space model for automaticindexing.
Communications of the ACM , 18(11):613–620, 1975. Thomas Schmidt and Jens Stoye. Quadratic time algorithms for finding common intervals intwo and more sequences. In
Combinatorial Pattern Matching , pages 347–358. Springer, 2004. Frits CR Spieksma. On the approximability of an interval scheduling problem.
Journal ofScheduling , 2(5):215–227, 1999. Frits CR Spieksma and Yves Crama.
The complexity of scheduling short tasks with few startingtimes . Rijksuniversiteit Limburg. Vakgroep Wiskunde, 1992. Mark C Sulavik, Chad Houseweart, Christina Cramer, Nilofer Jiwani, Nicholas Murgolo,Jonathan Greene, Beth DiDomenico, Karen Joy Shaw, George H Miller, Roberta Hare, et al.Antibiotic susceptibility profiles of escherichia coli strains lacking multidrug efflux pump genes.
Antimicrobial Agents and Chemotherapy , 45(4):1126–1136, 2001. Dina Svetlitsky, Tal Dagan, and Michal Ziv-Ukelson. Discovery of multi-operon colinear syntenicblocks in microbial genomes.
Bioinformatics , 2020. doi:10.1093/bioinformatics/btaa503 . Roman L Tatusov, Michael Y Galperin, Darren A Natale, and Eugene V Koonin. The cogdatabase: a tool for genome-scale analysis of protein functions and evolution.
Nucleic AcidsResearch , 28(1):33–36, 2000. Tatiana Tatusova, Stacy Ciufo, Boris Fedorov, Kathleen O’Neill, and Igor Tolstoy. Refseqmicrobial genomes database: new representation and annotation strategy.
Nucleic AcidsResearch , 42(D1):D553–D559, 2014. Takeaki Uno and Mutsunori Yagiura. Fast algorithms to enumerate all common intervals oftwo permutations.
Algorithmica , 26(2):290–309, 2000. . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 19 René van Bevern, Matthias Mnich, Rolf Niedermeier, and Mathias Weller. Interval schedulingand colorful independent sets.
Journal of Scheduling , 18(5):449–469, Oct 2015. doi:10.1007/s10951-014-0398-5 . Joachim Vandecraen, Michael Chandler, Abram Aertsen, and Rob Van Houdt. The impactof insertion sequences on bacterial genome plasticity and adaptability.
Critical Reviewsin Microbiology , 43(6):709–730, 2017. PMID: 28407717. arXiv:https://doi.org/10.1080/1040841X.2017.1303661 , doi:10.1080/1040841X.2017.1303661 . Alice R Wattam, David Abraham, Oral Dalay, Terry L Disz, Timothy Driscoll, Joseph LGabbard, Joseph J Gillespie, Roger Gough, Deborah Hix, Ronald Kenyon, et al. Patric, thebacterial bioinformatics database and analysis resource.
Nucleic Acids Research , 42(D1):D581–D591, 2014. Jonathan N Wells, L Therese Bergendahl, and Joseph A Marsh. Operon gene order is optimizedfor ordered protein complex assembly.
Cell Reports , 14(4):679–685, 2016. Sascha Winter, Katharina Jahn, Stefanie Wehner, Leon Kuchenbecker, Manja Marz, JensStoye, and Sebastian Böcker. Finding approximate gene clusters with gecko 3.
Nucleic AcidsResearch , 44(20):9600–9610, 2016.
A PQ-Tree Search is NP-Hard
In this section we prove Theorem 9 by describing a reduction from the
Job IntervalSelection problem (
JISP ) to
PQ-Tree Search . (cid:73) Theorem 9.
PQ-Tree Search is NP -hard. JISP was introduced by Nakajima and Hakimi [26]. They considered one machine and acollection of non-preamble jobs, denoted 1 , . . . , n , that need to be executed on that machine.Each job i has an execution time t i and k i possible starting times, ( s i , . . . , s i ki ). Note thatevery t i and s i j define an interval on the real line: [ s i j , s i j + t i ]. The aim is to allocate astarting time for each job such that no two jobs will run simultaneously on the machine. The Job Interval Selection problem (
JISP ) with k intervals per job was named JISP k [33].Since its initial definition, the problem has seen many equivalent definitions [23, 33, 34, 40].We use the following formulation for JISP k based on colors. In this setting, each job i isencoded as a k -tuple of intervals on the real line having the color i . Let γ be the number ofcolors, hence there are γ jobs to be executed. The notation I ij is used to denote the intervalwith starting time s ij finishing time f ij (i.e. duration [ s ij , f ij ]) and color 1 ≤ i ≤ γ (i.e. itis a part of the i th k -tuple). The objective is to select exactly one interval of each color( k -tuple) such that no two intervals intersect. JISP3 was shown to be NP -complete by Keil [23]. Crama et al. [34] showed that JISP3 is NP -complete even if all intervals are of length 2. We use these results to show that PQ-TreeSearch is NP -hard. The Reduction.
Given an instance, J , of JISP3 where all intervals have length 2, aninstance of PQ-Tree Search is created. It is easy to see that shifting all intervals by someconstant does not change the problem. Hence, assume that the leftmost starting intervalstarts at 1. Let L be the rightmost ending point of an interval, so the focus can be only onthe segment [1 , L ] of the real line. Now, an instance of PQ-Tree Search ( T, S, h, d T , d S )is constructed (an illustrated example is given in Figure S1 below): The PQ-tree T : The root node, root , is a P-node with 3 L − − γ children: x , . . . ,x γ ,y , . . . ,y L − − γ . The children of root are defined as follows: for every color 1 ≤ i ≤ γ ,create a Q-node x i with four children x si , x ai , x bi , x fi ; for every index 1 ≤ i ≤ L − − γ ,create a leaf y i . The string S : Define S = σ σ a σ b σ σ a σ b . . . σ a σ b σ L . The substitution function h : for every interval of the color i , I ij = [ s ij , f ij ], thefunction h returns T rue for the following pairs: ( x si , σ s ij ), ( x fi , σ f ij ), ( x ai , σ a ) and ( x bi , σ b ).In addition, every leaf y r can be substituted by every letter of S , namely for every index1 ≤ r ≤ L − − γ and for every s ∈ { a, b, , . . . , L } the function h returns T rue for thepair ( y r , σ s ). For every other pair h returns F alse . For the optimization version of theproblem, define a scored substitution function h , such that h ( u, v ) = 1 if h ( u, v ) = T rue and h ( u, v ) = −∞ if h ( u, v ) = F alse . Number of deletions:
Define d T = 0 and d S = 0, i.e. deletions are forbidden fromboth tree and string.An example of the reduction is shown in Figure S1. A collection of two 3-tuples (oneblue and one red) where each interval is of length 2, i.e a JISP3 instance, is in Figure S1a.Running the reduction algorithm yields the
PQ-Tree Search instance in Figure S1b. Thepairs that can be substituted (i.e. the pairs for which h returns T rue ) are given by thelines connecting the leafs of the PQ-tree and the letters of the string S . The nodes and . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 21 I I I I I I (a)(b) Figure S1 (a) The input of the reduction - a
JISP3 instance J with intervals of length 2. (b) The output of the reduction - a
PQ-Tree Search instance (
T, S, h, d T , d S ). substitutable pairs created due to the blue and red intervals in the JISP3 instance aremarked in blue and red, respectively. The substitutable pairs containing a y node are markedin gray. Note that the colors given in Figure S1b are not a part of the PQ-Tree Search instance, and are given for convenience.
Correctness.
Let J be an instance of JISP3 , and let (
T, S, h, d T , d S ) be the output of thereduction on this instance. We prove that there exists a collection of intervals that is a solutionfor J if and only if there exists a one-to-one mapping that is a solution to ( T, S, h, d T , d S ). One Direction.
Suppose that there exists a solution to the output instance of
PQ-TreeSearch of the reduction, (
T, S, h, d T , d S ). This solution is a one-to-one mapping M : forevery 1 ≤ i ≤ γ , a set of pairs of the form ( x ji , σ k ( ‘ )) for j ∈ { s, f, a, b } , and for every1 ≤ r ≤ L − − γ , pairs of the form ( y r , σ k ( ‘ )) where k ∈ { , . . . , L, a, b } and 1 ≤ ‘ ≤ L − PQ-Tree Search , each x ji , y r and σ k ( ‘ ) appear in exactly one pair.Considering the mappings of the children of a node x i , they must be the following: ( x si , σ k ( ‘ )),( x ai , σ a ( ‘ + 1)), ( x bi , σ b ( ‘ + 2)) and ( x fi , σ k +1 ( ‘ + 3)). To see this, observe that a node x ai mustbe mapped to σ a , because it is the only letter by which it can be substituted under h . Inthe same way, a node x bi must be mapped to σ b . Because d T = 0, d S = 0 and due to theproperties of a Q-node, once x si is mapped to the letter in index ‘ (i.e. ( x si , σ ( ‘ )) ∈ M ), x ai must be mapped to the letter in index ‘ + 1 or in index ‘ − x si is mapped), then x bi must be mapped to the letter in index ‘ + 2 or ‘ − x fi to ‘ + 3 or ‘ −
3, respectively. Since σ a is always the letter preceding σ b in S , x bi must be mapped to an index larger by one than the index mapped to x ai . Hence,the children of the Q-node x i are mapped from left to right.Now, let us derive a solution for the original JISP3 instance from the solution to
PQ-
Tree Search . For every 3-tuple of color 1 ≤ i ≤ γ , where ( x si , σ k ( ‘ )) ∈ M , choose theinterval I ik = [ k, k + 1] from the 3-tuple of color i . For example, if a part of the solutionfor the PQ-Tree Search instance in Figure S1b is { ( x s , σ (1)) , ( x a , σ a (2)) , ( x b , σ b (3)) , ( x f , σ (4)) } ⊂ M , then I is the interval chosen for the first color (blue) in the derivedsolution for the JISP3 instance in Figure S1a. Note that I k is indeed one of the intervalsof color i , due to the definition of h , h ( x si , σ k ) = T rue and h ( x fi , σ k +1 ) = T rue if and onlyif there is an interval of color i starting at k and ending at k + 1. Thanks to M being aone-to-one mapping, the intervals do not intersect, and for every color there is only oneinterval chosen. Second Direction.
Let us prove that if there is a solution for the original instance of
JISP3 J , then there is a solution for ( T, S, h, d T , d S ). Let I = { I j , ..., I γj γ } be a solution of J suchthat I ij i = [ s ij i , f ij i ] is the interval chosen for the 3-tuple of color i . First, the solutionfor the PQ-Tree Search instance (
T, S, h, d T , d S ) is constructed. For every 1 ≤ i ≤ γ ,insert the following pairs into M : ( x si , σ s iji (3 s ij i − x ai , σ a (3 s ij i − x bi , σ b (3 s ij i )), and( x fi , σ f iji (3 f ij i − I is the interval chosen from the second (red) 3-tuple inthe solution of the JISP3 instance in Figure S1a, then the solution for the
PQ-Tree Search instance in Figure S1b includes the pairs { ( x s , σ (4)) , ( x a , σ a (5)) , ( x b , σ b (6)) , ( x f , σ (7)) } .Observe that only one pair was inserted for every leaf of T , and since no two intervalsintersect, every index of S appears in only one pair in M . Hence, a one-to-one mappingbetween 4 γ leafs of T and 4 γ indices of S was defined, and 3 L − γ − M in order to construct a solution for the PQ-Tree Search instance.According to h , every node y r (1 ≤ r ≤ L − − γ ) can be mapped to every letter σ k , soarbitrarily insert the pairs ( y r , σ k r ( ‘ r )) to M , such that no index or node appear in morethan one pair. (It can be done because there are 3 L − γ − y nodes and after mappingthe 4 children of every one of the γ x i nodes, 3 L − γ − S are left without amapping). Thus, a one-to-one mapping M between all the leafs of T and all the indices of S (i.e. no deletions from S and T ) was defined, and it is left to prove that S can be derivedfrom T under M .The children of a Q-node x i from left to right are: x si , x ai , x bi , x fi , and so, because d T = 0and d S = 0 (no deletions from both tree and string), they have to be mapped to consecutiveindices of S ; this is indeed the case according to our definition of M . The mapping of every y r is obviously also legal. Finally, root is a P-node, so its children can be arranged in anyorder, and they are. This completes the proof of correctness of the reduction. (cid:74) This concludes the proof of Theorem 9.
The Importance of σ a and σ b in the Reduction. In the reduction from
JISP3 to PQ-TreeSearch the string S was defined such that there is a character σ i for every 1 ≤ i ≤ L , and be-tween every two such characters there is the sequence σ a σ b , i.e. S = σ σ a σ b σ σ a σ b . . . σ a σ b σ L .In addition the PQ-tree T and the substitution function h were defined such that for everycolor i (1 ≤ i ≤ γ ), there are the leafs x ai and x bi in T and h returns T rue for both ( x ai , σ a )and ( x bi , σ b ). For abbreviation these leafs, the multiple appearances of σ a σ b in S and theallowed substitutions between them are named ab addition . Here we explain why the abaddition is important.The necessity arises when considering the first direction of the reduction, i.e. if there existsa solution to the output instance of PQ-Tree Search of the reduction (
T, S, h, d T , d S ),then there is a solution to the JISP3 instance J . Consider the partial instance of JISP3 . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 23 I I I I . . . (a) Partial
JISP3
Instance (b)
Partial
PQ-Tree Search
Instance
Figure S2
An example instance resulting from a reduction without the ab addition . in Figure S2a. Note that it does not have a solution. Applying a reduction similar tothe one defined above but without ab addition , results in the PQ-Tree Search instance(
T, S, h, d T , d S ) in Figure S2b. The mapping lines in bold in Figure S2b are a solution forthat instance.This contradiction arises because Q-node children can also be ordered from right to left.With ab addition a PQ-Tree Search instance is created for which only a left-to-rightordering of the children of a Q-node x i can be a part of a possible solution. The definition of h dictates that in M every x ai will be mapped to a σ a ( j i ) and every x bi will be mapped to a σ b ( ‘ i ). Because there are no deletions allowed and because of the possible reordering of thechildren of a Q-node, either ‘ i = j i + 1 (left-to-right) or ‘ i = j i − S thecharacter σ a is always to the left of σ b , hence there are no indices j, ‘ such that ‘ = j − S [ j ] = a and S [ ‘ ] = b . So, for every 1 ≤ i ≤ γ , ‘ i = j i + 1. This means that the children of aQ-node x i are ordered form left to right as needed. B The Length of the Derived String
Given a node x and the numbers of deletions, k T and k S , the length of S , the string derivedfrom T ( x ), can be calculated. If there were no deletions, the length of S is equal to the spanof x , because every leaf of T ( x ) is mapped to exactly one character of S (see Figure S3a).Consider the case in which there is one deletion from the tree (Figure S3b). Every oneof the leaves in T ( x ) is mapped to one character of S except for the deleted leaf which isnot mapped to any character. So, in this case the derivation is to a substring of length span ( x ) −
1. In general, if there are k T deletions from the tree (and none from the string),then the length of the substring derived from T ( x ) is span ( x ) − k T . Now, consider the casein which there is one deletion from the string (Figure S3c). There are span ( x ) charactersof S that are mapped to the leaves of T ( x ). One more character is a part of the derivedsubstring, but it is not mapped to any of its leaves. So, in this case T ( x ) is derived to asubstring of length span ( x ) + 1. In general, if there are k S deletions from the string (and (a) No deletions derives a string of length3 which is equal to span ( x ). (b) One deletion from the tree ( x ) de-rives a string of length 2 which is equalto span ( x ) − (c) One deletion from the string ( σ )derives a string of length 4 which is equalto span ( x ) + 1. (d) One deletion from the string ( σ ) andone from the tree ( x ) derives a stringof length 3 which is equal to span ( x ). Figure S3
An example of the effect the number of deletion from the tree and string have onthe length of the derived string. In this example the node, x , has a span of 3 and the one-to-onemapping between the children of x and the characters of the string are denoted by dotted lines. none from the tree), the length of the substring derived from T ( x ) is span ( x ) + k S . Thus,the definition of the length function L ( x, k T , k S ) . = span ( x ) − k T + k S . C The Initialization of the DP Table in the P-Mapping Algorithm
The P-mapping algorithm initializes P using the following two rules: If L ( C, k T , k S ) = 0 and k S = 0, then P [ C, k T , k S ] = 0. If C = ∅ and k T = 0, then P [ C, k T , k S ] = 0.The first rule refers to a case where L ( C, k T , k S ) = 0, which means that the derived substringis the empty string and thus no character can be deleted from it; hence, k S must equal 0(and any other value of k S is invalid). From the definition of L ( C, k T , k S ), if L ( C, k T ,
0) = 0,then k T = P x ∈ C span ( x ), i.e. all nodes x ∈ C are deleted. So, the score in P [ C, k T , k S ] is 0.The second rule refers to a case where C = ∅ , i.e. all the children of x are ignored. Similarlyto the first rule, a value for k T other than 0 is invalid, and will have a −∞ value. From thedefinition of L , if k T = 0, then L ( ∅ , , k S ) = k S , so all characters from the substring aredeleted, and the score is 0. D Deleting a Child of a P-Node
In the P-mapping algorithm (Section 3.3) it was claimed that there is no need to add to therecursion rule (Equation (2)) a third case for the deletion of a child of the input node, x ,because that case is captured in the initialization rules. In the following example it is shownthat the first initialization rule (given in Appendix C) is enough to enable the algorithm to findthe best derivation even if it includes a node deletion, and that adding the option of deleting anode in the recursion rule is therefore redundant. Consider the P-node x in Figure S4, which . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 25 Figure S4
A P-node x with three leaf children x , x and x . The only derivations of the childrenof x that have a score different than −∞ are depicted in dashed lines. has three leaf children ( x , x , x ). Assume the only derivations of the children of x thathave a score different than −∞ are the derivation µ of x to S [2 : 2] with no deletions, andthe derivation µ of x to S [1 : 1] with no deletions. Clearly, the best derivation of x to S isthe derivation that deletes x and maps x and x to S [2] and S [1], respectively (denotedby dotted lines in Figure S4). At the end of the algorithm it is expected that this derivationcan be found in the DP table entry P [ { x , x , x } , , P [ { x , x , x } , , P [ { x , x , x } , ,
0] isachieved by choosing to keep x : P [ { x , x , x } , ,
0] = P [ { x , x } , ,
0] + µ .score . Nowlet us compute the best score for P [ { x , x } , , x : P [ { x , x } , ,
0] = P [ { x } , ,
0] + µ .score . To construct the derivation x needs to be deleted.This deletion adds 0 to the score, and indeed, from the first initialization rule P [ { x } , ,
0] = 0.Note that at this point it is possible to delete x because span ( x ) = 1 = k T . Thus, wereceive the score of the computed derivation is µ .score + µ .score + 0, as required. E A Comparison with the Naïve Solution
In this section a naïve, alternative, algorithm for the
PQ-Tree Search problem is describedand its time complexity is analyzed. It is shown that the time complexity of our algorithm issubstantially smaller than that of the naïve algorithm.Solving the
PQ-Tree Search problem requires a search for a one-to-one mapping thatyields a derivation of a PQ-tree T to a substring of the input string S . That is, a substring S of S , such that the deletion of up to d S characters from S and the substitution of someof its characters yields a new string S ∈ C d T ( T ) (see Definition 2). Hence, a naïve way tosolve the problem is to go over every string in C d T ( T ) and try to find an alignment betweenit and every substring of S , when only d S deletions are allowed from S . Equivalently, it ispossible to search for an alignment between every substring of S and every string S T ∈ C ( T )with up to d S deletions from S and up to d T deletions from S T .Naturally, sequence alignment can be used, but in order to bound the number of deletions,the basic algorithm needs to be modified. The usual 2-dimensional DP-table needs to beextended with two additional dimensions that correspond to the numbers of deletions from S and S T . This way, when filling the table, the best scoring alignment considered so far forevery deletion numbers combination can be stored. At the end of the algorithm, the score ofthe best alignment is the maximum between the entries of the DP-table corresponding toan alignment between S T and a prefix of S that has a length between m − d T and m + d S .Thus, the outline of the naïve algorithm is as follows. For every string S T ∈ C ( T ) and everypossible start index i , preform sequence alignment with a bounded number of deletions.Then, find the index i that resulted in the highest scoring alignment.The size of the DP table is O ( m ( m + d S ) d T d S ), but in the first two dimensions only a diagonal with a width of O ( d T + d S ) entries needs to be computed. The computation of eachentry takes O (1) time and finding the best alignment takes O ( d T d S ). Thus, every run of thesequence alignment with a bounded number of deletions and a specific start index i takes O ( m ( d T + d S ) d T d S ) time. As seen in Section 3.1, there are O ( n ) possible values for i .Finally, let us bound the number of strings in C ( T ) which is equal to the number ofPQ-trees that are equivalent to T . By definition, every legal permutation of the childrenof an internal node of T results in a new PQ-tree T that is equivalent to T , i.e. T ≡ T (equivalence and not quasi-equivalence is used here because C ( T ) is considered, i.e. thereare no deletions from the tree). The children of a Q-node can be permuted only in one oftwo ways (left-to-right or right-to-left) and the children of a P-node can be arranged in anyorder. So, for an internal node x of T , for every string resulting from the rearrangement ofthe children of all the other nodes in T , x contributes 2 strings to C ( T ) if it is a Q-node,and γ ! strings if it is a P-node. Thus, | C ( T ) | = O (2 m q ( γ !) m p ). In total, the naïve solutionfor PQ-Tree Search takes O (2 m q ( γ !) m p nm ( d T + d S ) d T d S ) time.Both algorithms have a factor of O ( nd T d S ), so it can be ignored and more concise time com-plexities can be compared: the naïve O (2 m q ( γ !) m p m ( d T + d S )) versus our O ( d T d S ( m p γ γ + m q γ )). In both algorithms the non-polynomial factors in the time complexity are dependenton the number of P-nodes and the number of Q-nodes, so let us consider two complementarycases. First, assume there are only P-nodes in the PQ-tree (i.e. m = m p ). In this case,the naïve algorithm has a ( γ !) m p = [2 O ( γ log γ ) ] m factor, which is super-exponential in γ ,and even worse, exponential in m , while our algorithm has a m p γ γ = mγ γ factor whichis exponential only in γ , and in particular polynomial for any γ that is constant (or evenlogarithmic in the input size). Second, assume there are only Q-nodes in the PQ-tree (i.e. m = m q ). In this case, the naïve algorithm has a 2 m q = 2 m factor, which is exponential andour algorithm has a m q γ = mγ factor, which is polynomial. F Q-Node Mapping
In this section we describe the Q-mapping algorithm called by the main algorithm describedin Section 3.1
Objective.
As already mentioned in Section 3, the Q-mapping algorithm receives thefollowing as input. An internal node x that is a Q-node and has γ children: x , . . . , x γ . A string S (which is a substring of the original S ). A collection of derivations D of the children of x to substrings of S . The derivations aregrouped by their root nodes µ.v , and ordered by their end points, µ.e . The maximum number of deletions from the tree and string, d T and d S , respectively.The output of the algorithm is the set S k T ≤ d T S k S ≤ d S max µ ∈D ( x,k T ,k S ) µ.score , whichis a set of derivations of x to prefixes of S . The set holds the best scoring derivation forevery possible deletion number combination k T , k S where 0 ≤ k T ≤ d T and 0 ≤ k S ≤ d S .The set is ordered by the end points of the derivations and it is of size O ( d T d S ). Note thatthe input and output of this algorithm is the same as the input and output of the P-mappingalgorithm (Section 3.3), except for the type of the node received as input.As a start, a solution assuming that the children of the Q-node x can only be arranged ina left-to-right order is demonstrated. The fact that they can also be arranged in a right-to-leftorder is addressed at the end of this section. The Q-mapping algorithm is a DP algorithm . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 27 that uses a 3-dimensional DP table, Q . The pseudocode of our algorithm can be found inAlgorithm 1 ahead. Notations.
At different stages of the algorithm the node x is considered as if it has only itsfirst i children. For an index i such that 1 ≤ i ≤ γ (namely, i is an index of a child of x ),two definitions are given. The first, x [ i ] , denotes the set of the first i children of x . Formally, x [ i ] . = { x , . . . , x i } . The second, x ( i ) , denotes the node x considering only its children in x [ i ] . Consequentially, the span of x ( i ) is defined as P ij =1 span ( x j ) and the set D ( x ( i ) , k T , k S )(in Definition 5 where U = { x ( i ) } ) now refers to a set of partial derivations. To use x ( i ) todescribe the base cases of our algorithm, let us define x (0) ( x ( i ) for i = 0) as a tree with nolabeled leaves to map. The DP Table.
The purpose of an entry Q [ i, k T , k S ] is to hold the score of the best partialderivation of x ( i ) to a prefix of S with exactly k T deletions from the tree and exactly k S deletions from the string. Namely, only the first i children of x , x , . . . , x i , are consideredand the rest are ignored. The other children of x , that is children ( x ) \ x [ i ] , are accounted forin the computation of other entries of Q . Formally, Q [ i, k T , k S ] = max µ ∈D ( x ( i ) ,k T ,k S ) µ.score .Similarly to the main algorithm (Section 3.1) and the P-mapping algorithm (Section 3.3),some of the entries of the DP table are invalid, and their value is defined as −∞ throughoutthe algorithm. Here we give a more detailed description of these entries for all three algorithmsand their DP tables. For a given DP table, the invalid entries are the ones that their indicesdefine an illegal derivation. Namely, derivations that have more deletions from the tree thanthere are leaves in the subtree of T rooted in the derived node, derivations that have moredeletions from the string than there are characters in the derived string, derivations thatderive a string that by definition ends in an index larger than the end index of the inputstring, or derivations that by definition derive a string with a negative length. Thus, an entry Q [ i, k T , k S ] is invalid if one of the following is true: k T > P c ∈ x [ i ] span ( c ), k S > L ( x [ i ] , k T , k S ), L ( x [ i ] , k T , k S ) > len ( S ), or L ( x [ i ] , k T , k S ) <
0. Similarly, an entry P [ C, k T , k S ] is invalid ifone of the following is true: k T > P c ∈ C span ( c ), k S > L ( C, k T , k S ), L ( C, k T , k S ) > len ( S ),or L ( C, k T , k S ) <
0. Lastly, an entry A [ j, i, k T , k S ] is invalid if one of the following is true: k T > span ( x j ), k S > L ( j, i, k T , k S ), E ( j, i, k S , k T ) > n , or L ( j, i, k T , k S ) < Filling the DP Table.
The algorithm initializes Q as follows. For every 0 ≤ k S ≤ len ( S ), Q [0 , , k S ] = 0. These entries of the DP table capture the cases in which a prefix of S oflength L ( ∅ , , k S ) = k S is derived, i.e. there are no leaves to map. Thus, all the charactersin S [1 : k S ] must be deleted under this partial derivation. This is possible because theallowed number of deletions from the string is exactly the number of characters in the derivedsubstring. Note that k S ≤ len ( S ) because otherwise Q [0 , , k S ] is an invalid entry and itsvalue should remain −∞ .Afterwards, the remaining entries of Q are calculated using the recursion rule in Equa-tion (5) ahead. The order of computation is ascending with respect to i (i.e. i = 1 , . . . , γ ),for a given i , the order of computation is ascending with respect to the number of deletionsfrom the string (i.e. k S = 0 , , . . . , d S ), and for a given i and k S , the order of computationdoes not matter. Nonetheless, an ascending order with respect to the number of deletions from the tree (i.e. k T = 0 , , . . . , d T ) is set. Q [ i, k T , k S ] = max Q [ i, k T , k S − Q [ i − , k T − span ( x i ) , k S ]max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i Q [ i − , k T − µ.del T , k S − µ.del S ] + µ.score (5)The intuition behind Equation (5) is that given a partial derivation, µ ∈ D ( x ( i ) , k T , k S ), oneof the three cases of the rule must be true. The end point of µ is E I ( x [ i ] , k T , k S ), and thus,by definition, S [ E I ( x [ i ] , k T , k S )] is either deleted under µ (the first case) or it is mappedunder µ (the third case). The partial derivation µ does not ignore x i , so either it is keptunder µ (the third case), or it is deleted under µ (the second case).In the first case, S [ E I ( x [ i ] , k T , k S )] is deleted under µ . Removing the deletion of S [ E I ( x [ i ] , k T , k S )] from µ (formally defined in Definition 12) results in a partial deriva-tion, µ , that considers the same subset of children of x with the same number of deletionsfrom the tree and one less deletion from the string. That is, µ ∈ D ( x ( i ) , k T , k S − µ is in Q [ i, k T , k S − x i is deleted under µ . Removing the deletion of x i from µ (formallydefined in Definition 14) results in a partial derivation of x ( i − with k T − span ( x i ) deletionsfrom the tree. That is, a derivation in D ( x [ i − , k T − span ( x i ) , k S ), and the score of thebest one is in Q [ i − , k T − span ( x i ) , k S ]. In the third case there is a derivation, µ , ofone of the children of x ( i ) such that S [ E I ( x [ i ] , k T , k S )] is mapped under it. Because x i iskept under µ (in derivation this entry holds) and it is the last child of x ( i ) , then µ is aderivation of x i (i.e. µ .v = x i ). Otherwise, the ordering of the children of the Q-node x is illegal. The score of µ in this case is equal to the score of µ plus the score of a partialderivation of x ( i − with k T − µ .del T and k S − µ .del S deletions from the tree and string,respectively. The best score of a partial derivation in D ( x ( i − , k T − µ .del T , k S − µ .del S ) isin Q [ i − , k T − µ.del T , k S − µ.del S ]. Thus we try to find the derivation, µ , of x i with theend point E I ( x [ i ] , k T , k S ) that maximises µ .score + Q [ i − , k T − µ .del T , k S − µ .del S ]. Finding the Solution.
Our goal is to find a derivation of x according to all of its children,hence once the entire DP table is filled our solution is in Q [ γ, · , · ]. The best derivation forevery deletion combination should be ordered with respect to the end point of the derivation.Here we explain how this can be done by simply traversing Q in a predefined order andwithout any further calculation. First, note that there could be more than one derivationper end point. For example, the second smallest end point (the end point of the secondshortest substring derived from x ) is generated by the deletion combination ( d T , e = E I ( γ, d T ,
1) = P γk =1 span ( x k ) − d T + 1 = span ( x ) − d T + 1. The deletion combination( d T − ,
0) also yields e , E I ( γ, d T − ,
0) = P γk =1 span ( x k ) − ( d T − span ( x ) − d T +1 = e .In fact, only the smallest and largest end points have just one derivation each. The deletioncombination ( d T ,
0) yields the smallest end point and (0 , d S ) yields the largest. Thus, giventhe ( d T + 1) × ( d S + 1) sized table Q [ γ, · , · ], the ordered list of derivations can be generatedby traversing the table in the order specified in Table S1. A Second Ordering of the Children.
Previously an algorithm to find a one-to-one mappingfor a tree rooted in a Q-node assuming its children can only be arranged in a left-to-right orderwas described. To consider also a right-to-left arrangement, the following minor modificationto the algorithm is required. Run the first two parts of the algorithm described above twice, . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 29 k T k S e e e e e e e e e e e e Table S1
An example of the ordering between the end points induced by the different deletioncombinations. In this example, d T = 3, d S = 2 and the end points are indexed from first to last( e i = 1 + e i − for 2 ≤ i ≤ Algorithm 1
Q-Mapping
Input: x, S , D , d T , d S Output:
The best derivation of x to every prefix of S γ ← | children ( x ) | ; build Q with dimensions γ + 1 × d T + 1 × d S + 1; for k S = 0 to len ( S ) do //initialization Q [0 , , k S ] ← end for i = 1 to γ do for k S = 0 to d S do for k T = 0 to d T do compute Q [ i, k T , k S ] according to Equation (5) ; end end end return Q [ γ, · , · ] ;each run fills a different DP table. The first run of the algorithm will receive the children of x from left to right (i.e. x is the leftmost child of x and x γ is the rightmost child), and willproduce a DP table, Q ‘ , holding the best scores of the partial derivations of x that order itschildren from left to right. The second run will receive the children of x from right-to-left (i.e. x is the rightmost child of x and x γ is the leftmost child), and will produce a DP table, Q r ,holding the best scores of the partial derivations of x that order its children from right to left.To find the solution, go over both DP tables as described above (Table S1), but for everydeletion combination k T , k S , return the maximum between Q ‘ [ γ, kT, kS ] and Q r [ γ, kT, kS ]. Time and Space Complexity.
The DP table is the most space consuming data structurein the described algorithm. Its dimensions are γ + 1 × d T + 1 × d S + 1, and the algorithmuses two of them. The computation of an entry of the DP table, Q [ i, k T , k S ], includes two O (1) calculations (the first and second cases of the recursion rule) and going over everyderivation of x i in D ≤ ( x [ i ] , k T , k S ). All those derivations have the same root and the sameend point, but a different number of deletions. In fact, there are no two derivations of x i in D ≤ ( x [ i ] , k T , k S ) that have the same deletion combination ( k T , k S ). Hence, the numberof such derivations is equal to the number of deletion combinations, k T · k S , and so thecalculation of an entry of the DP table takes O ( k T k S ) = O ( d T d S ) time. Thus, the time complexity of the algorithm is O ( γd T d S ).In the previous paragraph the calculation of E I for every entry, which yields the relevantend point for the entry, was ignored. The most time consuming part of that calculation isthe summation of spans ( P ik =1 span ( x k )) which takes O ( γ ) time. To prevent the wastefulrepetition, these summations are calculated once and then saved in a table of size γ . Thesesummations are calculated twice - once for each possible children ordering (left-to-right andright-to-left). This is negligible with respect to the time it takes to fill the DP table. G Penalizing Deletions
To assign deletions a penalization cost (and not only limit them), the algorithm shouldreceive as input a deletion penalty function, δ : Σ T ∪ Σ S → R . The function defines thepenalty of deleting a character from S or a leaf from T according to its label. Then, let usexpand δ , and define the deletion penalty of a node x in T as the summation of the deletionpenalty of all the leaves in the subtree rooted in x . Thus, the set of nodes in T is denotedby T.nodes , and a new function ∆ :
T.nodes ∪ Σ S → R is defined in Equation (6) below.Note that the ∆ function can be calculated in advance, by going over T in postorder. Thiscalculation takes O ( m ) = O ( m ) time.∆( x ) = δ ( x ) , if x ∈ Σ S X ‘ ∈ leaves ( x ) δ ( label ( ‘ )) , if x ∈ T.nodes (6)In addition, the following changes to the main algorithm and to the P-mapping and Q-mapping algorithms are needed. First, the initialization of the main DP table A shouldchange and add to the score of every leaf entry (i.e. A [ j, i, k T , k S ] such that x j is a leaf) thecost of the deleted nodes and characters. Namely, in Algorithm 2 (given in Appendix H.1)lines 8-9 should be replaced with Equation (7) below. Second, the ∆ function in Equation (6)should be sent from the main algorithm to the Q-mapping and P-mapping algorithms. A [ j, i, , k S ] ← − ∆( x j ) − i + k S − X ‘ = i ∆( S [ ‘ ]) A [ j, i, , k S ] ← max i = i,...,i + d S h ( j, i ) − i + k S X ‘ = i ∆( S [ ‘ ]) + ∆( S [ i ]) (7)Third, the initialization of the DP table, P , and the recursion rule of the P-mappingalgorithm need to change. The first initialization rule, where L ( C, k T , k S ) = 0 and k S = 0,depicts the case in which every node in C is deleted, hence, the rule should be P [ C, k T , k S ] = − P c ∈ C ∆( c ). The second rule, where C = ∅ and k T = 0, concerns the case in which everycharacter in S [1 : k S ] is deleted, so the rule should be P [ C, k T , k S ] = − P k S i =1 ∆( S [ i ]). Therecursion rule should be changed to the one in Equation (8). Note that the change is only inthe first case where the cost of deleting the i th character of S is subtracted from the score. P [ C, k T , k S ] = max P [ C, k T , k S − − ∆( S [ i ])max µ ∈D ≤ ( C,k T ,k S ) P [ C \ { µ.v } , k T − µ.del T , k S − µ.del S ] + µ.score (8)Lastly, in the Q-mapping algorithm the initialization of the DP table, Q , and therecursion rule also need to change. In the initialization, for every 0 ≤ k S ≤ d S , Q [0 , , k S ] = . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 31 − P k S i =0 ∆( S [ i ]). That is because this is the case in which every character in S [1 : k S ] isdeleted. When filling the DP table, the recursion rule in Equation (9) should be used. Notethe change is in the first and second cases. In the first the cost of deleting the i th characterof S is subtracted from the score, and in the second the score for deleting x i is subtractedfrom the score. Q [ i, k T , k S ] = max Q [ i, k T , k S − − ∆( S [ E I ( i, k T , k S )]) Q [ i − , k T − span ( x i ) , k S ] − ∆( x i )max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i Q [ i − , k T − µ.del T , k S − µ.del S ] + µ.score (9) H Correctness and Runtime Analysis of Our Algorithms
In this section we prove the correctness of the
PQ-Tree Search algorithm (Appendix H.1),the P-mapping algorithm (Appendix H.2) and the Q-mapping algorithm (Appendix H.3),and prove the time complexity of the
PQ-Tree Search algorithm and the P-mappingalgorithm. First, some definitions that are used in the proofs are given.
Addition and Removal of a Derivation.
Given a partial derivation, µ , which derives aninternal node, x , let us define the removal and addition of another derivation η : remove ( µ, η )and add ( µ, η ). Both operations are defined only for a derivation η whose root is a node x ∈ children ( x ). (cid:73) Definition 10.
The operation remove ( µ, η ) is defined only if η is the derivation of η.v under µ and if at least one among η.e = µ.e or η.s = µ.s is true. The operation returnsa new partial derivation µ of µ.v that ignores the subtree of T rooted in the child node η.v . If η.e = µ.e , then µ derives the string S [ µ.s : η.s − , and if η.s = µ.s , then µ derives the string S [ η.e + 1 : µ.e ] . In any case the number of deletions from the tree is µ .del T = µ.del T − η.del T and from the string it is µ .del S = µ.del S − η.del S . Furthermore, µ.o \ η.o is the one-to-one mapping that yields µ . (cid:73) Definition 11.
The operation add ( µ, η ) is defined only if either η.s = µ.e +1 or η.e = µ.s − is true and if the node η.v is ignored under µ . The operation returns a new partial derivation µ of µ.v . The derivation of η.v under µ is η , and the mapping or deletion of every other leafor character in the string is defined the same as it was in µ . Consequentially, if η.s = µ.e + 1 ,then µ derives the string S [ µ.s : η.e ] , and if η.e = µ.s − , then µ derives the string S [ η.s : µ.e ] . Furthermore, µ .del T = µ.del T + η.del T , µ .del S = µ.del S + η.del S and theone-to-one mapping that yields µ is µ.o ∪ η.o . Addition and Removal of a Deleted Character.
Given a partial derivation µ , which derivesa string S , and an index i of S let us define the removal and addition of a deleted character: removeDel ( µ, i ) and addDel ( µ, i ). (cid:73) Definition 12.
The operation removeDel ( µ, i ) is defined only if i = µ.e or i = µ.s , and if S [ i ] is deleted under µ . The operation returns a partial derivation µ with µ.del S − deletionsfrom the string. If i = µ.e , then µ derives the string S [ µ.s, µ.e − , and if i = µ.s , then µ derives the string S [ µ.s + 1 , µ.e ] . The one-to-one mapping that yields µ is µ.o \ { ( ε, S [ i ]( i )) } . (cid:73) Definition 13.
The operation addDel ( µ, i ) is defined only if i = µ.e + 1 or i = µ.s − .The operation returns a partial derivation µ with µ.del S + 1 deletions from the string. If i = µ.e + 1 , then µ derives the string S [ µ.s, µ.e + 1] , and if i = µ.s − , then µ derives thestring S [ µ.s − , µ.e ] . The one-to-one mapping that yields µ is µ.o ∪ { ( ε, S [ i ]( i )) } . Addition and Removal of a Deleted Node.
Given a partial derivation µ , which derivesa string S , let us define the removal and addition of a deleted node: removeDel ( µ, x ) and addDel ( µ, x ). (cid:73) Definition 14.
The operation removeDel ( µ, x ) is defined only if x is deleted under µ . Theoperation returns a partial derivation µ with µ.del T − span ( x ) deletions from the tree and µ.del S deletions from the string. The derivation µ ignores x and derives the same substringderived by µ . The one-to-one mapping that yields µ is µ.o \ { ( ‘, ε ) : ‘ ∈ leaves ( x ) } . (cid:73) Definition 15.
The operation addDel ( µ, x ) is defined only if x is ignored under µ . Theoperation returns a partial derivation µ with µ.del T + span ( x ) deletions from the tree. Thesubstring derived by µ is equal to the substring derived by µ . The one-to-one mapping thatyields µ is µ.o ∪ { ( ‘, ε ) : ‘ ∈ leaves ( x ) } . H.1 The Main Algorithm
In this section we give the pseudocode of the
PQ-Tree Search algorithm presented inSection 3.1 (Algorithm 2) and prove its correctness. In this proof, the correctness of theQ-mapping algorithm (Appendix F) and of the P-mapping algorithm (Section 3.3) is assumed.Their correctness will be proven in Appendix H.3 and Appendix H.2, respectively.For this proof Definition 16 below is used to represent the set of derivations whose scoremight be in A [ j, i, k T , k S ], similarly to the notation in Definition 5. (cid:73) Definition 16.
The set of all derivations to S [ i, E ( x j , i, k T , k S )] rooted in x j that haveexactly k T deletions from the tree and exactly k S deletions from the string is denoted by D M ( x j , i, k T , k S ) . (cid:73) Lemma 17.
At the end of the algorithm every entry A [ j, i, k T , k S ] of the DP-table A holdsthe highest score of a derivation of S [ i, E ( x j , i, k T , k S )] rooted in x j that has k S deletions fromthe string and k T deletions from the tree, i.e. A [ j, i, k T , k S ] = max µ ∈D M ( x j ,i,k T ,k S ) µ.score Proof.
We prove Lemma 17 by induction on the entries of A in the order described in the al-gorithm. Namely, for two entries A [ j , i , k T , k S ] and A [ j , i , k T , k S ], A [ j , i , k T , k S ] < A [ j , i , k T , k S ] if and only if j < j or both j = j and i < i . If j = j and i = i ,then the order between the entries is chosen arbitrarily. Base Case.
The base case of the algorithm is the initialization of the DP table, where theentries A [ j, i, k T , k S ] for x j ∈ leaves ( root ) and k T ∈ { , } are computed. When k T = 0,there are no deletions from the tree. So, x j must be mapped to some character S [ ‘ ]( i ≤ ‘ ≤ E ( x j , i, , k S )). In this version of the algorithm the deletion of a character does notchange the score of the derivation, so the maximal score of a derivation in D M ( x j , i, , k S ) isthe maximum score of a mapping of x j to some character S [ ‘ ] ( i ≤ ‘ ≤ E ( x j , i, , k S )), whichis the initialization value of the entry A [ j, i, , k S ]. When k T = 1, there is one deletion fromthe tree. The derived subtree T ( x j ) has one leaf, x j , and so it must be the deleted leaf. Allcharacters in the derived string, S [ i : E ( x j , i, , k S )], must also be deleted. Deletions do notadd to the score of the derivation, and so all the derivations in D M ( x j , i, , k S ) have a scoreof 0, which is the initialization value of A [ j, i, , k S ]. . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 33 Algorithm 2
PQ-Tree Search
Input:
T, S, h, d T , d S Output:
The score of the best derivation of T to a substring of S with up to d T and d S deletions from T and S , respectively build A with dimensions m × n × d T + 1 × d S + 1 and initial value −∞ ; for j = 1 to m do //for each node of T in postorder for i = 1 to n do if x j is a Leaf then //initialization for k S = 0 to d S do A [ j, i, , k S ] ← A [ j, i, , k S ] ← max i = i,...,i + k S h ( j, S [ i ]); end end e ← E ( x j , i, , d S ); if x j is a Q-node then A [ j, i, · , · ] ← Q-Mapping( x j , S [ i, e ] , {A [ x j k , i, · , · ] : x j k ∈ children ( x j ) } , d T , d S ); end if x j is a P-node then A [ j, i, · , · ] ← P-Mapping( x j , S [ i, e ] , {A [ x j k , i, · , · ] : x j k ∈ children ( x j ) } , d T , d S ); end end end return max ≤ k T ≤ d T ≤ k S ≤ d S ≤ i ≤ ( n − ( span ( root ) − d T )+1) A [ m , i, k T , k S ] ; Induction Assumption.
Assume that every entry A [ j , i , k T , k S ] such that A [ j , i , k T , k S ] < A [ j, i, k T , k S ] holds the best score of a derivation from D M ( x j , i , k T , k S ). Namely, A [ j , i , k T , k S ] = max µ ∈D M ( x j ,i ,k T ,k S ) µ.score = OP T ( j , i , k T , k S ). Induction Step.
For every internal node x j and possible start index i , the algorithm fillsthe DP table entry A [ j, i, k T , k S ] according to the values returned from the Q-mapping andP-mapping algorithms according to the type of x j . The correctness of these algorithms isproven in Appendix H.3 and Appendix H.2, respectively. Hence, it is only necessary to provethat the input the algorithms expect to receive is sent correctly from the main algorithm.Both the Q-mapping and P-mapping algorithms expect to receive the internal node whichshould be the root of all the output derivations, a substring S of S , the deletion limits d T and d S , and a collection of the best scoring derivations of every child of x to every substring of S with up to d T and d S deletions from the tree and string, respectively. By definition an entryin A [ j, i, · , · ] concerns the derivations of x j with a start point i . The end point of the longestderivation of those derivations is E ( j, i, , d S ). Hence, the internal node sent to the Q-mappingor P-mapping algorithm is x j and the substring S equals S [ i, E ( j, i, , d S )]. The deletionlimits d T and d S are given as input to the main algorithm. Lastly, the best derivations of the Algorithm 3
P-Mapping.
Input: x, S , D , d T , d S Output:
The best derivation of x to every prefix of S γ ← | children ( x ) | ; Build P with dimensions 2 γ × d T + 1 × d S + 1; for size = 0 to γ do foreach C ⊆ children ( x ) s.t. | C | = size do for k S = 0 to d S do for k T = 0 to d T do if ( L ( C, k T , k S ) = 0 and k S = 0 ) or ( size = 0 and k T = 0 ) then //initialization P [ C, k T , k S ] ← else compute P [ C, k T , k S ] according to Equation (2); end end end end end return P [ children ( x ) , · , · ] ;children of x j are stored in A . Because the nodes of T are indexed in postorder, if x c is a childof x j , then c < j . Hence, for every i , k T , k S , it holds that A [ c, i , k T , k S ] < A [ j, i, k T , k S ], andfrom the induction assumption A [ c, i , k T , k S ] = OP T ( c, i , k T , k S ). So, indeed the expectedinput to the Q-mapping and P-mapping algorithms is correct. This completes the proof. (cid:74) H.2 P-Node Mapping
In this section we give the pseudocode of the P-mapping algorithm presented in Section 3.3(Algorithm 3) and prove its correctness. (cid:73)
Lemma 18.
At the end of the algorithm every entry of the DP-table, P [ C, k T , k S ] , holdsthe best score for a derivation of x ( C ) to a prefix of S with k T deletions from the tree and k S deletions from the string, i.e. P [ C, k T , k S ] = max µ ∈D ( x ( C ) ,k T ,k S ) µ.score Proof.
We prove Lemma 18 by induction on the entries of P in the order described inthe algorithm. Namely, for two entries P [ C , k T , k S ] and P [ C , k T , k S ], P [ C , k T , k S ] < P [ C , k T , k S ] if and only if | C | < | C | , or | C | = | C | and k S < k S , or | C | = | C | and k S = k S and k T < k T If C = C , | C | = | C | , k S = k S and k T = k T are all satisfied, then the order betweenthe entries is chosen randomly. Base Cases.
There are two types of base cases, as described in the initialization of the DPtable. L ( C, k T , k S ) = 0 and k S = 0: Let µ be a derivation of x ( C ) with k T and k S deletions. Bydefinition, µ derives an empty string, i.e. there are no characters to map to the leaves of . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 35 the subtrees rooted in the nodes in C . Hence, every child of x that is considered (thenodes in C ) must be deleted under µ . All the nodes in C can be deleted if the sum of theirspans is equal to the allowed number of deletions in µ (that is, k T ). From the definitionof L ( C, k T , k S ) = 0 and the fact that k S = 0, we receive that indeed k T = P c ∈ C span ( c ).Every child node of x that is kept under µ adds to the score of the derivation of x , butthere are none in this case. In addition, every deletion from the subtree T ( x ) adds nothingto the score (in the penalization-free version of the algorithm). Hence, the score of µ must equal 0. C = ∅ and k T = 0: In this case all of the children of x are ignored, so there are no leaves tomap. Hence, every character of the derived string should be deleted. Note that, the derivedstring is S [1 : E I ( C, k T , k S )], and its length is L ( C, k T , k S ) = P c ∈ C span ( c ) − k T + k S = P c ∈∅ span ( c ) − k S = k S . So, the number of deletions from the string in this state isexactly the number needed to delete the derived string. Induction Assumption.
Assume that every table entry P [ C , k T , k S ] such that P [ C , k T , k S ] < P [ C, k T , k S ] holds the best score of a derivation in D ( x ( C ) , k T , k S ). Namely, P [ C , k T , k S ]= max µ ∈D ( x ( C ) ,k T ,k S ) µ.score = OP T ( C , k T , k S ). Induction Step.
Towards the proof of the step, we prove the following Equation (10):
OP T ( C, k T , k S ) = max( OP T ( C, k T , k S − , max µ ∈D ≤ ( C,k T ,k S ) OP T ( C \ { µ.v } , k T − µ.del T , k S − µ.del S ) + µ.score ) (10) ≤ : Let µ ∗ ∈ D ( x ( C ) , k T , k S ) be a derivation such that µ ∗ .score = OP T ( C, k T , k S ). Bydefinition, µ ∗ is a derivation of x ( C ) to the string S [1 : E I ( C, k T , k S )]. In a derivationevery character of the derived string is either deleted or it is a part of a substring derivedfrom one of the children of x . So, either S [ E I ( C, k T , k S )] is deleted under µ ∗ , or it ismapped under some derivation of a child of x , y ∈ C , to a substring S [ i : E I ( C, k T , k S )](for an index 0 < i ≤ E I ( C, k T , k S )).First, if the former is true, then by removing the deletion of S [ E I ( C, k T , k S )] from µ ∗ , removeDel ( µ ∗ , E I ( C, k T , k S )), a derivation µ ∈ D ( x ( C ) , k T , k S −
1) is received. Thederivation µ derives the string S [1 : E I ( C, k T , k S − S [1 : E I ( C, k T , k S )] −
1. So,the following Equation (11) is true. µ ∗ .score = µ .score ≤ OP T ( C, k T , k S − ≤ max( OP T ( C, k T , k S − , max µ ∈D ≤ ( C,k T ,k S ) OP T ( C \ { µ.v } , k T − µ.del T , k S − µ.del S ) + µ.score ) (11)Note that even if there is a penalization cost for deletions, the cost for the dele-tion of S [ E I ( C, k T , k S )] (i.e. − ∆( S [ E I ( C, k T , k S )])) is constant in this setting. So,for two derivations η, η ∈ D ( k T , k S − , x ( C ) ) if η.score ≤ η .score then η.score − ∆( S [ E I ( C, k T , k S )]) ≤ η .score − ∆( S [ E I ( C, k T , k S )]). Hence, the conclusion fromEquation (11) is still true.Second, if the latter is true, then there is a node y ∈ C for which there is a deriva-tion µ y ∈ D such that µ y .e = E I ( C, k T , k S ) and µ.y is the derivation of y under µ ∗ .For µ ∗ to be a legal derivation, µ y must be in D ≤ ( C, k T , k S ). Hence, µ y .score ≤ max µ ∈D ≤ ( C,k T ,k S ) µ.score . Furthermore, by removing µ y from µ ∗ , remove ( µ ∗ , µ.y ), the received partial derivation, µ , is of x ( C \{ y } ) to S [1 : µ y .s −
1] with k T − µ y .del T deletions from the tree and k S − µ y .del S from the string. Thus, µ ∈ D ( k T − µ y .del T , k S − µ y .del S , x ( C \{ y } ) ), andso µ .score ≤ OP T ( x ( C \{ y } ) , k T − µ y .del T , k S − µ y .del S ). Note that, indeed µ y .s =1 + E I ( C \ { y } , k T − µ y .del T , k S − µ y .del S ), as can be seen in the following Equation (12). µ y .s = E I ( C, k T , k S ) − L ( y, µ y .del T , µ y .del S ) + 1= X c ∈ C span ( c ) + k S − k T − ( µ y .del S − µ y .del T + span ( y )) + 1= X c ∈ C \{ y } span ( c ) + k S − µ y .del S − ( k T − µ y .del T ) + 1= E I ( C \ { y } , k T − µ y .del T , k S − µ y .del S ) + 1 (12)By combining our conclusions about µ y and µ together, we receive the following Equa-tion (13). µ ∗ .score = µ .score + µ y .score ≤ OP T ( C \ { y } , k T − µ y .del T , k S − µ y .del S ) + max µ ∈D ≤ ( C,k T ,k S ) µ.score ≤ max µ ∈D ≤ ( C,k T ,k S ) OP T ( C \ { µ.v } , k T − µ.del T , k S − µ.del S ) + µ.score ≤ max( OP T ( C, k T , k S − , max µ ∈D ≤ ( C,k T ,k S ) OP T ( C \ { µ.v } , k T − µ.del T , k S − µ.del S ) + µ.score ) (13) ≥ : Let µ ∗ be a derivation such that Equation (14) holds. µ ∗ .score = max( OP T ( C, k T , k S − , max µ ∈D ≤ ( C,k T ,k S ) OP T ( C \ { µ.v } , k T − µ.del T , k S − µ.del S ) + µ.score )(14)So, either µ ∗ .score = OP T ( C, k T , k S − µ ∗ .score = max µ ∈D ≤ ( C,k T ,k S ) OP T ( C \ { µ.v } ,k T − µ.del T , k S − µ.del S ) + µ.score .First, if the former is true, let η ∈ D ( x ( C ) , k T , k S −
1) be a derivation with η.score = OP T ( C, k T , k S − η derives the substring S [1 : E I ( C, k T , k S − η the deletion of S [ E I ( C, k T , k S )], addDel ( η, E I ( C, k T , k S )), results in a derivation η of x ( C ) to the string S [1 : E I ( C, k T , k S )] with k T deletions from the tree and k S deletions from the string. The string S [1 : E I ( C, k T , k S )] is equal to the concatenationof S [1 : E I ( C, k T , k S − S [ E I ( C, k T , k S )]. So, η ∈ D ( x ( C ) , k T , k S ), and thus η .score ≤ OP T ( C, k T , k S ). The derivation η was constructed such that µ ∗ .score = η .score , so µ ∗ .score ≤ OP T ( C, k T , k S ).Second, if the latter is true, then let η ∗ = arg max µ ∈D ≤ ( C,k T ,k S ) OP T ( C \ { µ.v } , k T − µ.del T , k S − µ.del S ) + µ.score . Adding η ∗ to a partial derivation η ∈ D ( x ( C \{ η ∗ .v } ) , k T − η ∗ .del T , k S − η ∗ .del S ), add ( η, η ∗ ), results in a partial derivation, η , with k T − η ∗ .del T + η ∗ .del T = k T deletions from the tree and k S − η ∗ .del S + η ∗ .del S = k S deletions from thestring, that takes into account the children of x that are in C \ { η ∗ .v } ∪ { η ∗ .v } = C . It is alegal partial derivation since η ∗ derives the node η ∗ .v that is not in C \ { η ∗ .v } to a stringthat does not intersect with the string derived by η . The string that is derived by η is . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 37 S [ η.s : η.e ] and it does not intersect with the string derived by η ∗ ( S [ η ∗ .s : η ∗ .e ]). That isbecause η.e + 1 = η ∗ .s , as can be seen similarly to Equation (12). So, η ∈ D ( x ( C ) , k T , k S ),and thus η .score ≤ OP T ( C, k T , k S ). The partial derivation η was constructed such that µ ∗ .score = η .score , so µ ∗ .score ≤ OP T ( C, k T , k S ).From the induction assumption, P [ C, k T , k S −
1] =
OP T ( C, k T , k S −
1) and for every µ ∈ D ≤ ( C, k T , k S ), P [ C \{ µ.v } , k T − µ.del T , k S − µ.del S ] = OP T ( C \{ µ.v } , k T − µ.del T , k S − µ.del S ). Thus from Equation (10), it follows that P [ C, k T , k S ] = OP T ( C, k T , k S ). Thiscompletes the proof. (cid:74) H.3 Q-Node Mapping
In this section we prove the correctness of the Q-mapping algorithm presented in Appendix F.We prove the algorithm for the case where the children of x can only be arranged in aleft-to-right order. The proof of the right-to-left order is similar. (cid:73) Lemma 19.
At the end of the algorithm every entry of the DP-table Q , Q [ i, k T , k S ] , holdsthe best score of a derivation of x ( i ) and a prefix of S with k T deletions from the tree and k S deletions from the string, i.e. Q [ i, k T , k S ] = max µ ∈D ( x ( i ) ,k T ,k S ) µ.score . Proof.
We prove Lemma 19 by induction on the entries of Q in the order described inthe algorithm. Namely, for two entries Q [ i , k T , k S ] and Q [ i , k T , k S ], Q [ i , k T , k S ] < Q [ i , k T , k S ] if and only if i < i , or i = i and k S < k S , or i = i and k S = k S and k T < k T . Base Case.
The base case is the initialization of the DP table entries Q [0 , , k S ] for0 ≤ k S ≤ len ( S ), with a value of 0. Each of these entries holds the score of some derivation µ of x (0) , i.e. µ is a partial derivation that ignores all nodes in T ( x ). In addition µ derivesthe substring S [1 : L ( x (0) , , k S )] = S [1 : k S ]. Hence, all the characters in S [1 : k S ] mustbe deleted under µ . Each deletion does not add to the score of the derivation and there areno mappings under µ either, so the score of such a derivation is 0. Induction Assumption.
Assume that every table entry Q [ i , k T , k S ] such that Q [ i , k T , k S ] < Q [ i, k T , k S ] holds the best score of a derivation from D ( x ( i ) , k T , k S ). Namely, Q [ i , k T , k S ] =max µ ∈D ( x ( i ) ,k T ,k S ) µ.score = OP T ( i , k T , k S ). Induction Step.
Towards the proof of the step, we prove the following Equation (15):
OP T ( i, k T , k S ) = max( OP T ( i, k T , k S − ,OP T ( i − , k T − span ( x i ) , k S ) , max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score )(15) ≤ : Let µ ∗ ∈ D ( x ( i ) , k T , k S ) be a derivation such that µ ∗ .score = OP T ( i, k T , k S ). Bydefinition, µ ∗ is a derivation of x ( i ) to the string S [1 : E I ( i, k T , k S )]. Under a derivationevery child of the root of the derivation is either deleted or kept and every character ofthe derived string is either deleted or mapped. Thus, x i ∈ children ( x ( i ) ) is either deleted or kept under µ ∗ , and the character S [ E I ( i, k T , k S )] is either deleted or mapped under µ ∗ . Now, let us consider every case.First, consider a case in which x i is deleted under µ ∗ . By removing the deletion of x i from µ ∗ ( removeDel ( µ ∗ , x i )) a partial derivation, µ , that ignores x i is received, therefore theroot of µ is x ( i − . By Definition 14, µ has µ ∗ .del T − span ( x i ) = k T − span ( x i ) deletionsfrom the tree and µ ∗ .del S = k S deletions from the string. Hence, µ ∈ D ( x [ i − , k T − span ( x i ) , k S ) and Equation (16) below is true (remember that a deletion of a node doesnot change the score of a derivation). µ ∗ .score = µ .score ≤ OP T ( i − , k T − span ( x i ) , k S ) ≤ max( OP T ( i, k T , k S − , OP T ( i − , k T − span ( x i ) , k S ) , max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score )(16)Second, consider a case in which S [ E I ( i, k T , k S )] is deleted under µ ∗ . By removing thedeletion of S [ E I ( i, k T , k S )] from µ ∗ , removeDel ( µ ∗ , E I ( i, k T , k S )) (see Definition 12), thepartial derivation received, µ , has k T and k S − x ( i ) . Hence, µ ∈ D ( x ( i ) , k T , k S −
1) and Equation (17) belowis true (remember that a deletion of a character does not change the score of a derivation). µ ∗ .score = µ .score ≤ OP T ( i, k T , k S − ≤ max( OP T ( i, k T , k S − , OP T ( i − , k T − span ( x i ) , k S ) , max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score )(17)Lastly, if neither is true, then S [ E I ( i, k T , k S )] is mapped under µ ∗ and x i is kept under µ ∗ . Let µ i be the derivation of x i under µ ∗ (there is one because x i is kept under µ ∗ ).Because S [ E I ( i, k T , k S )] is mapped, then it is a part of a substring of S [1 : E I ( i, k T , k S )]that is derived by some derivation, µ j , such that µ j is the derivation of the child node µ j .v ∈ children ( x ( i ) ) under µ ∗ . Since x i is the rightmost child of x ( i ) and the childrenof the Q-node x can only be arranged from left to right (in this proof), µ j .v must be x i . Otherwise, the left-to-right ordering is defied. Every child of x can only have onederivation under µ ∗ , so µ i = µ j . Note that µ i must have up to k T and k S deletionsfrom the tree and string, respectively, else µ ∗ is not a legal derivation. In addition, theend point of µ i is E I ( i, k T , k S ). Let µ ∗ i be the highest scoring derivation of x i with upto k T and k S deletions which has the endpoint E I ( i, k T , k S ), i.e µ i .score ≤ µ ∗ i .score .By definition, µ ∗ i ∈ D ≤ ( x ( i ) , k T , k S ), hence µ i .score ≤ max µ ∈D ≤ ( x ( i ) ,k T ,k S ) µ.score . Now,removing µ i from µ ∗ , remove ( µ ∗ , µ i ) (see Definition 10), results in a derivation, µ , with µ ∗ .del T − µ i .del T = k T − mu i .del T deletions from the tree and µ ∗ .del S − µ i .del S = k S − mu i .del S deletions from the string. In addition µ ignores x i , and so its root is x ( i − . Hence, similarly to µ i , µ .score ≤ max µ ∈D ( i − ,k T − mu i .del T ,k S − mu i .del S ) µ.score = OP T ( i − , k T − mu i .del T , k S − mu i .del S ). Putting the conclusions on µ i and µ together . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 39 we receive Equation (18) below. µ ∗ .score = µ i .score + µ .score ≤ max µ ∈D ≤ ( x ( i ) ,k T ,k S ) µ.score + OP T ( i − , k T − µ i .del T , k S − µ i .del S ) ≤ max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score ) ≤ max( OP T ( i, k T , k S − , OP T ( i − , k T − span ( x i ) , k S ) , max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score ) (18)In any case Equation (19) below is true. OP T ( i, k T , k S ) = µ ∗ .score ≤ max( OP T ( i, k T , k S − , OP T ( i − , k T − span ( x i ) , k S ) , max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score ) (19) ≥ : Let µ ∗ be a derivation such that µ ∗ .score = max( OP T ( i, k T , k S − , OP T ( i − , k T − span ( x i ) , k S ) , max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score ). Hence, µ ∗ .score = OP T ( i, k T , k S −
1) or µ ∗ .score = OP T ( i − , k T − span ( x i ) , k S ) or µ ∗ .score =max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score .First, assume µ ∗ .score = OP T ( i, k T , k S − η ∈ D ( x ( i ) , k T , k S −
1) be a deriva-tion with η.score = OP T ( i, k T , k S − η derives the substring S [1 : E I ( x [ i ] , k T , k S − η the deletion of S [ E I ( x [ i ] , k T , k S )]( addDel ( η, E I ( x [ i ] , k T , k S ))) results in a derivation, η that derives x ( i ) to the string S [1 : E I ( x [ i ] , k T , k S )] with k T deletions from the tree and k S deletions from the string.The string S [1 : E I ( x [ i ] , k T , k S )] is equal to the concatenation of S [1 : E I ( x [ i ] , k T , k S − S [ E I ( x [ i ] , k T , k S )]. So, η ∈ D ( x ( i ) , k T , k S ), and thus η .score ≤ OP T ( i, k T , k S ). Wehave thus built η such that µ ∗ .score = η .score , so µ ∗ .score ≤ OP T ( k T , k S , C ).Second, assume µ ∗ .score = OP T ( i − , k T − span ( x i ) , k S ). Let η ∈ D ( x ( i − , k T − span ( x i ) , k S ) be a derivation with η.score = OP T ( i − , k T − span ( x i ) , k S ). By definition, η derives the substring S [1 : E I ( x [ i − , k T − span ( x i ) , k S )]. From Definition 14, addingthe deletion of the node x i to η ( addDel ( η, x i )) results in a derivation η that derives x ( i ) to the string S [1 : E I ( x [ i ] , k T , k S )] with k T deletions from the tree and k S deletions fromthe string. So, η ∈ D ( x ( i ) , k T , k S ), and thus η .score ≤ OP T ( i, k T , k S ). We built η suchthat µ ∗ .score = η .score , so µ ∗ .score ≤ OP T ( k T , k S , C ).Lastly, assume µ ∗ .score = max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score .Let η ∗ be a derivation of x i that is in D ≤ ( x [ i ] , k T , k S ) that by setting η ∗ to µ yields thehighest value for OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score of all the derivations of x i in D ≤ ( x [ i ] , k T , k S ). Formally, η ∗ = arg max µ ∈D ≤ ( x [ i ] ,k T ,k S )s . t . µ.v = x i OP T ( i − , k T − µ.del T , k S − µ.del S ) + µ.score . From Definition 11, adding η ∗ to a partial derivation η ∈ D ( x [ i − , k T − η ∗ .del T , k S − η ∗ .del S ) ( add ( η, η ∗ )) results in a partial derivation, η , with k T − η ∗ .del T + η ∗ .del T = k T deletions from the tree and k S − η ∗ .del S + η ∗ .del S = k S deletions from thestring, that takes into account the children of x that are in x [ i − ∪ { x i } = x [ i ] . It is a legal partial derivation since η ∗ derives the node η ∗ .v that is not in x [ i − to a stringthat does not intersect with the string derived by η . The string that is derived by η is S [ η.s : η.e ] and it does not intersect with the string derived by η ∗ ( S [ η ∗ .s : η ∗ .e ]). Thatis because η.e + 1 = η ∗ .s , as can be seen in Equation (20) below. So, η ∈ D ( x ( i ) , k T , k S ),and thus η .score ≤ OP T ( i, k T , k S ). We built η such that µ ∗ .score = η .score , so µ ∗ .score ≤ OP T ( i, k T , k S ). η ∗ .s = E I ( x [ i ] , k T , k S ) − L ( x i , η ∗ .del T , η ∗ .del S ) + 1= X x j ∈ x [ i ] span ( x j ) + k S − k T − ( η ∗ .del S − η ∗ .del T + span ( x i )) + 1= X x j ∈ x [ i − span ( x j ) + k S − η ∗ .del S − ( k T − η ∗ .del T ) + 1= E I ( x [ i − , k T − η ∗ .del T , k S − η ∗ .del S ) + 1 = η.e + 1 (20)From the induction assumption, Q [ i, k T , k S −
1] =
OP T ( i, k T , k S − Q [ i − , k T − span ( x i ) , k S ] = OP T ( i − , k T − span ( x i ) , k S ) and for every µ ∈ D ( x ( i ) , k T , k S ) such that µ.v = x i , Q [ i − , k T − µ.del T , k S − µ.del S ] = OP T ( i − , k T − µ.del T , k S − µ.del S ). Thus fromEquation (15), it follows that Q [ i, k T , k S ] = OP T ( i, k T , k S ). This completes the proof. (cid:74) H.4 Time and Space Complexity of the PQ-Tree Search Algorithm
Here we prove Lemma 7.
Proof.
The number of leaves in the PQ-tree T is m , hence there are O ( m ) nodes in the tree,i.e the size of the first dimension of the DP table, A , is O ( m ). In the algorithm descriptiona bound for the possible start indices of substrings derived from nodes in T is given. Thenode with the largest span in T is the root which has a span of m . The root is mapped tothe longest substring when there are d S deletions from the string. Hence, the size of thesecond dimension of A is Ω( n − ( m + d S ) + 1) = Ω( n ) (given that d < m << n ). The nodeswith the smallest spans are the leaves, which have a span of 1, hence the size of the seconddimension of A is O ( n ). The third and fourth dimensions of A are of size d T + 1 and d S + 1,respectively. In total, the DP table A is of size O ( d T d S mn ).In the initialization step O ( d T d S mn ) entries of A are computed in O (1) time each. Thisholds because there are m leaves and n possible start indices for strings of length 1. The d T and d S factors come from the initialization of entries with −∞ . The P-mapping algorithm iscalled for every P-node in T and every possible start index i , i.e. the P-mapping algorithmis called O ( nm p ) times. Similarly, the Q-mapping algorithm is called O ( nm q ) times. Thus,it takes O ( n ( m p · Time(P-mapping) + m q · Time(Q-mapping)))) time to fill the DP table.In the final stage of the algorithm (line 21 in Algorithm 2) the maximum over the entriescorresponding to every combination of deletion number and start index (0 ≤ k T ≤ d T ,0 ≤ k S ≤ d S , ≤ i ≤ n − ( span ( x ) − d T ) + 1 } ) is computed. So, it takes O ( d T d S n ) timeto find the maximum score of a derivation. Tracing back through the DP table to find theactual mapping a does not increase the time complexity.In Appendix F it is shown that our Q-mapping algorithm takes O ( γd T d S ) time and O ( d T d S γ ) space. From Lemma 20 our P-mapping algorithm takes O ( γ γ d T d S ) time and O ( d T d S γ ) space. Thus, in total, our algorithm runs in O ( n ( m p · γ γ d T d S + m q · γd T d S )) = O ( nγd T d S ( m p · γ + m q )) time. Adding to the space required for the main DP table thespace required for the P-mapping algorithm (the space needed for the Q-mapping algorithmis insignificant with respect to the P-mapping algorithm) results in a total space complexityof O ( d T d S mn ) + O ( d T d S γ ) = O ( d T d S ( mn + 2 γ )). This completes the proof. (cid:74) . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 41 H.5 Time and Space Complexity of the P-Mapping Algorithm
Here we prove Lemma 20 below. (cid:73)
Lemma 20.
The P-mapping algorithm takes O ( d T d S γ γ ) time and O ( d T d S γ ) space. Proof.
The most space consuming part of the algorithm is the 3-dimensional DP table.The first dimension, C , can be any subset of the set children ( x ), and therefore it is of size2 | children ( x ) | = 2 γ . The sizes of the second and third dimensions (i.e. k T and k S ) are boundedby d T + 1 and d S + 1, respectively. Hence, the space of the DP algorithm is O ( d T d S γ ).The algorithm has three parts: initialization, filling the DP table, and constructing thesolution. The most time consuming calculation required in the initialization is the calculationof L ( C, k T , k S ) in the first rule. It requires summing the spans of all nodes in C . Thiscalculation will also be required in the second part of the algorithm. To avoid the repetitivecalculations, it preformed once for every ( C, k T , k S ) tuple and save the results. This requires O ( d T d S | children ( x ) | ) = O ( d t d S γ ) space (for this is the number of such tuples). Each valueis calculated in O ( | children ( x ) | ) = O ( γ ) time. Hence, the calculation of all the L ( C, k T , k S )values (and thus all the E I ( C, k T , k S ) values) takes O ( d T d S γ γ ) time and O ( d T d S γ ) space.The second step is done by calculating the value of every entry in the O ( d T d S γ ) entries of P , using the recursion rule in Equation (2). The first line among the rule takes O (1) time,since it involves looking in another entry of P and basic computations. The second line of therule involves going over all derivations µ ∈ D ≤ ( C, k T , k S ). Namely, going over all derivationswith a specific end point, which derives a node in C and has no more than a specific numberof deletions from the tree and string (i.e. µ.e = E I ( C, k T , k S ), µ.v ∈ C , µ.del T ≤ k T and µ.del S ≤ k S ). The number of deletions from the tree and string are bounded by d T and d S , respectively, and the number of nodes in C is bounded by the number of children of x , γ . Hence, the time to calculate one entry of P is O ( d T d S γ ). In total, the second part ofthe algorithm takes O ( d T d S γ γ ) time. Finally, to construct the solution the algorithmgoes over every deletion combination k T , k S once, i.e. it takes O ( d T d S ) time. In total, thealgorithm takes O ( d T d S γ γ ) + O ( d T d S γ γ ) + O ( d T d S ) = O ( d T d S γ γ ). (cid:74) I Figures (a) T (b) T (c) T Figure S5
Three different PQ-trees. T can be obtained from T by reversing the children of aQ-node (the left child of the root) and by reordering the children of a P-node (the right child ofthe root), so T ≡ T . T can be obtained from T by deleting one leaf and permuting the childrenof the right child of the root, so T (cid:23) T . Now, T (cid:23) T can be inferred, because the ≡ is anequivalence relation. By the definition of frontier, F ( T ) = ABCDEF G ; F ( T ) = DCBAEGF ; F ( T ) = ABDF EG . (a) The derivation µ applied on T resulting in T : reorder the children of x , delete leaves accordingto M (delete x and x ) and preform smoothing (delete x , the parent node of x and x ). Theroot of T , x , is the node that µ derives, denoted µ.v . Also, µ is a derivation of x . The nodes x , x and x are deleted under µ . The leaves x , x , x , x , x are mapped under µ . The nodes x , x , x are kept under µ . S : σ σ σ σ σ σ σ σ σ S M : ( x , σ (3)) ( ε, σ (4)) ( x , σ (5)) ( x , σ (6)) ( x , σ (7)) ( x , σ (8)) S M : x x x x x (b) The derivation µ on S resulting in S M : apply substitutions and deletions according to M . Thesubstring S = S [3 : 8] is the string that µ derives. The character S [4] is deleted under µ . Thecharacters S [3] , S [5] , S [6] , S [7] , S [8] are mapped under µ . Figure S6
An illustration of the derivation µ from the PQ-tree T to the substring S under theone-to-one mapping M ( µ.o ) with µ.del T = del T ( M ) = 2 deletions from the tree and µ.del S = del S ( M ) = 1 deletions from the string. The start point of the derivation ( µ.s ) is 3. The end point ofthe derivation ( µ.e ) is 8. Notice that that S M = F ( T ) and T (cid:23) T which means that S M ∈ C d T ( T ) . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 43 Figure S7
This figure in continued in the next page. (a)(b)Figure S7 (Cont.) (a)
The plasmid instances of the heavy metal efflux pump gene clusterdiscussed in Section 5.2. The COGs of the query gene cluster are: COG0642, COG0745, COG3639,COG0845, COG1538. The instances were identified using PQFinder and displayed using the graphicalinterface of the tool CSBFinder-S [36]. X indicates a gene with no COG annotation. The image wasedited to display instances of the same genome in separate lines. (b)
The functional description ofthe COGs shown in (a). . R. Zimerman, D. Svetlitsky, M. Zehavi and M. Ziv-Ukelson 45
J Tables
PQ-tree S-score Functional Category1 [[0683 [[0411 0410] [0559 4177]]] 0583] 22.5 5 (2) Amino acid transport2 (1609 [1653 1175 0395] 3839) 10.0 10 (2) Carbohydrate transport3 [[1538 [3696 0845]] [0642 0745]] 7.5 7 (1) Heavy metal efflux4 [[2115 1070] [4213 [1129 4214]]] 7.5 1 (1) Carbohydrate transport5 [1960 [[2011 1135] [2141 1464]]] 7.5 3 (1) Amino acid transport6 [[0596 0599] [[3485 3485] 0015]] 7.5 9 (1) Metabolism7 [[[1129 1172 1172] 1879] 3254] 7.5 6 (1) Carbohydrate transport8 (1609 1869 [[1129 1172] 1879] 0524) 7.5 1 (1) Carbohydrate transport9 (0683 [0559 4177] [0411 0410] 0318) 7.5 1 (1) Amino acid transport10 (3839 0673 [[0395 1175] 1653]) 5.0 10 (1) Carbohydrate transport11 [0583 (0687 3842 [1176 1177])] 5.0 9 (3) Amino acid transport12 [1012 (0687 3842 [1176 1177])] 5.0 8 (1) Amino acid transport13 (0284 0461 [0540 1781] 0543 0044 0167) 3.5 1 (1) Metabolism14 ((2080 1319 1529) 1975 2068) 3.3 6 (1) Energy production and conversion15 [0044 [[0543 0167] 0284]] 3.0 1 (1) Metabolism16 [1802 [1638 [3090 1593]]] 3.0 7 (1) Carbohydrate transport17 [0410 [[4177 0559] 0683]] 3.0 7 (3) Amino acid transport18 [[4770 0511] [1984 2049]] 3.0 4 (2) Metabolism19 [[2875 [1010 2073]] 2243] 3.0 9 (2) Metabolism20 ([1175 0395] 1409 3839 1653) 2.5 5 (2) Carbohydrate transport21 [(2141 0431 0600 0715) 1116] 2.5 2 (2) Inorganic ion transport22 ([0601 1173] 0444 0444 0747) 2.5 10 (1) Amino acid transport23 [0583 (3842 1840 1178)] 2.0 1 (1) Inorganic ion transport24 (1464 2141 [1135 2011]) 2.0 7 (3) Amino acid transport25 ([2009 2142] 0479 1053) 2.0 2 (1) Energy production and conversion26 ([1622 0843] 0109 1845) 2.0 1 (1) Energy production and conversion27 (1024 1960 4770 4799) 1.0 4 (1) Lipid transport28 (1120 0609 0614 1629) 1.0 4 (1) Inorganic ion transport29 (0411 0559 4177 0683 0410 1022) 1.0 3 (1) Amino acid transport
Table S2
PQ-trees for which tree-guided rearrangements were found in plasmids. Square bracketsrepresent a Q-node; round brackets represent a P-node. Numbers indicate the respective COGIDs.2