Chaining fragments in sequences: to sweep or not
Julien Allali, Cédric Chauve, and Laetitia Bourgeade
LaBRI, Université Bordeaux, France; IBP, Université Bordeaux, France; Department of Mathematics, Simon Fraser University, Canada
Abstract.
Computing an optimal chain of fragments is a classical problem in string algorithms, with important applications in computational biology. Two efficient dynamic programming algorithms, based on different principles, solve this problem. In the present note, we show how to combine the principles of these two algorithms in order to design a hybrid dynamic programming algorithm that enjoys the advantages of both.
The need for very efficient pairwise sequence alignment algorithms has motivated the development of methods aimed at breaking the natural quadratic time complexity barrier of dynamic programming alignment algorithms [5]. One of the successful alternative approaches is based on the technique of chaining fragments. Its principle is to first detect and score highly conserved factors, the fragments (also called anchors or seeds), then to compute a maximal-score subset of fragments that are colinear and non-overlapping in both considered sequences, called an optimal chain. This optimal chain is then used as the backbone of an alignment, which is completed in a final stage by aligning the gaps located between consecutive selected fragments. This approach is used in several computational biology applications, such as whole genome comparison [13,1,7], cDNA/EST mapping [12], or identifying regions with conserved synteny.

In the present work we are interested in the problem of computing an optimal chain of fragments, from a given set of k fragments, for two sequences t and u of respective lengths n and m. Due to its applications, especially in computational biology, this problem has received a lot of attention from the algorithmic community [3,4,8,9,10,11,1]. The fragment chaining problem can be solved in O(k + n × m) time by a simple dynamic programming (DP) algorithm (see [9] for example). However, in practical applications, the number k of fragments can be subquadratic, which motivated the design of algorithms whose complexity depends only on k and that run in O(k log k) worst-case time (see [8,4,10,12]). We focus here on the problem of computing the score of an optimal chain, but our algorithm can be complemented by a standard backtracking procedure to compute an actual optimal chain.
The latter algorithms, known as Line Sweep (LS) algorithms, rely on geometric properties of the problem, where fragments can be seen as rectangles in the quarter plane, and on geometric data structures that allow optimal subchains to be retrieved and updated efficiently, i.e. in logarithmic time (see [12] for example).

This raises the natural question of deciding which algorithm to use when comparing two sequences t and u. In particular, it can happen that the density of fragments differs depending on the location of the fragments in the considered sequences, due for example to the presence of repeats. In such cases, it might then be more efficient to rely on the DP algorithm in regions with high fragment density, while in regions of lower fragment density the LS algorithm would be more efficient. This motivates the theoretical question we consider: to design an efficient algorithm that relies on the classical DP principle when the density of fragments is high and switches to the LS principle when processing parts of the sequences with a low density of fragments. We show that this can be achieved, and we describe such a hybrid DP/LS algorithm for computing the score of an optimal chain of fragments between two sequences. We prove that our algorithm achieves a theoretical complexity as good as those of both the DP and LS algorithms, i.e. that for any instance, our algorithm performs at least as well, in terms of theoretical worst-case asymptotic time complexity, as both the DP and the LS algorithms.

In Section 2, we formally introduce the fragment chaining problem and the DP and LS algorithms. In Section 3, we describe our hybrid algorithm and analyze its complexity.
Preliminary definitions and problem statement.
Let t and u be two sequences, of respective lengths n and m. We assume that position indices in sequences start at 0, so t[0] is the first symbol in t and t[n − 1] its last symbol. As usual, by t[i, j] we denote the substring of t composed of the symbols in positions i, i + 1, . . . , j.

A fragment is a factor that is common, possibly up to small variations, to t and u. Formally, a fragment s is defined by 5 elements (s.ℓ, s.r, s.t, s.b, s.s): the first four fields indicate that the corresponding substrings are t[s.ℓ, s.r] and u[s.b, s.t], while the field s.s is a score associated to the fragment. We call borders of s the coordinates (s.ℓ, s.b) and (s.r, s.t). As usual in chaining problems, we see fragments as rectangles in the quarter plane, where the x-axis corresponds to t and the y-axis to u. For a fragment s, s.ℓ, s.r, s.b and s.t denote the left and right positions of s over t and the bottom and top positions of s over u (s.ℓ ≤ s.r and s.b ≤ s.t). See Figure 1 for an example.

Let S denote a set of k fragments for t and u. A chain is a set of fragments {s_1, . . . , s_ℓ} such that s_i.r < s_{i+1}.ℓ and s_i.t < s_{i+1}.b for i = 1, . . . , ℓ − 1; the score of a chain is the sum Σ_{i=1}^{ℓ} s_i.s of the fragments it contains. A chain is optimal if there is no chain with a higher score. The problem we consider in the present work is to compute the score of an optimal chain.

[Figure 1: three fragments s (score 6), s′ (score 4) and s″ (score 1), drawn as rectangles over t (x-axis, borders s.ℓ, s.r) and u (y-axis, borders s.b, s.t).]

Fig. 1.
Example of the fragment chaining problem, with three fragments represented by squares. The possible chains are (s), (s′), (s″), (s, s″) and (s′, s″). The best chain is (s, s″), with a score of 7.

The dynamic programming (DP) algorithm.
We first present a simple dynamic programming (DP) algorithm that computes an n × m dynamic programming table M such that M[i][j] is the score of an optimal chain for the prefixes t[0, i] and u[0, j] (see Algorithm 1). We present here a version that does not instantiate the full n × m DP table, but records only the last filled column, following the classical technique introduced in [6] and used in the space-efficient fragment chaining DP algorithm described in [9]. The difference with Morgenstern's space-efficient DP algorithm [9] is that we still require quadratic space for the data structure L. In terms of computing the score of an optimal chain, the key point is that S[s], if defined, contains the optimal score of a chain having s as its last fragment. The worst-case time complexity of this algorithm is obviously O(k + n × m).

The Line Sweep (LS) algorithm.
We now describe a Line Sweep algorithm for the fragment chaining problem (see Algorithm 2). The main idea is to process fragments according to their order in the sequence t, while maintaining a data structure that records, for each position i in u, the best partial chain found so far using only fragments below position i.

In this algorithm, P stores all fragment borders and S[s], as in the DP algorithm, is the score of an optimal chain among all the chains that end with fragment s. A fragment s is said to have been processed after the entry (s.r, end, s) has been processed through the loop in line 8. A partial chain is a chain composed only of processed fragments.

The data structure A satisfies the following invariant, which is key to the correctness of the algorithm: if (pos, type, s) is the last entry of P that has been processed, then A contains an entry (p, v) if and only if the best chaining score, among partial chains that belong to the rectangle defined by the points (0, 0) and (pos, p), is v and corresponds to a chain ending with a fragment s′ such that s′.t = p. Line 16 ensures this invariant is maintained. This invariant allows the retrieval from A of the score of an optimal partial chain that can be extended by the current

Algorithm 1 The Dynamic Programming algorithm
1  L: an array of n × m linked lists
2  S: an array of k integers
3  M: an array of m integers
4  foreach s in S do
5    front insert (s, end) into L[s.r][s.t]
6    front insert (s, begin) into L[s.ℓ][s.b]
7  for i from 0 to n − 1
8    left = 0
9    leftDown = 0
10   for j from 0 to m − 1
11     maxC = 0
12     foreach (s, type) in L[i][j]
13       if type is begin
14         S[s] = s.s + leftDown
15       if type is end and S[s] > maxC
16         maxC = S[s]
17     leftDown = left
18     left = M[j]
19     M[j] = max(M[j], M[j − 1], maxC)
20 return M[m − 1]

Algorithm 2
The Line Sweep algorithm
1  P: an array of 2k triples (position, type, fragment)
2  A: a set of pairs (position, score)
3  S: an array of k integers
4  foreach s in S do
5    insert (s.ℓ, begin, s) into P
6    insert (s.r, end, s) into P
7  sort P according to the field position, with begin positions appearing before end positions having the same value
8  foreach (pos, type, s) in P
9    if type is begin
10     retrieve from A the pair (p, v) such that p is the highest position strictly less than s.b
11     S[s] = s.s + v
12   if type is end
13     (p, v) = retrieve from A the pair with the highest position less than or equal to s.t
14     if S[s] > v
15       retrieve from A the pair (p′, v′) such that v′ is the highest score less than or equal to S[s]
16       remove from A all entries (p″, v″) such that p < p″ ≤ p′
17       insert (s.t, S[s]) into A
18 (p, v) = last entry of A
19 return v

fragment s, i.e. a chain that ends in u in a position strictly smaller than s.b (line 11). This property follows from the fact that the order in which fragments are processed ensures that all previously processed fragments do not overlap the current fragment in t.

In order to implement this algorithm efficiently, it is fundamental to ensure that, in line 16, the time required to remove c entries (the set of all entries of A whose first field is strictly greater than p and lower than or equal to p′) is O(c log(k)). If A is implemented in a data structure that satisfies this property and supports searches, insertions and deletions in logarithmic time, then the time complexity of the algorithm is O(k log(k)); see [12] for a discussion of such data structures.

We now describe an algorithm that combines both approaches described in the previous section.
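To make the two baselines concrete, here is a minimal Python sketch of both algorithms (an illustration, not the authors' code: a fragment is stored as a tuple (l, r, b, t, score), and in the LS version the structure A is kept as two parallel sorted lists rather than a balanced tree, so removals are not guaranteed to be O(c log k) as required above):

```python
import bisect

def chain_dp(n, m, fragments):
    """O(k + n*m) dynamic programming; only two columns of M are kept."""
    begins, ends = {}, {}
    for idx, (l, r, b, t, sc) in enumerate(fragments):
        begins.setdefault((l, b), []).append(idx)
        ends.setdefault((r, t), []).append(idx)
    S = [0] * len(fragments)
    prev = [0] * m                      # column i - 1 of the DP table
    for i in range(n):
        cur = [0] * m                   # column i
        for j in range(m):
            for idx in begins.get((i, j), []):
                # best chain strictly left of column i and below row j
                S[idx] = fragments[idx][4] + (prev[j - 1] if j > 0 else 0)
            max_c = max((S[idx] for idx in ends.get((i, j), [])), default=0)
            cur[j] = max(prev[j], cur[j - 1] if j > 0 else 0, max_c)
        prev = cur
    return prev[m - 1]

def chain_ls(fragments):
    """Line sweep; A holds strictly increasing positions with strictly
    increasing scores (the non-dominated partial chains)."""
    events = []
    for idx, (l, r, b, t, sc) in enumerate(fragments):
        events.append((l, 0, idx))      # begin events sort before end events
        events.append((r, 1, idx))
    events.sort()
    S = [0] * len(fragments)
    pos, val = [], []                   # the structure A
    for _, kind, idx in events:
        l, r, b, t, sc = fragments[idx]
        if kind == 0:                   # begin: best chain strictly below b
            i = bisect.bisect_left(pos, b)
            S[idx] = sc + (val[i - 1] if i > 0 else 0)
        else:                           # end: insert (t, S[idx]) if useful
            i = bisect.bisect_right(pos, t)
            if i == 0 or val[i - 1] < S[idx]:
                lo = hi = bisect.bisect_left(pos, t)
                while hi < len(pos) and val[hi] <= S[idx]:
                    hi += 1             # these entries are now dominated
                del pos[lo:hi], val[lo:hi]
                pos.insert(lo, t)
                val.insert(lo, S[idx])
    return val[-1] if val else 0

frags = [(0, 3, 0, 3, 6), (2, 5, 1, 4, 4), (6, 8, 6, 8, 1)]  # Fig. 1
print(chain_dp(10, 10, frags), chain_ls(frags))  # 7 7
```

On the instance of Fig. 1 both functions return 7, the score of the chain (s, s″).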
Overview.
We first introduce the notion of a compact instance. An instance of the chaining problem is said to be compact if each position of t and each position of u contains at least one fragment border. If an instance is not compact, then there exists a unique compact instance obtained by removing from t and from u all positions that do not contain a fragment border, leading to sequences t′ and u′, and updating the fragment borders according to the sequences t′ and u′, leading to a set S′ of fragments. From now on, we denote by (t′, u′, S′) the compact instance corresponding to (t, u, S), and by n′ and m′ the lengths of t′ and u′.

Next, we define, for a position p of t, its border density K_p as the number of fragment borders (i.e. the number of fragment extremities) located in t[p]. If P1 is the set of positions in t′ with border density strictly greater than m′/(log m′ − 1), and P2 the remaining n′ − |P1| positions of t′, then the hybrid DP/LS algorithm we describe below has time complexity

O( k + min(k log k, m) + min(k log k, n) + Σ_{p ∈ P1} (m′ + K_p) + log(m′) Σ_{p ∈ P2} K_p ).

Intuitively, our hybrid algorithm works on a compact instance and fills the DP table for this compact instance, deciding for each column of this table (i.e. each position of t′) whether to fill it using the DP equations or the Line Sweep principle, based on its border density.

Compacting an instance.
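Expressed in Python, this compaction amounts to a plain coordinate compression. The sketch below (the tuple layout (l, r, b, t, score) and all names are ours; it relies on sorting, so it corresponds to the sorting-based case discussed below) illustrates it:

```python
def compact(fragments):
    """Build the compact instance (t', u', S'): every t-position and
    u-position carrying at least one fragment border is kept and
    relabeled by its rank; all other positions are dropped."""
    t_coords = sorted({c for (l, r, b, t, s) in fragments for c in (l, r)})
    u_coords = sorted({c for (l, r, b, t, s) in fragments for c in (b, t)})
    t_rank = {c: i for i, c in enumerate(t_coords)}
    u_rank = {c: i for i, c in enumerate(u_coords)}
    compacted = [(t_rank[l], t_rank[r], u_rank[b], u_rank[t], s)
                 for (l, r, b, t, s) in fragments]
    return compacted, len(t_coords), len(u_coords)  # (S', n', m')

frags = [(0, 30, 0, 30, 6), (20, 50, 10, 40, 4), (60, 80, 60, 80, 1)]
print(compact(frags))
```

Ranks preserve the strict order of the borders, so the set of chains and their scores are unchanged; on this hypothetical instance, n′ = m′ = 6.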
We first describe how to compute the compact instance (t′, u′, S′).

Lemma 1.
The compact instance (t′, u′, S′) can be computed in time O(k + min(k log(k), m) + min(k log(k), n)) and space O(k + n + m).

The proof of this lemma is quite straightforward, and we omit the details here for space reasons. Assume we are dealing with t (the same method applies to u).
– If k log(k) ≤ m, then we (1) sort the fragment extremities in t in increasing order of their starting position, (2) cluster together the fragment extremities with the same value, and (3) relabel the coordinates of each fragment extremity by the number of clusters preceding it in the order, plus one.
– If k log k > m, then we (1) detect the positions of t with no fragment extremities, in O(k + m) time, (2) mark them and relabel the positions with non-zero density in O(m) time, and finally (3) relabel the fragment extremities according to the new labels of their positions, in O(k) time.

From now on, we assume that the compact instance has been computed and that it is the considered instance.

DP update vs LS update.
In this section, we introduce our main idea. The principle is to consider fragments in the same order as in the LS algorithm (i.e. through a loop over the indices 0 to n′ − 1, a feature which is common to both the DP and LS algorithms), but to process the fragments whose border in t′ is in position i using either the DP approach, if the density of fragments at t′[i] is high, or the LS approach otherwise. Hence, the key requirements are that
– when using the DP approach, the previous column of the DP table is available;
– when using the LS approach, a data structure with properties similar to those of the data structure A used in the LS algorithm is available.

A hybrid data structure.
We now introduce a data structure B that ensures that the above requirements are satisfied. The data structure B is essentially an array of m′ entries augmented with a balanced binary search tree. Formally:
– We consider an array B of m′ entries, such that B[i] contains chaining scores, and which satisfies the following invariant: if s is the last processed fragment, then for every i = 1, . . . , s.r, B[i] ≥ B[i − 1].
– We augment this array with a balanced binary search tree C whose leaves are the entries of B and whose internal nodes are labeled in order to satisfy the following invariant: a node x is labeled by the maximum of the labels of its left child and its right child.

The data structure B will be used in a similar way to the data structure A of the LS algorithm, i.e. to answer the following queries: given 0 ≤ p ≤ m′, find the optimal score of a partial chain whose last fragment s satisfies s.t ≤ p. This principle is very similar to solutions recently proposed for handling dynamic range minimum query requests [2].

We now describe how we implement this data structure using an array. Let b be the smallest integer such that m′ ≤ 2^b. We encode B into an array of size 2^{b+1} − 1: the prefix of length 2^b − 1 stores the internal nodes of C (each cell contains a label and gives access, by index arithmetic, to the two cells corresponding to its left and right children), ordered in breadth-first order, while the entries of B are stored in the suffix of length m′ of the array (see Figure 2). From now on, we identify the nodes of the binary tree with the cells of the array, which we denote by B.

[Figure 2: the internal nodes of C stored first, in breadth-first order, followed by the m′ entries of B.]

Fig. 2.
Example of the implementation of the data structure B with an array.

Using this implementation, for a given node of the binary search tree, say encoded by the cell in position x of B (called node x from now on), we can quickly obtain the position, in the array, of its left child and of its right child, but also of its parent (if node x is not the root) and of its rightmost descendant, defined as the unique leaf reached from x by a maximal path of edges leading to right children. Indeed, it is straightforward to verify that the constraint of ordering the nodes of the binary tree in the array according to a breadth-first order implies that, for a node x, if y is the largest integer such that 2^y ≤ x + 1 and z = x − 2^y + 1, then:
– x is a leaf if x ≥ 2^b − 1;
– leftChild(x) = 2^{y+1} − 1 + 2z if x is not a leaf;
– rightChild(x) = 2^{y+1} − 1 + 2z + 1 if x is not a leaf;
– parent(x) = −1 if x = 0 (x is the root);
– parent(x) = 2^{y−1} − 1 + ⌊z/2⌋ if x ≠ 0;
– rightmostChild(x) = 2^b − 1 + (z + 1)2^{b−y} − 1.

Implementing the DP and LS algorithms with the hybrid data structure.
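To make this encoding concrete, here is a hypothetical Python sketch of B (class and method names are ours; the children of node x are located with the standard 0-based heap rule 2x + 1 and 2x + 2, which is equivalent to the formulas above; set_score follows Algorithm 3 below, get_best_score implements the specification of Algorithm 4, i.e. the best score strictly below p, and the two O(m′) conversions used when switching between the DP and LS modes are sketched as refresh_internal and prefix_max_leaves):

```python
class HybridB:
    """The hybrid structure B: a complete binary tree stored in a single
    array in breadth-first order. Cells 0 .. 2**b - 2 hold the internal
    nodes of C; the following cells hold the leaves (positions of u')."""

    def __init__(self, m):
        self.b = 0
        while (1 << self.b) < m:        # smallest b with m <= 2**b
            self.b += 1
        self.leaf0 = (1 << self.b) - 1  # index of the leaf of position 0
        self.tree = [0] * ((1 << (self.b + 1)) - 1)

    def parent(self, x):
        return (x - 1) // 2 if x > 0 else -1

    def rightmost_child(self, x):
        while x < self.leaf0:           # follow right children to a leaf
            x = 2 * x + 2
        return x

    def set_score(self, p, score):      # Algorithm 3
        x = self.leaf0 + p              # bubble the score up from leaf p
        while x != -1 and self.tree[x] < score:
            self.tree[x] = score
            x = self.parent(x)

    def get_best_score(self, p):        # Algorithm 4's specification
        # best score among leaves strictly below position p
        target = self.leaf0 + p
        best, x = 0, 0                  # start at the root
        while x < self.leaf0:
            left = 2 * x + 1
            if self.rightmost_child(left) >= target:
                x = left                # the boundary is in the left subtree
            else:
                best = max(best, self.tree[left])  # whole subtree below p
                x = 2 * x + 2
        return best

    def refresh_internal(self):         # update, DP mode -> LS mode
        for x in range(self.leaf0 - 1, -1, -1):
            self.tree[x] = max(self.tree[2 * x + 1], self.tree[2 * x + 2])

    def prefix_max_leaves(self):        # update, LS mode -> DP mode
        for x in range(self.leaf0 + 1, len(self.tree)):
            self.tree[x] = max(self.tree[x], self.tree[x - 1])
```

For example, with m′ = 10, after set_score(3, 5) and set_score(7, 2), get_best_score(4) returns 5 while get_best_score(3) returns 0, and the root tree[0] holds the overall best score.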
It is then easy to implement the DP algorithm using the data structure B, by using B as the current column of the DP table (i.e. if the currently processed position of t′ is i, then B[j] is the score of the best partial chain included in the rectangle defined by (0, 0) and (i, j)), without updating the internal nodes of the binary search tree C.

To implement the LS algorithm, the key points are
– to be able to update the data structure B efficiently when a fragment s has been processed;
– to be able to find the best score of a partial chain ending, in u′, at a position strictly below p.

Algorithm 3 Set a chaining score for a position p.
1 setScore(B, p, score):
2   index = 2^b − 1 + p // start from the leaf corresponding to p
3   while index ≠ −1 and B[index] < score
4     B[index] = score
5     index = parent(index)

Algorithm 4 Retrieve the best chaining score for partial chains ending strictly below position p.
1  getBestScore(B, p):
2    let b be the smallest integer s.t. m′ ≤ 2^b
3    maxScore = 0
4    currentNode = 0 // the root node
5    indexOfP = 2^b − 1 + p
6    while rightmostChild(currentNode) > indexOfP
7      left = leftChild(currentNode)
8      if rightmostChild(left) ≥ indexOfP // move left
9        currentNode = left
10     else // move right
11       maxScore = max(maxScore, B[left])
12       currentNode = rightChild(currentNode)
13   return max(maxScore, B[currentNode])

Updating B can be done through the function setScore above, with parameters p = s.t and score = S[s], while the second task can be achieved by the function getBestScore, which is a simple binary tree search. It is straightforward to see that if all updates of B are done using the function setScore, then the two required invariants on B are satisfied. The time complexity of both setScore and getBestScore is O(log(m′)), due to the fact that the binary tree is balanced. So we can implement the LS algorithm on compact instances using the data structure B, by replacing the instruction in line 11 of the LS algorithm by a call to getBestScore(B, s.b), replacing the block of instructions in lines 13-17 by setScore(B, s.t, S[s]), and reading the optimal chain score in the root of the binary tree. The complexity of the operations over B is logarithmic in m′, which is less than or equal to k. Thus the overall time complexity is O(k log m′).

LS/DP update with the hybrid data structure.
In a hybrid algorithm that relies on the data structure B, when the algorithm switches approaches (from DP to LS, or from LS to DP), the data structure B is assumed to be consistent for the current approach, and needs to be updated to become consistent for the next approach.

So when switching from DP (say at position i − 1, i = 1, . . . , n′) to LS (position i), we assume that B[j] (j = 0, . . . , m′ − 1) is the optimal score of a partial chain in the rectangle defined by (0, 0) and (i − 1, j), and we want to update B in such a way that the label of any internal node x of the binary tree is the maximum of the labels of its two children. As the entries of B are the leaves of the binary tree, this update can be done during a post-order traversal of the binary tree, so in time O(m′).

When switching from LS to DP (say to use the DP approach on position i while the LS approach was used on position i − 1), for every leaf B[j] of the binary tree, the value in B[j] is the optimal score of a partial chain whose last fragment ends, in u′, in position j and, in t′, in a position at most i − 1; this follows immediately from the way the labels of the leaves of the binary tree are set by the setScore function. To update B, we want B[j] to be the optimal score of a partial chain whose last fragment ends, in t′, in a position at most i − 1 and, in u′, in a position at most j. So the update function only needs to give to B[j] the value max_{0 ≤ j′ ≤ j} B[j′], which can again be done in time O(m′).

So updating the data structure B from DP to LS, or from LS to DP, can be done in time O(m′). We denote by update the function performing this update.

Deciding between LS and DP using the fragment density.
Before we can finally introduce our algorithm, we need to address the key point of how to decide which paradigm (DP or LS) to use when processing the fragments having a border in the current position of t′, say c. Let K_c be the number of fragments s such that s.ℓ = c or s.r = c. Using the DP approach, the cost of updating B (i.e. of computing column c of the DP table) is O(m′ + K_c). With the LS approach, the cost of updating B is O(K_c log m′). So, if K_c > m′/(log m′ − 1), the asymptotic cost of the DP approach is better than the asymptotic cost of the LS approach, while it is the converse if K_c ≤ m′/(log m′ − 1). (Indeed, m′ + K_c < K_c log m′ if and only if K_c > m′/(log m′ − 1).)

So, prior to processing the fragments, for each position i of t′ (i = 0, . . . , n′ − 1), we compute a binary array C indicating whether the fragment borders in position i are processed using the DP approach (C[i] contains DP) or the LS approach (C[i] contains LS). This last observation leads to our main result, Algorithm 5 below.

Algorithm 5
A hybrid algorithm for the fragment chaining problem.
1  compute the compact instance (t′, u′, S′)
2  L1: an array of n′ × 2 linked lists
3  C: a binary array of size n′
4  foreach s in S′ do
5    if C[s.r] is DP then front insert (s, end, s.t) into L1[s.r][1]
6    else front insert (s, end) into L1[s.r][0]
7    if C[s.ℓ] is DP then front insert (s, begin, s.b) into L1[s.ℓ][1]
8    else front insert (s, begin) into L1[s.ℓ][0]
9  B: a binary tree with m′ leaves (all nodes are set to zero)
10 B: refers to the m′ leaves of B
11 S: an array of k integers
12 for i from 0 to n′ − 1 do
13   if C[i] ≠ C[i − 1] then update(B)
14   if C[i] is DP
15     L2: an array of m′ linked lists
16     foreach (s, t, j) in L1[i][1] do front insert (s, t) into L2[j]
17     left = 0, leftDown = 0
18     for j from 0 to m′ − 1 do
19       maxC = 0
20       foreach (s, type) in L2[j] do
21         if type is begin then S[s] = s.s + leftDown
22         if type is end and S[s] > maxC then maxC = S[s]
23       leftDown = left, left = B[j]
24       B[j] = max(B[j], B[j − 1], maxC)
25   else // C[i] is LS
26     foreach (s, type) in L1[i][0] do
27       if type is begin then S[s] = s.s + getBestScore(B, s.b)
28       if type is end then setScore(B, s.t, S[s])
29 if C[n′ − 1] is DP then return B[m′ − 1]
30 else return value of the root of B
In terms of space complexity, the algorithm avoids using O(k + n′ × m′) space for storing the fragment borders in n′ × m′ lists (the structure L of the DP algorithm) by using two arrays of lists: L1[i] stores all the fragment borders in position i of t′, while L2[j] stores all the fragment borders in position i of t′ and position j of u′, and is computed from L1[i][1]. So the total space requirement is O(k + m′ + n′).

We now establish the time complexity of this algorithm. If the current position i of t′ is tagged as DP, the cost for updating the column is O(m′ + K_i), including the cost of setting up L2 from L1, which is proportional to the number of fragment borders in the current position (lines 14–24). If C[i] is LS, the cost for computing the chain scores at this position is O(K_i log m′) (lines 25–28). Thus, if we call P1 the set of positions of t′ where we use the DP approach, P2 the set of positions of t′ where we use the LS approach, and P = P1 ∪ P2, the time for the whole loop at line 12 is

O( Σ_{p ∈ P1} (m′ + K_p) + Σ_{p ∈ P2} K_p log m′ ).

We have |P1| + |P2| = n′, K_p > m′/(log m′ − 1) for every p ∈ P1, and K_p ≤ m′/(log m′ − 1) for every p ∈ P2. Moreover, updating the data structure B from LS to DP or from DP to LS (line 13) is done at most one more time than the size of P1, so the total cost of this operation is O(Σ_{p ∈ P1} m′), and can thus be integrated, asymptotically, into the cost of processing the positions in P1.

Theorem 1.
The hybrid algorithm computes an optimal chain score in time

O( k + min(k log k, m) + min(k log k, n) + Σ_{p ∈ P1} (m′ + K_p) + log m′ Σ_{p ∈ P2} K_p )    (1)

and space O(k + n + m).

To conclude the complexity analysis, we show that the hybrid algorithm performs at least as well, asymptotically, as both the DP and the LS algorithms. From (1), we deduce that, if P2 = P, the hybrid algorithm time complexity becomes O(k + min(k log k, m) + min(k log k, n) + k log m′), which is at worst equal to the asymptotic worst-case time complexity of the LS algorithm, as m′ = min(m, k). Now, if P2 ≠ P, then for every position c in P1 we know that the cost of updating B and processing c with the DP approach is not worse than processing it with the LS approach, by the value chosen for the threshold on K_c. This ensures that, asymptotically, the hybrid algorithm performs at least as well as the LS algorithm.

We consider now the DP algorithm. Again, from (1), if P1 = P, the complexity becomes O(k + min(k log k, m) + min(k log k, n) + m′n′), which is equal to the time complexity of the original dynamic programming algorithm, as n′ = min(n, k) and m′ = min(m, k). As above, if we assume now that P1 ≠ P, then we know that the cost of processing the positions of P2 with the LS approach is asymptotically not worse than processing them with the DP approach. The cost of updating B when switching from DP to LS can be integrated into the asymptotic cost of the DP part. This shows that the hybrid algorithm is, asymptotically, not worse than the pure DP algorithm.

Discussion
Our main result in the present paper is a hybrid algorithm that combines the positive features of both the classical dynamic programming algorithm and the line sweep algorithm for the fragment chaining problem. We showed that a simple data structure can be used to alternate between both algorithmic principles, thus benefiting from the positive behavior of both algorithms. Not surprisingly, the choice between the DP and the LS principle is based on the fragment density.

It is easy to define instances where the hybrid algorithm performs better, asymptotically, than both the DP and LS algorithms. For example, if m = n and k = 2n, with n fragment extremities on t[0] and n extremities on t[n − 1] and the fragments otherwise spread over t and u, we can show that the complexities are O(n²) for the DP algorithm, O(n log n) for the LS algorithm and O(n) for the hybrid algorithm. However, so far our result is mostly theoretical. The threshold of m′/(log(m′) − 1), considered on real genome data, is high, as it assumes a very high fragment density that is unlikely to be observed often, at least in applications such as the alignment of whole bacterial genomes. Preliminary experiments on such data, following the approach developed in [13], show that the LS algorithm is slightly more efficient than the hybrid one. So it remains to be seen if it could result in an effective speed-up when chaining fragments in actual biological applications, especially those involving high-throughput sequencing data or overlapping fragments [13]. From a practical point of view, it is also of interest to consider algorithm engineering aspects, especially related to the hybrid data structure, to see if this could alleviate the issue of the high density threshold required to switch between the LS and DP approaches, and to assess the practical interest of the novel theoretical framework introduced in the present paper.
References
1. M.I. Abouelhoda and E. Ohlebusch. Chaining algorithms for multiple genome comparison. J. Discrete Algorithms 3, pp. 321–341, 2005.
2. L. Arge, J. Fischer, P. Sanders, and N. Sitchinava. On (Dynamic) Range Minimum Queries in External Memory. In WADS 2013, vol. 8037 of LNCS, pp. 37–48, 2013.
3. D. Eppstein, Z. Galil, R. Giancarlo, and G.F. Italiano. Sparse dynamic programming. I: linear cost functions; II: convex and concave cost functions. J. Assoc. Comput. Mach. 39, pp. 519–567, 1992.
4. S. Felsner, R. Müller and L. Wernisch. Trapezoid graphs and generalizations, geometry and algorithms. Discrete Appl. Math. 74, pp. 13–32, 1997.
5. D. Gusfield. Algorithms on Strings, Trees and Sequences. Cambridge University Press, 1997.
6. D.S. Hirschberg. A linear space algorithm for computing maximal common subsequences. Comm. Assoc. Comput. Mach. 18, pp. 341–343, 1975.
7. M. Höhl, S. Kurtz and E. Ohlebusch. Efficient multiple genome alignment. Bioinformatics 18, pp. S312–S320, 2002.
8. D. Joseph, J. Meidanis, and P. Tiwari. Determining DNA sequence similarity using maximum independent set algorithms for interval graphs. In SWAT 1992, vol. 621 of LNCS, pp. 326–337, 1992.
9. B. Morgenstern. A simple and space-efficient fragment-chaining algorithm for alignment of DNA and protein sequences. Appl. Math. Lett. 15, pp. 11–16, 2002.
10. G. Myers and W. Miller. Chaining multiple-alignment fragments in sub-quadratic time. In SODA 1995, pp. 38–47, 1995.
11. G. Myers and X. Huang. An O(N log N) restriction map comparison and search algorithm. Bull. Math. Biol. 54, pp. 599–618, 1992.
12. E. Ohlebusch and M.I. Abouelhoda. Chaining Algorithms and Applications in Comparative Genomics. In (S. Aluru, ed.) Handbook of Computational Molecular Biology. CRC Press, 2005.
13. R. Uricaru, A. Mancheron and E. Rivals. Novel Definition and Algorithm for Chaining Fragments with Proportional Overlaps.