Computing alignment plots efficiently
Peter Krusche and Alexander Tiskin*,
Dept. of Computer Science, University of Warwick, Coventry, CV4 7AL, UK
Abstract.
Dot plots are a standard method for local comparison of biological sequences. In a dot plot, a substring-to-substring distance is computed for all pairs of fixed-size windows in the input strings. Commonly, the Hamming distance is used, since it can be computed in linear time. However, the Hamming distance is a rather crude measure of string similarity, and using an alignment-based edit distance can greatly improve the sensitivity of the dot plot method. In this paper, we show how to compute alignment plots of the latter type efficiently. Given two strings of lengths m and n and a window size w, this problem consists in computing the edit distance between all pairs of substrings of length w, one from each input string. The problem can be solved by repeated application of the standard dynamic programming algorithm in time O(mnw). This paper gives an improved data-parallel algorithm, running in time O(mnw/(γp)) using vector operations that work on γ values in parallel and p processors.
1. Introduction
Dot plots are a standard method for local comparison of two biological sequences, introduced by Gibbs/McIntyre [6] and Maizel/Lenk [10]. When creating a dot plot, a substring-to-substring distance is computed for all pairs of fixed-size windows in the input strings. The result can be visualized by a plot showing a dot for each pair of windows that achieves a distance below a fixed threshold. Commonly, the Hamming distance is used [10,9], since it can be computed very efficiently. However, the Hamming distance is a rather crude measure of string similarity. Using a string edit distance or alignment score (see e.g. [7]) for dot plot filtering can greatly improve the sensitivity of the method. In the context of biological sequence comparison, this idea has been implemented by Ott et al. [12], where a sequential algorithm is given which creates an alignment plot for two strings of lengths m and n using a fixed window length w in time O(mnw).

In this paper, we give an improved data-parallel algorithm, running in time O(mnw/γ) using vector operations that work on γ values in parallel, and show experimental speedups from an implementation using MMX [8]. Furthermore, we demonstrate that the algorithm can be parallelized to run on multiple processors using MPI [14].

* Research supported by the Centre for Discrete Mathematics and Its Applications (DIMAP), University of Warwick, EPSRC award EP/D063191/1. Computational resources were provided by the Centre for Scientific Computing at the University of Warwick.

2. Computing longest common subsequences and string alignments

Let x = x_1 x_2 ... x_m and y = y_1 y_2 ... y_n be two strings over an alphabet Σ of size σ. We distinguish between contiguous substrings of a string x, which can be obtained by removing zero or more characters from the beginning and/or the end of x, and subsequences, which can be obtained by deleting zero or more characters in any position.
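The two notions can be made concrete in a few lines (a sketch; the function names are ours), together with the standard dynamic programming computation of the LLCS used throughout the paper:

```python
def is_subsequence(s, t):
    """True if s can be obtained from t by deleting zero or more characters."""
    it = iter(t)
    return all(c in it for c in s)

def llcs(x, y):
    """Length of the longest common subsequence, by standard dynamic programming."""
    m, n = len(x), len(y)
    # D[i][j] = LLCS of the prefixes x[:i] and y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                D[i][j] = D[i - 1][j - 1] + 1
            else:
                D[i][j] = max(D[i - 1][j], D[i][j - 1])
    return D[m][n]
```

Any contiguous substring is also a subsequence, but not vice versa: "ace" is a subsequence of "abcde" without being a substring of it.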
The longest common subsequence (LCS) of two strings is the longest string that is a subsequence of both input strings; its length (the LLCS) is a measure for the similarity of the two strings. Substrings of length w are called w-windows. For a given w, the length of the LCS of two w-windows x_i ... x_{i+w-1} and y_j ... y_{j+w-1} will be denoted WLCS(i, j). An alignment plot for x and y consists of all values WLCS(i, j) with i ∈ {1, 2, ..., m − w + 1}, j ∈ {1, 2, ..., n − w + 1}.

Although the LCS is more accurate than the Hamming score, more general similarity measures are of interest in practice. A standard interpretation of LCS is string alignment [7, p. 209 ff.]. An alignment of strings x and y is obtained by putting a subsequence of x into one-to-one correspondence with a (not necessarily identical) subsequence of y, character by character and respecting the index order. The corresponding pairs of characters, one from x and the other from y, are said to be aligned. A character not aligned with a character of the other string is said to be aligned with a gap in that string. Finding the LCS corresponds to computing a maximum alignment when assigning the scores w_= = 1 to aligning a matching pair of characters, w_gap = 0 to inserting a gap, and w_≠ = 0 to aligning two mismatching characters. More general alignments than the LCS can be obtained using the standard dynamic programming algorithm [17,11], which allows for gap penalties as well as different scores for each individual pair of matching/mismatching characters, forming a pairwise score matrix. Any algorithm for LCS computation can be generalized to pairwise score matrices with small rational scores at the price of a constant factor blow-up of the input strings [16]. In this paper, we will consider alignments with match score w_= = 1, mismatch score w_≠ = 0 and gap penalty w_gap = −0.5.

To compute these alignments, we modify the input strings by adding a new character $ to the alphabet, which we insert before every character in both input strings, such that e.g. abab transforms into $a$b$a$b. For input strings x and y of lengths m and n, the alignment score S(x, y) can be retrieved from the LLCS of the modified strings x′ and y′ as S(x, y) = LLCS(x′, y′) − 0.5 · (m + n). We expect the running time of the seaweed algorithm to increase by a factor of four by this reduction, as both input strings double in size.

Our new algorithms are based on semi-local sequence alignment [15], for which we now give the necessary definitions. Throughout this paper, we will denote the set of integers {i, i + 1, ..., j} by [i : j], and the set of odd half-integers {i + 1/2, i + 3/2, ..., j − 1/2} by ⟨i : j⟩. We will further mark odd half-integer variables by a "ˆ" symbol. When indexing a matrix M by odd half-integer values î and ĵ, we define M(î, ĵ) = M(i, j) with i = î − 1/2, j = ĵ − 1/2. Therefore, if a matrix has integer indices [1 : m] × [1 : n], it has odd half-integer indices ⟨1 : m + 1⟩ × ⟨1 : n + 1⟩. We also define the distribution matrix D^Σ of an m × n matrix D as D^Σ(i, j) = Σ D(î, ĵ) with (î, ĵ) ∈ ⟨i : m + 1⟩ × ⟨0 : j⟩.

Let the alignment dag (directed acyclic graph) G_{x,y} for two strings x and y be defined by a set of vertices v_{i,j} with i ∈ [0 : m] and j ∈ [0 : n], and edges as follows. We have horizontal and vertical edges v_{i,j−1} → v_{i,j} and v_{i−1,j} → v_{i,j} of score 0. Further, we introduce diagonal edges v_{i−1,j−1} → v_{i,j} of score 1, which are present only if x_i = y_j. Longest common subsequences of a substring x_i x_{i+1} ... x_j and y correspond to highest-scoring paths in this graph from v_{i−1,0} to v_{j,n}. When drawing the alignment dag in the plane, its horizontal and vertical edges partition the plane into rectangular cells, each of which, depending on the input strings, may or may not contain a diagonal edge. For every pair of characters x_i and y_j, we define a corresponding cell (i − 1/2, j − 1/2). Cells corresponding to a matching pair of characters are called match cells, and cells corresponding to mismatching characters are called mismatch cells.

Solutions to the semi-local LCS problem are given by a highest-score matrix, which we define as follows. In a highest-score matrix A_{x,y}, each entry A_{x,y}(i, j) is defined as the length of the highest-scoring path in G_{x,y} from v_{0,i} to v_{m,j}. Each entry A_{x,y}(i, j) with 0 ≤ i ≤ j ≤ n gives the LLCS of x and the substring y_{i+1} ... y_j. Since the values of A_{x,y}(i, j) for different i and j are strongly correlated, it is possible to derive an implicit, space-efficient representation of the matrix A_{x,y}. This implicit representation of a semi-local highest-score matrix consists of a set of critical points.
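The $-interleaving reduction is easy to check numerically. A sketch (the helper names are ours) that interleaves the $ characters and applies S(x, y) = LLCS(x′, y′) − 0.5 · (m + n):

```python
def llcs(x, y):
    """Standard dynamic programming LLCS."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = (D[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(D[i - 1][j], D[i][j - 1]))
    return D[m][n]

def blow_up(s):
    """Insert '$' before every character: 'abab' -> '$a$b$a$b'."""
    return "".join("$" + c for c in s)

def alignment_score(x, y):
    """Alignment score with match +1, mismatch 0, gap -0.5, recovered
    from the LLCS of the blown-up strings."""
    return llcs(blow_up(x), blow_up(y)) - 0.5 * (len(x) + len(y))
```

For example, aligning "ab" against "b" matches the b and leaves the a against a gap, for a score of 1 − 0.5 = 0.5; the formula recovers exactly this value from the blown-up strings.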
The critical points of a highest-score matrix A are defined as the set of odd half-integer pairs (î, ĵ) such that A(î + 1/2, ĵ − 1/2) + 1 = A(î − 1/2, ĵ − 1/2) = A(î + 1/2, ĵ + 1/2) = A(î − 1/2, ĵ + 1/2). Consider a highest-score matrix A. The matrix D_A with D_A(î, ĵ) = 1 if (î, ĵ) is a critical point in A, and D_A(î, ĵ) = 0 otherwise, is called the implicit highest-score matrix.

Tiskin [15] showed that in order to represent a highest-score matrix for two strings of lengths m and n, exactly m + n such critical points are sufficient.

Theorem 2.1 (see [15] for proof). A highest-score matrix A can be represented implicitly using only O(m + n) space by its implicit highest-score matrix D_A, which is a permutation matrix. We have A(i, j) = j − i − D_A^Σ(i, j), where D_A^Σ is the distribution matrix of the implicit highest-score matrix D_A.

The set of critical points can be obtained using the seaweed algorithm (by Alves et al. [3], based on Schmidt [13], adapted by Tiskin [15]), which computes critical points by dynamic programming on all prefixes of the input strings. This method is graphically illustrated by tracing seaweeds that start at odd half-integer positions between two adjacent vertices v_{0,î−1/2} and v_{0,î+1/2} in the top row of the alignment dag, and end between two adjacent vertices v_{m,ĵ−1/2} and v_{m,ĵ+1/2} in the bottom row. Each critical point is computed as the pair of horizontal start and end coordinates of such a seaweed (see Algorithm 1). Two seaweeds enter every cell in the alignment dag, one at the left and one at the top. The seaweeds proceed through the cell either downwards or rightwards. In the cell, the directions of these seaweeds are interchanged either if there is a match x_k = y_l, or if the same pair of seaweeds has already crossed. Otherwise, their directions remain unchanged and the seaweeds cross.
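The crossing rule admits a compact prototype. The sketch below is our own formulation, not the paper's in-place Algorithm 1: seaweeds are labelled by their start positions, with left-boundary starts encoded as negative half-integers, and Theorem 2.1 is checked against plain dynamic programming:

```python
def seaweed_ends(x, y):
    """One pass of the seaweed rule: in each cell, swap the two seaweeds on a
    match, or if they have already crossed (detected by left start > top
    start); otherwise let them cross. Returns, for each bottom position
    j + 1/2, the start position of the seaweed ending there."""
    m, n = len(x), len(y)
    top = [j + 0.5 for j in range(n)]   # starts of seaweeds now moving down
    for i in range(m):
        l = -i - 0.5                    # seaweed entering row i from the left
        for j in range(n):
            t = top[j]
            if x[i] == y[j] or l > t:
                l, t = t, l             # directions interchanged: no crossing
            top[j] = t                  # t proceeds downwards
        # l leaves the strip at the right boundary
    return top

def llcs(x, y):
    """Standard dynamic programming LLCS, for cross-checking."""
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = (D[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(D[i - 1][j], D[i][j - 1]))
    return D[m][n]

def llcs_window(x, y, i, j):
    """A(i, j) = LLCS of x and y[i:j] via Theorem 2.1: j - i minus the number
    of seaweeds starting and ending strictly inside the range (i, j)."""
    ends = seaweed_ends(x, y)
    d = sum(1 for j2, s in enumerate(ends) if i < s and j2 + 0.5 < j)
    return j - i - d
```

The final `top` array encodes the top-to-bottom part of the seaweed permutation; entries with negative start positions belong to seaweeds that entered from the left boundary and are irrelevant for window queries.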
By Theorem 2.1, the length of the highest-scoring path in G_{x,y} from v_{0,i} to v_{m,j} can be computed by counting the number of seaweeds which both start and end within ⟨i : j⟩.

3. Data-parallel window-window comparison using seaweeds

The seaweed algorithm can be used to compute the LCS of all pairs of w-windows simultaneously in time O(wn) for two strings of respective lengths w and n (i.e. one of the strings consists of only one w-window). By Theorem 2.1, the LCS of x and any w-window y_i ... y_{i+w-1} is determined by the number of seaweeds starting and ending within the corresponding odd half-integer range of width w: the LLCS equals w minus this count. By keeping track of only these seaweeds in a sliding window, our algorithm can compute the LLCS for all w-windows in a single pass over all columns of cells in the alignment dag. We obtain an improved algorithm for comparing all pairs of w-windows.

Theorem 3.1.
Given two strings x and y of lengths m and n, the LLCS for all pairs of w-windows between x and y can be computed in time O(mnw).

Proof. We apply the seaweed algorithm to compute the implicit highest-score matrix for y against every substring of x of length w. Each application of the seaweed algorithm therefore runs on a strip of height w and width n of the alignment dag, corresponding to x_i ... x_{i+w-1} and y. A column of cells in this strip can be processed in time O(w). In each column j, exactly one new seaweed starts at the top of the alignment dag, and exactly one seaweed ends at the bottom. We track seaweeds ending within ⟨j − w : j⟩. To count the seaweeds that have reached the bottom of the alignment dag, we maintain a priority queue B. In each step, one seaweed reaches the bottom of the alignment dag. Furthermore, in each step, we have to delete at most one seaweed from B. We use a priority queue of ⌈log n⌉-bit integers to represent B. For each seaweed that reaches the bottom, we compute its starting point and add it to the queue. We delete the minimum value from the queue if it is smaller than the starting position of the current w-window in string y. By using a min-heap [4], both operations can be implemented in O(log w) time. The number d of seaweeds which start within ⟨j − w : j⟩ is then given by the size of queue B. The LLCS of y_{j-w+1} ... y_j and x_i ... x_{i+w-1} can then be calculated as w − d. In total, we have to process n columns using time O(w) each in every strip. Overall, m − w + 1 strips exist; therefore we obtain running time O(mnw).

Algorithm 1
The Seaweed Algorithm
input: strings x and y
output: the critical points for x against y

Initialise J[î] = î + m for î ∈ ⟨−m : n⟩
for r ∈ [1 : m] do
    l ← −∞
    for c ∈ [1 : n] do
        t ← J[c − 1/2]
        if x_r = y_c or l > t then
            swap l and t
        end if
        J[c − 1/2] ← t
    end for
    J[n + r − 1/2] ← l
end for
return the points {(î, J[î]) | î ∈ ⟨−m : n⟩ with J[î] ≠ −∞}

[Figure: seaweed behaviour in a cell — the two seaweeds cross if l < t, and are interchanged if l > t.]

While this direct application of the seaweed method gives an asymptotic improvement on the method of computing the LCS independently for every pair of windows by dynamic programming, it is not necessarily more practical. The dynamic programming method can exploit the fact that we are only interested in windows with an alignment score above a given threshold. More importantly, the dynamic programming method allows one to improve performance by introducing a step size h, and only comparing w-windows starting at positions that are multiples of h. We will now show how to improve the practical performance of the algorithm.

Algorithm 1 requires O(log(m + n)) bits to represent the start and end points of a single seaweed. We first show that for computing alignment plots with a fixed window length w, O(log w) bits are sufficient for tracing seaweeds, independently of the size of the input strings. To show this, we define the span of a seaweed as the horizontal distance it covers in the alignment dag. A seaweed corresponding to a critical point (î, ĵ) has span ĵ − î. Seaweeds that have a span greater than the window length w are not relevant for computing the alignment plot, since they will not start and end within a single window. Furthermore, we are only interested in values with index pairs (i, j) having i mod r = j mod r = 0, where r is the constant blowup induced by the alignment score (for the scoring scheme described in the previous section, we have r = 2). This is equivalent to computing the semi-local LCS for substrings restricted to length w, starting and ending at positions 0 mod r.

Definition 3.2.
Let A be a highest-score matrix. The (w, r)-restricted highest-score matrix A_{w,r} is defined as A_{w,r}(i, j) = A(i, j) if j − i ≤ w and i mod r = j mod r = 0, and A_{w,r}(i, j) = undefined otherwise.

This restriction on the highest-score matrices allows us to reduce the number of critical points we need to store, and also to reduce the number of bits required to represent seaweeds in our computation.

Proposition 3.3.
To represent a (w, r)-restricted highest-score matrix implicitly, we only need to store the critical points (î, ĵ) of the corresponding unrestricted highest-score matrix for which ĵ − î < w.

Proof. Straightforward from Theorem 2.1 and Definition 3.2.
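Proposition 3.3 can be checked on small inputs: dropping every critical point whose span is at least w changes no w-window score. A sketch (helper names are ours; the division of spans by r from the bit-count argument below is omitted):

```python
def seaweed_ends(x, y):
    """Seaweed pass; returns the start position of the seaweed ending at
    each bottom position j + 1/2 (negative starts = left boundary)."""
    m, n = len(x), len(y)
    top = [j + 0.5 for j in range(n)]
    for i in range(m):
        l = -i - 0.5
        for j in range(n):
            t = top[j]
            if x[i] == y[j] or l > t:
                l, t = t, l
            top[j] = t
    return top

def critical_points(x, y):
    """Top-to-bottom critical points (start, end) of the highest-score matrix."""
    return [(s, j + 0.5) for j, s in enumerate(seaweed_ends(x, y)) if s >= 0]

def wlcs_from_points(points, w, j):
    """LLCS of x against the w-window y[j-w:j], by Theorem 2.1: the window
    length minus the number of seaweeds inside the window range."""
    d = sum(1 for s, e in points if j - w < s and e < j)
    return w - d

def restrict(points, w):
    """Keep only critical points of span < w (Proposition 3.3)."""
    return [(s, e) for s, e in points if e - s < w]
```

A point with span ≥ w can never satisfy both j − w < start and end < j, so the restricted representation answers every w-window query identically.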
Proposition 3.4.
We can represent a single critical point in a (w, r)-restricted highest-score matrix for comparing strings x and y using O(log(w/r)) bits.

Proof. We store the seaweeds in a vector S of size m + n, where each vector element stores ⌈log(w/r + 1)⌉ bits. For each critical point (î, ĵ), we have one vector element S(î + 1/2) = min(2^⌈log(w/r+1)⌉ − 1, (ĵ − î)/r). Each vector element stores the span of the seaweed starting at î.

It is straightforward to see that we only need O(log w) bits for a vector element: seaweeds in a (w, r)-restricted highest-score matrix become irrelevant once their span is larger than w, since these critical points will not affect any LCS for a substring of length w (see Theorem 2.1). In order to reduce the number of bits to O(log(w/r)), we use the fact that we only need to answer LCS queries correctly if i mod r = j mod r = 0. We therefore do not need to distinguish the individual r seaweeds starting within [k : k + r − 1] with k mod r = 0: once they reach the bottom, we only need to know their starting position within a window of size r. We can therefore divide the distance values by r, which gives the claimed number of required bits.

We now show how to use vector instructions to improve Algorithm 1 by tracing multiple seaweeds in parallel. A practical example of vector parallelism is Intel's MMX instruction set [8] for integer vector arithmetic and comparison (it would also be possible to implement our algorithm using floating point vector processing, e.g. using SSE [8]). In our algorithm, we assume that all elements V(j) in a vector V are v-bit values. If an element of a vector has all bits set, then this represents the value +∞, having INF ≡ 2^v − 1.

When carrying out the seaweed algorithm on columns of the alignment dag, the result of every comparison in a cell of the column depends on the comparison result from the cell above it. To be able to process multiple cells in parallel, we need to process cells by antidiagonals. We can then use vector operations to implement each step of the seaweed algorithm, as each cell can be processed using only data computed in the previous step. We need to track seaweeds only if they are within the window of interest. In order to keep the required number of bits per seaweed as small as possible (and hence allow a high degree of vector parallelism), we identify seaweeds by the distance of their starting points to the current column. This distance can be represented using v = ⌈log(w + 1)⌉ bits (see Proposition 3.4). When advancing to the next column, we use saturated addition to increment all distances in parallel (i.e. saturated addition of one to vector element V(k) gives V(k) + 1 if V(k) < w, and INF otherwise). In each step, we compare the characters corresponding to the cells in the current antidiagonal using vectorised mask generation. Given two vectors V and W, we generate a mask vector which contains the value INF at all positions k where V(k) = W(k), and zero otherwise. The seaweed behaviour in mismatch cells is implemented by a compare/exchange operation which, given two vectors V and W, exchanges V(k) and W(k) only if V(k) > W(k). To combine the results from the mismatch cells and the match cells, we require an operation to exchange vector elements conditionally using the mask vector M generated earlier. Given two vectors V and W, this operation returns vector elements V(k) if M(k) = INF, and W(k) otherwise. All these operations can be vectorized efficiently using MMX.
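These four primitives can be modelled in scalar code to make the data flow explicit. The sketch below uses plain Python lists standing in for 8-bit MMX vector registers; the names are ours, and the orientation of the comparisons in a real implementation depends on how start distances are encoded:

```python
INF = (1 << 8) - 1        # all bits set in an 8-bit element

def saturated_inc(V, w):
    """Add one to every element, saturating at INF once a distance exceeds w."""
    return [v + 1 if v < w else INF for v in V]

def eq_mask(A, B):
    """Mask generation: INF where the two character vectors agree, 0 elsewhere."""
    return [INF if a == b else 0 for a, b in zip(A, B)]

def compare_exchange(V, W):
    """Exchange V(k) and W(k) wherever V(k) > W(k) (mismatch-cell behaviour)."""
    return ([min(v, u) for v, u in zip(V, W)],
            [max(v, u) for v, u in zip(V, W)])

def select(M, V, W):
    """Conditional exchange: return V(k) where M(k) = INF, W(k) otherwise."""
    return [v if m == INF else u for m, v, u in zip(M, V, W)]
```

One antidiagonal step then combines the primitives: compute the match mask, apply compare/exchange for the mismatch cells, swap the two input vectors outright for the match cells, and merge the two results with `select`.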
Using these operations, we can implement the seaweed operations from Algorithm 1 by storing all seaweeds on the current antidiagonal in a vector V if they enter the respective cell from the left, and in a vector W if they enter the respective cell from the top. We use a v-bit shift operation on vector W in each step to advance the seaweeds leaving cells at the bottom downwards.
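The antidiagonal order itself is easy to state: cell (i, j) depends only on cells (i − 1, j) and (i, j − 1), which both lie on antidiagonal i + j − 1, so all cells with the same i + j can be processed in one vector step. A minimal sketch (the function name is ours):

```python
def antidiagonals(m, n):
    """Yield the cells of an m-by-n grid of cells, antidiagonal by
    antidiagonal; cells within one antidiagonal are independent."""
    for d in range(m + n - 1):
        yield [(i, d - i) for i in range(max(0, d - n + 1), min(m, d + 1))]
```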
4. Experimental Results
We have implemented the algorithm from the previous sections to allow its application to actual biological sequences. The implementation uses C++ with Intel MMX assembly code. As input data for our tests, we used different biological sequence data sets and a fixed window size of 100 (the nature of the sequences does not affect the running time of our algorithm, but may affect the impact of the heuristic speedup employed by the heuristic method we compare the results to). In all experiments, we used a vertical step size of 5, i.e. we only compare every fifth window in the first input to all windows in the other string. Using the scoring scheme described in Section 2 induces a window size of 200 due to the constant-size blowup of the alignment dag. For comparing the results to existing methods, we implemented an alternative fast method for bit-parallel LCS computation [5] to compute the pairwise alignment scores ("BLCS"); our 64-bit implementation of this algorithm achieves a speedup of 32 over the standard dynamic programming algorithm for inputs of length 200. Furthermore, we compared our results to the code used in [12] ("Heur"), which uses the standard dynamic programming algorithm [17,11] and a heuristic to speed up computation when a minimum alignment score for a window pair is specified. In the Seaweed-16 implementation, vectors of 16-bit values were used. Using the results from Section 3, we can improve this to use 8-bit values, by computing (200, 2)-restricted highest-score matrices.
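The bit-parallel comparison method of [5] packs one row of the dynamic programming table into machine words. A word-size-agnostic sketch of this style of recurrence (our formulation; Python's arbitrary-precision integers stand in for the 64-bit words of the actual implementation):

```python
def llcs_bitparallel(x, y):
    """Bit-vector LLCS in the style of Crochemore et al. [5]: one bit per
    character of y, one word-level iteration per character of x."""
    n = len(y)
    mask_n = (1 << n) - 1
    # Match masks: bit j of M[c] is set iff y[j] == c
    M = {}
    for j, c in enumerate(y):
        M[c] = M.get(c, 0) | (1 << j)
    V = mask_n                          # row vector, initially all ones
    for c in x:
        Mc = M.get(c, 0)
        U = V & Mc
        V = ((V + U) | (V & ~Mc)) & mask_n
    # The LLCS is the number of zero bits in V
    return sum(1 for j in range(n) if not (V >> j) & 1)
```

Each iteration is a constant number of word operations, which is the source of the reported speedup over cell-by-cell dynamic programming.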
Table 1. Execution times in seconds and speedups
[Table 1: the numeric entries were lost in extraction. The table reports execution times in seconds and speedups for the methods Heur, BLCS and the seaweed implementations on four data sets — Mikey (2712 × …), Berti (2712 × …), Jimmy (15k × …) and Henry (80k × …) — on three systems: MMX vector speedup on Linux/x86_64/1.83GHz Core2-duo (non-MPI), gcc 4.3.1; a Linux desktop system, Core2-quad 2.66GHz, 64-bit, MPI, gcc 4.3.1; and an IBM HPC Cluster [1], 2 × dual-core Xeon 3GHz per node, 64-bit, MPI, gcc 4.1.2.]

5. Conclusions and Outlook

In this paper, we present a practical algorithm for local string comparison by edit distance filtered dot plots, which uses vector parallelism and recent algorithmic results to achieve both improved asymptotic cost and improved performance over applying optimized standard methods. We have further shown results from a coarse-grained parallel implementation of the algorithm, which achieved good speedup on different parallel systems. Further performance could be gained by using SSE [8] or newer vector architectures like Larrabee [2] for implementing our code. A few algorithmic improvements are possible as well. In [16], a tree approach is proposed to avoid recomputing all seaweeds in each strip of height w, which allows the computation to be performed in time O(mn). We are currently investigating a practical variation of this theoretical method which further reduces the dependency on the window size, and may improve the algorithm shown here. Moreover, we believe that it is possible to use a heuristic similar to the one applied in [12] to further improve performance.

References

[1] Centre for Scientific Computing, University of Warwick.
[2] M. Abrash. A First Look at the Larrabee New Instructions (LRBni). Dr. Dobb's Journal, 2009.
[3] C.E.R. Alves, E.N. Caceres, and S.W. Song. An all-substrings common subsequence algorithm. Discrete Applied Mathematics, 156(7):1025–1035, April 2008.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition. MIT Press/McGraw-Hill Book Company, 2001.
[5] M. Crochemore, C.S. Iliopoulos, Y.J. Pinzon, and J.F. Reid. A fast and practical bit-vector algorithm for the longest common subsequence problem. Inf. Proc. Lett., 80(6):279–285, 2001.
[6] A.J. Gibbs and G.A. McIntyre. The diagram: A method for comparing sequences. Its uses with amino acids and nucleotide sequences. Eur. J. Biochem., 16:1–11, 1970.
[7] D. Gusfield. Algorithms on Strings, Trees, and Sequences. Cambridge Univ. Press, 1997.
[8] Intel Software Developer's Manuals, 2009.
[9] J. Krumsiek, R. Arnold, and T. Rattei. Gepard: a rapid and sensitive tool for creating dotplots on genome scale. Bioinformatics, 23(8):1026–1028, 2007.
[10] J.V. Maizel and R.P. Lenk. Enhanced graphic matrix analysis of nucleic acid and protein sequences. Proc. Nat. Academy of Sciences of the USA, 78(12):7665–7669, 1981.
[11] S.B. Needleman and C.D. Wunsch. A general method applicable to the search of similarities in the amino acid sequence of two proteins. J. Mol. Biology, 48:443–453, 1970.
[12] S. Ott, S. Gunawardana, M. Downey, and G. Koentges. Loss-free identification of alignment-conserved CRMs. In preparation, 2009.
[13] J.P. Schmidt. All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings. SIAM J. Computing, 27(4):972–992, 1998.
[14] M. Snir, S.W. Otto, D.W. Walker, J. Dongarra, and S. Huss-Lederman. MPI: The Complete Reference. MIT Press, Cambridge, MA, USA, 1995.
[15] A. Tiskin. Semi-local longest common subsequences in subquadratic time. J. Discrete Algorithms, 6(4):570–581, 2008.
[16] A. Tiskin. Semi-local string comparison: Algorithmic techniques and applications. Mathematics in Computer Science, 1(4):571–603, 2008. See also arXiv:0707.3619.
[17] R.A. Wagner and M.J. Fischer. The string-to-string correction problem. J. ACM, 21(1):168–173, 1974.