Fluctuations of the Longest Common Subsequence for Sequences of Independent Blocks
Heinrich Matzinger ∗ Felipe Torres † June 4, 2018
Abstract
The problem of the order of the fluctuation of the Longest Common Subsequence (LCS) of two independent sequences has been open for decades, and there exist contradicting conjectures on the topic [1, 2]. Lember and Matzinger [3] showed that for i.i.d. binary strings the variance of the length of the LCS grows linearly in the length of the strings, provided that 0 and 1 have very different probabilities. Nonetheless, for two i.i.d. sequences over a finite alphabet of equiprobable symbols, the typical size of the fluctuation of the LCS remains unknown. In the present article, we determine the order of the fluctuation of the LCS for a special model of i.i.d. sequences made out of blocks. A block is a maximal contiguous substring consisting of only one type of symbol. Our model allows only three possible block lengths, each picked with equal probability. For i.i.d. sequences with equiprobable symbols, the blocks are independent of each other. In order to study the fluctuation of the LCS in this model, we develop a method which reformulates the fluctuation problem as a (relatively) low-dimensional optimization problem. We finally prove that for our model, the order of the fluctuation of the length of the LCS coincides with Waterman's conjecture [2]. We believe that our method can be applied to other models of i.i.d. sequences, although the optimization problem might then be more complicated to formulate and to solve.
1 Introduction

In general, throughout this paper, X and Y will denote two finite strings over a finite alphabet Σ. A common subsequence of X and Y is a string which is a subsequence of X as well as of Y. A Longest Common Subsequence of X and Y (denoted simply by LCS of X and Y, or only LCS when the context is clear enough) is a common subsequence of X and Y of maximal length.

∗ School of Mathematics, Georgia Institute of Technology, 686 Cherry Street, GA 30332-0160 Atlanta, USA
† Fakultät für Mathematik, Universität Bielefeld, Postfach 100131, D-33501 Bielefeld, Germany

1.1 Motivation

Let us motivate the study of the LCS of two strings with an example: let x = ACGTAGTA and y = ACCGTACA be two sequences over the finite alphabet Σ = {A, C, G, T}. A common subsequence of x and y is z = ATA: the string z can be obtained from both x and y by just deleting some letters. We can represent the common subsequence z as an alignment with gaps (a gap is denoted by '−'). The letters which are not in the subsequence get aligned with gaps, so that the alignment aligns the common letters of both sequences. The common subsequence z = ATA can correspond to the following alignment:

x  A C − − G − T A G − T − A
y  A − C C − G T A − C − A −        (1.1)

The representation of a subsequence as an alignment with gaps is not necessarily unique. However, each alignment with gaps defines exactly one common subsequence. We are interested only in alignments which align same-letter pairs or letters with gaps. In this paper, an alignment which aligns a maximum number of letter pairs of x and y is called an optimal alignment. The subsequence defined by an optimal alignment is hence an LCS. The LCS of x and y is LCS(x, y) = ACGTAA and corresponds to the optimal alignment:

x          A C − G T A G − T A
y          A C C G T A − C − A
LCS(x, y)  A C G T A A             (1.2)

In Bioinformatics (for instance [4, 5]), one of the main problems is to decide if two sequences are related or not.
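The length of an LCS can be computed with the classical dynamic program. The following minimal Python sketch is our own illustration (not part of the original paper); the example strings are reconstructed from the alignment displays above.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of x and y.

    Classical O(len(x) * len(y)) dynamic program:
    L[i][j] = |LCS(x[:i], y[:j])|.
    """
    m, n = len(x), len(y)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]

# The two example DNA strings from the alignment displays above:
print(lcs_length("ACGTAGTA", "ACCGTACA"))  # 6, the length of ACGTAA
```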
If they are, it probably means that they evolved from a common ancestor, so related sequences should look somehow similar. Biologists try to determine which parts are related by finding an alignment which aligns the related parts. In our current example, the sequences x = ACGTAGTA and y = ACCGTACA are somehow similar, but if we compare them letter by letter the similarity does not become obvious:

x  A C G T A G T A
y  A C C G T A C A        (1.3)

In the alignment without gaps (1.3) we aligned mostly non-corresponding letter pairs, obtaining only 3 aligned same-letter pairs: (from left to right) the first, the second and the last pair. This is much less than what the optimal alignment (1.2) achieved. A possible explanation why (1.3) is worse at revealing similarities than (1.2) is that some letters "got lost" in the evolution process, so that they are present in only one of the two sequences; it is therefore more useful to consider alignments with gaps when looking for similarities. Longest Common Subsequences and Optimal Alignments [6, 7] are the main tools in computational biology to recognize when strings are similar. A relatively long LCS indicates that the strings are related, but how long does the LCS need to be to imply relatedness? Sequences which are not related are stochastically independent. Could it be that independent random strings have a long LCS just by chance? To answer these questions, we need to determine the size of the fluctuation of the LCS of independent strings. We are interested in the asymptotics of the fluctuation since we mainly consider long sequences.

1.2 Notation and history

Let X = X_1 X_2 ... X_n and Y = Y_1 Y_2 ... Y_n be two stationary random sequences which are independent of each other, both drawn from the same finite alphabet Σ. Let L_n := |LCS(X, Y)| denote the length of the LCS of X and Y. A simple sub-additivity argument [1] shows that the expected length of the LCS divided by n converges to a constant:

lim_{n → ∞} E[L_n] / n =: γ > 0.

The constant γ depends on the distribution of X and Y. But even for such simple cases as i.i.d. sequences with equiprobable symbols, the exact value of γ is not known. Chvátal and Sankoff [1] derived upper and lower bounds for γ. These bounds were further refined by Baeza-Yates, Gavaldà, Navarro and Scheihing [8], Deken [9], Dančík and Paterson [10, 11], and finally Durringer, Hauser, Martinez and Matzinger [12, 13]. The asymptotic value of the rescaling coefficient γ as the number of symbols (the size of Σ) goes to infinity was determined by Kiwi, Loebl and Matoušek [14]. The speed of convergence was obtained by Alexander [15] using techniques from percolation theory.

The order of magnitude of the fluctuation of L_n is unknown for situations as simple as i.i.d. sequences of equiprobable letters. In [2] Waterman conjectured that, in many situations, the variance of the length of the LCS is of linear order:

VAR[L_n] = Θ(n).        (1.4)

Here the order Θ(n) means that there exist constants 0 < a < b such that an ≤ VAR[L_n] ≤ bn for all n ∈ N (the constants a and b might depend on the distribution of X and Y). So far, Lember and Matzinger [3] proved the order given in (1.4) for binary i.i.d. sequences, but only when the probability of 1 is much smaller than the probability of 0. Durringer, Lember and Matzinger [16] obtained the same order when one sequence is non-random, binary and periodic whilst the other binary sequence is i.i.d. Bonetto and Matzinger [17] proved the same order when the first sequence is drawn from a three-letter alphabet {0, 1, a} whilst the second sequence is binary. Finally, Houdré and Matzinger [18] proved the same order when the two sequences are binary and i.i.d.
but the scoring function which defines the alignment is such that one letter has a somewhat larger score than the other letter. Recall that in [19], Steele proved that there exists a constant c > 0 such that VAR[L_n] ≤ c · n, regardless of the alphabet Σ. This means that one only needs to find good lower bounds for the variance of L_n in order to obtain results on the fluctuation of L_n.

The LCS problem can be formulated as another popular open problem in probability theory, namely the Last Passage Percolation problem with correlated weights. The equivalence is as follows: let the set of vertices in our percolation setting be V := {0, 1, 2, ..., n} × {0, 1, 2, ..., n}. The set of oriented edges E ⊂ V × V contains horizontal, vertical and diagonal edges. The horizontal edges are oriented to the right, whilst the vertical edges are oriented upwards; both have unit length. The diagonal edges point up-right at a 45-degree angle and have length √2. Hence

E := {(v, v + e_1), (v, v + e_2), (v, v + e_3) | v ∈ V},

where e_1 := (1, 0), e_2 := (0, 1) and e_3 := (1, 1). To the horizontal and vertical edges we associate weight 0. To the diagonal edge from (i, j) to (i + 1, j + 1) we associate the weight 1 if X_{i+1} = Y_{j+1} and −∞ otherwise. In this manner, the length of the LCS, L_n = |LCS(X_1 X_2 ... X_n, Y_1 Y_2 ... Y_n)|, is equal to the total weight of the heaviest path going from (0, 0) to (n, n). Note that the weights on our 2-dimensional graph are not "truly 2-dimensional": they depend only on the one-dimensional sequences X = X_1 ... X_n and Y = Y_1 ... Y_n.

The LCS problem is also related to the problem of the Longest Increasing Subsequence of a random permutation (LIS for short): the LIS can be seen as the LCS of two sequences, where one is a sequence of randomly permuted numbers and the other is the sequence of increasing integers. Take for example 5 cards numbered from 1 to 5. Mix them thoroughly (until each permutation is equally likely). Then lay them down face up in one line on a table. For example, you could obtain the permutation:

2 3 1 5 4

A longest increasing subsequence here is 2, 3, 5. We designate by l_n the length of the longest increasing subsequence of such a random permutation, so in our case l_5 = 3. Note that the length of the LIS is equal to the length of the LCS of the permutation and the sequence of increasing numbers. In our example l_5 = |LCS(23154, 12345)| = 3. Thanks to this relation and the recent tremendous breakthroughs in the study of the LIS problem, many people were optimistic about finding a solution to the LCS problem by applying the new techniques from the LIS problem. Unfortunately, nobody has succeeded so far in doing that. Moreover, we now believe that the LCS problem and the LIS problem are essentially from different classes, though they have some features in common; for instance, both can be seen as last passage percolation models, since the LIS problem is asymptotically equivalent to a special last passage percolation process on a Poisson graph. Let us recall some basic results about the LIS problem. In [20], Baik, Deift and Johansson proved that

(l_n − 2√n) / n^{1/6}

converges in distribution as n → ∞ to a so-called Tracy-Widom distribution (here l_n denotes the length of the longest increasing subsequence of a random permutation drawn from the symmetric group S_n with the uniform distribution). This limiting distribution can be obtained via the solution of the Painlevé II equation. It was first obtained by Tracy and Widom [21, 22] in the framework of Random Matrix Theory, where it gives the limit distribution for the (centered and scaled) largest eigenvalue in the Gaussian Unitary Ensemble of Hermitian matrices. The problem of the asymptotics of l_n was first raised by Ulam [23]. Substantial contributions to the solution of the problem have been made by Aldous and Diaconis [24], Hammersley [25], Logan and Shepp [26], and Vershik and Kerov [Vershik/Kerov 1977, Soviet Math. Dokl.].
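The relation between the LIS and the LCS described above is easy to check numerically. The sketch below is our own illustration (not the authors' code): it compares a direct LIS computation via patience sorting with the LCS of the permutation against the increasing sequence.

```python
from bisect import bisect_left

def lis_length(perm):
    """Length of the longest increasing subsequence, via patience
    sorting (O(n log n)): piles[i] holds the smallest possible tail
    of an increasing subsequence of length i + 1."""
    piles = []
    for v in perm:
        i = bisect_left(piles, v)
        if i == len(piles):
            piles.append(v)
        else:
            piles[i] = v
    return len(piles)

def lcs_length(x, y):
    """Classical LCS dynamic program, on arbitrary sequences."""
    L = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            L[i][j] = (L[i - 1][j - 1] + 1 if x[i - 1] == y[j - 1]
                       else max(L[i - 1][j], L[i][j - 1]))
    return L[len(x)][len(y)]

perm = [2, 3, 1, 5, 4]  # the example permutation from the text
# LIS of the permutation = LCS of the permutation with the sorted sequence:
print(lis_length(perm), lcs_length(perm, sorted(perm)))  # 3 3
```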
Coming back to the reason why we believe the LCS problem and the LIS problem are essentially different, we can say the following: in the LIS case, the order of the fluctuation is the cube root of the expectation. So, if the fluctuation of L_n were also the cube root of the expectation, then VAR[L_n] should be of order n^{2/3}. This is the order of magnitude conjectured by Chvátal and Sankoff [1], for which several people have heuristic arguments. We believe that this order is wrong for the LCS problem, based on all the works cited above (and the present article) which confirmed Waterman's conjecture in many cases. However, for short sequences (small n) the order conjectured by Chvátal and Sankoff (which corresponds to the order of the fluctuation of the LIS) might be what one approximately observes in simulations. We believe that for short sequences, the underlying percolation structure shared between the LCS problem and the LIS problem makes the two fluctuations look the same, though the situation changes for large n: for short sequences, the correlation of the weights in the LCS problem has no strong effect and the system behaves as if the weights were independent, as in the Poisson graph situation. So far, these arguments have not been rigorously proved, which turns them, in our opinion, into attractive open questions in the area.

2 The model and main results

Let l > 1. Let B^X_1, B^X_2, ... and B^Y_1, B^Y_2, ... be two i.i.d. sequences independent of each other such that:

P(B^X_i = l − 1) = P(B^X_i = l) = P(B^X_i = l + 1) = 1/3,
P(B^Y_i = l − 1) = P(B^Y_i = l) = P(B^Y_i = l + 1) = 1/3.

We call the runs of 0's and 1's blocks. Let X^∞ = X_1 X_2 X_3 ... be the binary sequence such that the i-th block has length B^X_i, where X_1 is chosen to be 0 with probability 1/2 and 1 with probability 1/2. Similarly, let Y^∞ = Y_1 Y_2 Y_3 ... be the binary sequence such that the i-th block has length B^Y_i and Y_1 is chosen to be 0 with probability 1/2 and 1 with probability 1/2.

Example 2.1
Assume that X_1 = 1 and B^X_1 = 2, B^X_2 = 3 and B^X_3 = 1. Then the sequence X^∞ starts as follows: X^∞ = 1100010..., meaning that in X^∞ the first block consists of two 1's, the second block consists of three 0's, the third block consists of one 1, etc.

Let X denote the sequence obtained by taking only the first n bits of X^∞, namely X = X_1 X_2 X_3 ... X_n, and similarly Y = Y_1 Y_2 Y_3 ... Y_n. Let L_n denote the length of the LCS of X and Y, L_n := |LCS(X, Y)|. The main result of this paper states that for l large enough, the order of the fluctuation of L_n is √n:

Theorem 2.1
There exists l_0 such that for all l ≥ l_0 we have:

VAR[L_n] = Θ(n) for n large enough.

We show that the above theorem is equivalent to proving that "a certain random modification has a biased effect on L_n". This is a technique with similar approaches in other papers (for instance see [3], [17]). So the main difficulty is actually proving that the random modification typically has a biased effect on the LCS. This random modification is performed as follows: we choose at random in X a block of length l − 1 and a block of length l + 1; this means that all the blocks in X of length l − 1 have the same probability to be chosen and we pick one of them, and likewise all the blocks in X of length l + 1 have the same probability to be chosen and we pick one of those. Then we change the length of both these blocks to l. The resulting new sequence is denoted by X̃. Let L̃_n denote the length of the LCS after our modification of X. Hence:

L̃_n := |LCS(X̃, Y)|.

If we can prove that our block-length-changing operation typically has a biased effect on the LCS, then the order of the fluctuation of L_n is √n. This is the content of the next theorem:

Theorem 2.2
Assume that there exist ε > 0 and α > 0 not depending on n such that for all n large enough we have:

P( E[L̃_n − L_n | X, Y] ≥ ε ) ≥ 1 − exp(−n^α).        (2.1)

Then,

VAR[L_n] = Θ(n) for n large enough.

The above theorem reduces the problem of the order of the fluctuation to proving that our random modification typically has a higher probability to lead to an increase than to a decrease in score. The proof of this result is not included in the present article for reasons of brevity; all the details are in [27]. In all that follows, we assume that theorem 2.2 is true. The next step is to ask: how can we prove, in our block model, that condition 2.1 is satisfied? In theorem 2.3, we see that condition 2.1 can be obtained from the positive solution of a minimization problem. This minimization problem concerns the proportions of symbols which build up the LCS, placed in a 9-dimensional space. By using Lagrange multiplier techniques, we are able to further reduce it to a parametrized 3-dimensional optimization problem. Furthermore, we verify numerically and graphically that the positive minimum condition already holds for l > 5, which implies that VAR[L_n] = Θ(n) holds already for l = 6. Details on the solution of this minimization problem can be found in [28].

The article is organized as follows: in what is left of section 2, we explain how to relate the effect of the random modification to a constrained optimization problem on the proportions of symbols used to build up the LCS, and how this relation is used to prove theorem 2.1. In section 3, we discuss some combinatorial aspects of aligned blocks in optimal alignments specific to this block model. Finally, in section 4 we devote ourselves to proving theorem 2.3.

2.1 Random modification and proportion of aligned blocks

Let us next look, with the help of an example, at when the random modification introduces an increase or a decrease in the score:
Example 2.2
Let us suppose l = 3. Let us take the two sequences x = 00110011110000111 and y = 0011100001100001111. An optimal alignment (in the sense of Example 1.1) would be:

x  00 11− 00−− 1111 0000 111−
y  00 111 0000 11−− 0000 1111        (2.2)

In this example no block gets left out completely. By this we mean that no block is aligned only with gaps. The first block of x is aligned with the first block of y. The second block of x is aligned with the second block of y; by this we mean that all the bits from the second block of x are either aligned with bits of the second block of y or with gaps, and vice versa. We have that the second block of the LCS is hence obtained from the second blocks of x and y by taking the minimum of their respective lengths. In our current special example, for all i = 1, 2, ..., 6, the i-th block of x gets aligned with the i-th block of y. We can represent this idea visually by viewing the alignment as an alignment of blocks in the following manner:

x    00 11  00   1111 0000 111
y    00 111 0000 11   0000 1111
LCS  00 11  00   11   0000 111        (2.3)
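The block decomposition used throughout this section is straightforward to compute. Here is a small Python sketch of our own (not the authors' code) that splits a binary string into blocks and evaluates the block-by-block alignment score of example 2.2.

```python
from itertools import groupby

def block_lengths(s):
    """Lengths of the maximal runs (blocks) of identical symbols in s."""
    return [len(list(g)) for _, g in groupby(s)]

x = "00110011110000111"
y = "0011100001100001111"
bx, by = block_lengths(x), block_lengths(y)
print(bx)  # [2, 2, 2, 4, 4, 3]
print(by)  # [2, 3, 4, 2, 4, 4]

# When the i-th block of x is aligned with the i-th block of y (no block
# left out), each aligned pair contributes the minimum of the two lengths
# to the common subsequence:
score = sum(min(a, b) for a, b in zip(bx, by))
print(score)  # 15 = |00 11 00 11 0000 111|
```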
Let us next analyze the expected change when we perform our random modification. In x there are exactly 3 blocks of length l − 1 = 2: the first three blocks of x. The first block of x of length 2 is aligned with a block of y of length 2, the second one with a block of length 3, and the third with a block of length 4. Hence, when we increase the length of the first block of length 2 of x by one, the score does not increase. When we increase the second or the third, however, the score increases by one unit. Each of these blocks has the same probability 1/3 to get drawn. Hence, the conditional expected increase due to the enlargement of a randomly chosen block of length 2 is, in this case, equal to 2/3. In our random modification we also choose a block of length l + 1 and decrease it to length l. In our example, there are two blocks in x of length l + 1 = 4: the fourth and fifth blocks of x. The fourth block is aligned with a block of length 2 whilst the fifth is aligned with a block of length 4. Hence, when we decrease the length of the fourth block we get no change in score, whilst when we decrease the fifth we get a decrease by one unit. Each of the two blocks has the same probability 1/2 to get drawn. This implies that the expected change due to decreasing a randomly chosen block of length 4 is equal to −1/2. Adding the two changes, we find that for x and y defined as in the current example, the conditional expected change is equal to:

E[L̃_n − L_n | X = x, Y = y] = 2/3 − 1/2 = 1/6.        (2.4)
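The conditional expected change computed in example 2.2 can be checked mechanically. The sketch below is our own illustration, with the pairing of blocks hard-coded from the example: it averages the effect of growing a random (l − 1)-block and shrinking a random (l + 1)-block.

```python
from fractions import Fraction

# Aligned block pairs (x-length, y-length) of example 2.2, with l = 3:
pairs = [(2, 2), (2, 3), (2, 4), (4, 2), (4, 4), (3, 4)]
l = 3

# Growing an x-block from l-1 to l raises min(x_len, y_len) iff the
# aligned y-block has length >= l:
grow = [Fraction(int(j >= l)) for i, j in pairs if i == l - 1]
# Shrinking an x-block from l+1 to l lowers the minimum iff the
# aligned y-block has length >= l+1:
shrink = [Fraction(int(j >= l + 1)) for i, j in pairs if i == l + 1]

expected_change = sum(grow) / len(grow) - sum(shrink) / len(shrink)
print(expected_change)  # 1/6  ( = 2/3 - 1/2 )
```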
In our example we have six aligned block pairs, leading to the following set of pairs of lengths: {(2, 2), (2, 3), (2, 4), (4, 2), (4, 4), (3, 4)}. Let p_ij designate the proportion of aligned block pairs which have the x-block of length i and the y-block of length j.

Example 2.3
For our example above we have:

( p_22 p_23 p_24 )   ( 1/6 1/6 1/6 )
( p_32 p_33 p_34 ) = (  0   0  1/6 )        (2.5)
( p_42 p_43 p_44 )   ( 1/6  0  1/6 )

With this notation, equality (2.4) can be written as:

E[L̃_n − L_n | X = x, Y = y] ≥ (p_{l−1,l} + p_{l−1,l+1}) / (p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}) − p_{l+1,l+1} / (p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1})        (2.6)

The inequality (2.6) holds if there exists an optimal alignment a of x and y leaving out no blocks and having a proportion p_ij of aligned block pairs such that the x-block has length i and the y-block has length j (for every i, j ∈ {l − 1, l, l + 1}). Typically, for large n, the optimal alignment will not be like in the example above: there will be blocks which are left out, which implies also that some blocks are aligned with several blocks at the same time. Let us check an example:

Example 2.4
Let x = 00110011100011000 and y = 00001111000011000. In this situation the LCS is equal to LCS(x, y) = 000011100011000 and corresponds to the following optimal alignment, which in block representation is:

x  00 11 00  111  000  11 000
y  0000      1111 0000 11 000        (2.7)

In the alignment above we see that the first block of y is aligned with the first and third blocks of x. This implies that the second block of x is "completely left out", which means all its bits are aligned with gaps. The other blocks are aligned one block with one block: the fourth block of x is aligned with the second block of y, whilst the fifth block of x is aligned with the third block of y; the sixth block of x is aligned with the fourth block of y, and finally the last blocks of x and y are aligned with each other.

In everything that follows, the proportions p_ij will refer only to the block pairs aligned one block with one block. Hence, in the alignment (2.7), the first three blocks of x and the first block of y do not contribute to {p_ij}_{i,j}.

Example 2.5
In the last example above there are 4 block pairs aligned one block with one block. The corresponding pairs of block lengths are: (3, 4), (3, 4), (2, 2), (3, 3). Hence for the alignment (2.7), we find p_{3,4} = 2/4, p_{2,2} = 1/4, p_{3,3} = 1/4 and p_ij = 0 for all (i, j) ∉ {(3, 4), (2, 2), (3, 3)}. We will denote by q^X, resp. q^Y, the proportion of left-out blocks in x, resp. in y. In the alignment (2.7), the sequence x has one left-out block out of a total of 7 blocks. This implies that q^X = 1/7. There is no left-out block in y, so that q^Y = 0. In section 3, we will see that typically, for n large enough, q^X and q^Y can be taken as close to each other as we want. When q^X = q^Y we denote the proportion of left-out blocks by q. When we choose a block of length l − 1 in x to increase its length, we will have to consider the probability that the block is not aligned one block with one block. In the alignment (2.7), there are 4 blocks in x of length l − 1 = 2. The first three are not aligned one block with one block: the second is left out, whilst the first and the third block are aligned with the same block of y. Hence in (2.7) the proportion of blocks not aligned one to one among the blocks of length 2 is 3/4. On the other hand, there is no block of length l + 1 = 4 in x, so for the alignment (2.7) the proportion of blocks not aligned one to one among the blocks of length 4 is 0. Using some combinatorial arguments, in section 3 we will see that typically the proportion of blocks not aligned one to one among the blocks of x of length l − 1 is at most 3q. Similarly, for the blocks of length l + 1 in x one gets a bound 3q for the proportion of blocks aligned with several blocks of y or left out.
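The quantities p_ij, q^X and q^Y of examples 2.4 and 2.5 can be tabulated directly. Here is a small sketch of our own, with the one-to-one aligned pairs and the left-out counts taken from the example.

```python
from collections import Counter
from fractions import Fraction

# One-to-one aligned block pairs (x-length, y-length) of example 2.4:
pairs = [(3, 4), (3, 4), (2, 2), (3, 3)]
counts = Counter(pairs)
# Empirical distribution p_ij over the one-to-one aligned pairs:
p = {ij: Fraction(c, len(pairs)) for ij, c in counts.items()}
print(p[(3, 4)])  # 1/2

# Left-out proportions: x has 7 blocks with 1 left out, y has 5 blocks
# with none left out:
q_x = Fraction(1, 7)
q_y = Fraction(0)
print(q_x, q_y)  # 1/7 0
```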
We can rewrite the lower bound on the right side of inequality (2.6), taking the left-out blocks into account. Assuming that there is an equal proportion q of blocks which are not aligned one to one in x and in y, we get the following lower bound for the conditional expected increase in the LCS:

(p_{l−1,l} + p_{l−1,l+1}) / (p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}) · (1 − 3q) − p_{l+1,l+1} / (p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1}) · (1 − 3q) − 3q        (2.9)

The above lower bound for the conditional expected increase in LCS holds assuming that the following conditions hold:

• There exists an optimal alignment a leaving out exactly the same proportion q of blocks in X and in Y. For that optimal alignment a, let {p_ij}_{i,j} denote the empirical distribution of the aligned block pairs, so that p_ij = P_ij(a).
• There is exactly the same number of blocks in X and in Y.
• In X, each block length l − 1, l, l + 1 constitutes exactly 1/3 of the blocks, and similarly in Y.

The above conditions do not typically hold exactly, but only approximately. We first look at this somewhat simplified case before looking at the general case (for the general case, see the proof of theorem 2.3). Let us next explain how we get the bound (2.9) in this somewhat simplified case (the reader should also compare it to the version (2.6) with no left-out blocks). Assume that we have an optimal alignment a with given empirical distribution {p_ij}_{i,j ∈ {l−1,l,l+1}} of the aligned block pairs, leaving out in both sequences x and y a proportion q of blocks. What is now the effect of our random change on the score of the alignment a? First, let us look at the randomly chosen block of length l − 1 which gets increased to length l. If that block is aligned with a block of length l or l + 1, the score gets increased by one unit. So, conditional on the randomly chosen block of length l − 1 being aligned one block with one block, the probability of an increase is at least:

(p_{l−1,l} + p_{l−1,l+1}) / (p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}).

Now, if the randomly chosen block of length l − 1 is aligned with a block of Y which is itself aligned with several blocks of X (let us call it a polygamist block), then we have no increase. The same happens if the block is not aligned with any block of Y. There is at most a proportion of 3q blocks of length l − 1 which are not aligned with any block or aligned with a polygamist block of Y, since about a proportion 1/3 of all blocks have length l − 1. Hence, among the blocks of length l − 1, there is a proportion of at least 1 − 3q which are aligned one block with one block. Hence we get that the conditional expected change due to increasing the randomly chosen block of length l − 1 to length l is at least:

(p_{l−1,l} + p_{l−1,l+1}) / (p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}) · (1 − 3q).        (2.10)

Similarly, we can analyze the effect on the randomly chosen block of length l + 1, which gets reduced to length l. If the block is aligned one block to one block and the length of the aligned block of Y is l + 1, then the score can get reduced by one. If the block is aligned with a block of Y of length l or l − 1, the score does not change. Hence, conditional on the chosen block of length l + 1 being aligned one block to one block, the conditional expected change is not less than:

−p_{l+1,l+1} / (p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1}).

On the other hand, when the chosen block of length l + 1 is aligned with several blocks of Y, the score can go down by one unit. There is at most a proportion q of blocks of X aligned with several blocks of Y; among the blocks of length l + 1 this represents a proportion of at most 3q. Hence we get that, at worst, the expected change due to changing a random block from l + 1 to l is equal to:

−p_{l+1,l+1} / (p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1}) · (1 − 3q) − 3q.        (2.11)

Putting (2.10) and (2.11) together, we get that the expected conditional change of the alignment score is bounded below as follows:

E[ΔL_a | X, Y] ≥ (p_{l−1,l} + p_{l−1,l+1}) / (p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}) · (1 − 3q) − p_{l+1,l+1} / (p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1}) · (1 − 3q) − 3q,

where ΔL_a denotes the change in score of the alignment a due to the random modification of X. To prove inequality (2.1) of theorem 2.2, it is thus sufficient to show that, for all optimal alignments a of X and Y, expression (2.9) is positive and bounded away from zero with high probability. Hence the next question is: how can we prove that typically, for large n, expression (2.9) is larger than a positive constant not depending on n?
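The lower bound just derived can be packaged as a function of the empirical distribution and the left-out proportion. The sketch below is our own; in particular, we read the correction factors as (1 − 3q) and −3q, following the 3q bounds stated in the text, so the exact constants here should be treated as an assumption of this illustration.

```python
def lower_bound(p, q, l):
    """Lower bound on the conditional expected change E[L~_n - L_n | X, Y]
    for an alignment with aligned-pair distribution p and left-out
    proportion q. p maps (i, j), i, j in {l-1, l, l+1}, to proportions."""
    row = lambda i: sum(p.get((i, j), 0.0) for j in (l - 1, l, l + 1))
    gain = (p.get((l - 1, l), 0.0) + p.get((l - 1, l + 1), 0.0)) / row(l - 1)
    loss = p.get((l + 1, l + 1), 0.0) / row(l + 1)
    return gain * (1 - 3 * q) - loss * (1 - 3 * q) - 3 * q

# With the distribution of example 2.2 (six equiprobable pairs) and q = 0,
# this reduces to the earlier computation 2/3 - 1/2 = 1/6:
p = {(2, 2): 1/6, (2, 3): 1/6, (2, 4): 1/6,
     (4, 2): 1/6, (4, 4): 1/6, (3, 4): 1/6}
print(lower_bound(p, q=0.0, l=3))
```

Note that the bound is decreasing in q: leaving out more blocks can only weaken the guaranteed bias.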
Example 2.6 Let us return to the example of alignment (2.7). That alignment left out only one block, namely the second block of X. We could now proceed in a different order: we could first decide which blocks get left out, before generating the random sequences X and Y. The resulting alignment is in general not optimal. On the other hand, such an alignment has the property that the block pairs aligned one to one are i.i.d. This is a very useful property, for instance for large deviation estimates. Let us give an example. Assume we request that the only left-out block be the second block of X (as in alignment (2.7)). Assume we redraw X and Y and obtain X = 00111001100011000 and Y = 00011110000111000. Then we get as alignment and Common Subsequence (CS) the following block representation:

X   00 111 00 11   000  11  000
Y   000       1111 0000 111 000
CS  000       11   000  11  000        (2.12)

In this case we use the term common subsequence instead of longest common subsequence because we are leaving a block out of the alignment; if we did not leave it out we might get a longer common subsequence (which does not happen in this case, but might happen in the general case). So, in this last example, before drawing X and Y, we know that the fourth block of X gets aligned with the second block of Y, and this aligned pair builds the second block of the CS. The second block of the CS thus has length equal to min{B^X_4, B^Y_2}. Similarly, before even drawing X and Y, we know that the fifth block of X gets aligned with the third block of Y. Hence, the pair of lengths in the second block pair is (B^X_4, B^Y_2), whilst the third block of the CS has length min{B^X_5, B^Y_3}. Note that (B^X_4, B^Y_2) is independent of (B^X_5, B^Y_3), and B^X_4 is independent of B^Y_2 whilst B^X_5 is independent of B^Y_3. The distribution of each of the blocks B^X_4, B^Y_2, B^X_5 and B^Y_3 is unchanged: they take the values l − 1, l or l + 1 with equal probability 1/3. Hence (B^X_4, B^Y_2) can take any of the nine values in the set {(i, j) | i, j = l − 1, l, l + 1} with probability 1/9 each.

In this way, once the left-out blocks are fixed before drawing X and Y, the aligned block pairs are "almost" i.i.d. Why do we say "almost"? In the above example (B^X_4, B^Y_2) and (B^X_5, B^Y_3) are i.i.d. and not just close to i.i.d. On the other hand, the last block of X in the case (2.7) is no longer in X if, for instance, the first, third and fourth blocks each get increased by one unit. In this sense the blocks are not completely independent. But since we take n large, this is only a minor effect. We will take care of this detail in section 4 and until then pretend that the aligned block pairs are i.i.d.

Note that for each alignment a defined by specifying which blocks are left out before drawing X and Y, the empirical distribution of the aligned blocks is random. We write {P_ij(a)}_{i,j ∈ {l−1,l,l+1}} for this empirical distribution. Thus, P_ij(a) denotes the proportion of aligned block pairs where the block of X has length i and the block of Y has length j. Given a non-random distribution {p_ij}_{i,j ∈ {l−1,l,l+1}}, we can ask what the probability is for the empirical distribution to be equal to {p_ij}_{i,j}. Since the block pairs are close to i.i.d., this probability is close to a multinomial probability:

P( P_ij(a) = p_ij, ∀ i, j ∈ I_l ) ≈ ( n* choose p_{l−1,l−1}n*, p_{l−1,l}n*, ..., p_{l+1,l}n*, p_{l+1,l+1}n* ) (1/9)^{n*},        (2.13)

where n* designates the total number of aligned block pairs (here we act as if that number were non-random). By using Stirling's formula, expression (2.13) is approximately equal to:

e^{(ln(1/9) + H(p)) n*},        (2.14)

where H(p) designates the entropy of the empirical distribution:

H(p) = Σ_{i,j ∈ {l−1,l,l+1}} p_ij ln(1/p_ij).

A question arises: for a given aligned-block-pair distribution {p_ij}, is it likely that there exists an alignment with that distribution and having a proportion q of left-out blocks?
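The Stirling approximation (2.14) of the multinomial probability is easy to sanity-check numerically. The sketch below is our own illustration: it compares the exact log-probability with (ln(1/9) + H(p)) n* for the uniform distribution on the nine block-length pairs, where the approximation exponent is exactly 0 and the correction is of lower order than n*.

```python
import math

def H(p):
    """Entropy of a finite distribution, in natural logarithm."""
    return sum(v * math.log(1.0 / v) for v in p if v > 0)

# Uniform distribution on the 9 possible (i, j) block-length pairs:
p = [1.0 / 9] * 9
n_star = 90       # hypothetical number of aligned block pairs
counts = [10] * 9  # ten pairs of each of the nine types

# Exact log-probability: log[ n*! / prod(k_ij!) * (1/9)^n* ]
log_exact = (math.lgamma(n_star + 1)
             - sum(math.lgamma(k + 1) for k in counts)
             + n_star * math.log(1.0 / 9))
# Stirling approximation (2.14): (ln(1/9) + H(p)) * n*, which is 0 here
log_approx = (math.log(1.0 / 9) + H(p)) * n_star

print(H(p) - math.log(9))      # ~0: the entropy is maximal, ln 9
print(log_exact, log_approx)   # exact value is negative but o(n*)
```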
Let A(q) denote the set of alignments leaving out a proportion q of blocks. Let A denote the event that there exists an alignment in A(q) having its empirical distribution equal to {p_ij}. An upper bound for the probability P(A) is given by the number of elements in A(q) times the probability (2.13). By using (2.14), this product is close to:

|A(q)| · e^{(ln(1/9) + H(p)) n*}.        (2.15)

But the size of the set A(q) is approximately equal to e^{2H(q)n/l}, since there are about n/l blocks in each sequence (here H(q) = q ln(1/q) + (1 − q) ln(1/(1 − q)) denotes the entropy of a Bernoulli variable with parameter q). Hence, expression (2.15) is approximately equal to:

e^{2H(q)n/l + (ln(1/9) + H(p)) n*}.        (2.16)

If we want the event A not to have exponentially small probability in n, we need the logarithm of (2.16) to be non-negative, which leads to the condition:

2H(q) + (1 − q)(ln(1/9) + H(p)) ≥ 0,        (2.17)

where we used as lower bound on n* the number (n/l)(1 − q).

We can now explain how we prove that typically, for all optimal alignments, expression (2.9) is larger than a positive constant not depending on n. For this we simply need to find a q_0 such that we can prove that the optimal alignment leaves out at most a proportion q ≤ q_0 of blocks, and then show that expression (2.9) is bounded away from zero under condition (2.17) for q ∈ [0, q_0].

Let F_n(q_0) be the event that any optimal alignment of X and Y leaves out at most a proportion q_0 of blocks in X and at most the same proportion q_0 of blocks in Y. In more detail, given q_0 > 0 and an optimal alignment a of X and Y in F_n(q_0): if we count the number of blocks that are left out (not used in a) and divide this number by the total number of blocks in X to obtain q^X, and also divide the number of left-out blocks of Y by the total number of blocks in Y to obtain q^Y, then we know that q^X ≤ q_0 and q^Y ≤ q_0.

Example 2.7 Let us take again the case where X = 00111001100011000 and Y = 00011110000111000; then we have as before the following common subsequence (CS), represented as an alignment of blocks:

X   00 111 00 11   000  11  000
Y   000       1111 0000 111 000
CS  000       11   000  11  000
Let us compute q1 and q2 in this case. For X we have a total of 7 blocks and only 1 block is left out in the alignment, so q1 = 1/7. For Y we have no left-out blocks, so q2 = 0. Then, given q > 0, this alignment belongs to F_n(q) if and only if q1 = 1/7 ≤ q and q2 = 0 ≤ q. The next theorem says that if we can bound expression 2.9 away from zero under condition 2.17, then we typically have the desired bias for E[L̃_n − L_n | X, Y], the conditional expected increase in score:
Theorem 2.3
Assume that there exists q0 ∈ [0, 1/3) such that the following minimizing problem:

min [ (p_{l−1,l} + p_{l−1,l+1}) / (p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}) · (1 − q)
      − p_{l+1,l+1} / (p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1}) · (1 − q) − q ]    (2.18)

under the conditions:

q ∈ [0, q0],  Σ_j p_{l−1,j} ≥ ((1/3) − q)/2,  Σ_j p_{l+1,j} ≥ ((1/3) − q)/2,    (2.19)

Σ_{i,j ∈ I} p_ij = 1,  p_ij ≥ 0, ∀ i, j ∈ I,    (2.20)

2 H(q) + (1 − q)(ln(1/9) + H(p)) ≥ 0,    (2.21)

has a strictly positive solution. Let this minimum be equal to ε > 0. Then we have that:

P( E[L̃_n − L_n | X, Y] ≥ ε ) ≥ 1 − e^{−nβ} − P(F_n^c(q0)),    (2.22)

where β > 0 is a constant not depending on n.

Note that the high probability of the bias effect is only given when P(F_n^c(q0)) is small (recall that F_n(q0) is the event that in any optimal alignment the proportion of left-out blocks is at most q0). This means that, in order to apply the above theorem, we first need to come up with a way to bound the proportion q of left-out blocks in any optimal alignment. If the proportion of left-out blocks q is too high, the joint distribution P_ij(a) of the aligned block lengths could be just about anything. In other words, the entropy condition 2.21 becomes useless when q is not small enough. We can now summarize how to apply the last theorem above: we first need to establish that the proportion of left-out blocks is small enough. This means that we need to find a q0 for which P(F_n(q0)) is close to 1 and which is small enough that the objective function 2.18 is bounded away from 0 under the constraints 2.19, 2.20 and 2.21. In section 3.1, we show that the proportion of left-out blocks typically does not exceed any q0 for which q0 > (4/9)/(l − 1), where l is the average block length. With this bound on q (that is, taking q0 = (4/9)/(l − 1)), we verify in section ?? that the objective function 2.18 is bounded away from 0 under our constraints, already for l = 6. This then implies that for any l ≥ 6, the order of the fluctuation is VAR[L_n] = Θ(n). In the next section, we prove the last theorem precisely, taking care of further details, for example that the proportion of left-out blocks in X and in Y does not coincide in every alignment, only approximately in the optimal alignment. Another important point is that the probability P(F_n^c(q)) depends on the parameter l. In section 3.1, we show how to find upper bounds on the proportion of left-out blocks. In general, the larger l is, the better the bounds get. Actually the bounds even converge to zero as l goes to infinity. As q goes to zero, expression 2.18 gets close to 1/3 when l is large enough.
Let us now prove that theorem 2.3 and theorem 2.2 together imply theorem 2.1: Proof.
We suppose that F_n^c(q0) has exponentially small probability for any fixed q0 > (4/9)/(l − 1), provided l is large enough (see sections 3.1 and 4.2). In section 4 we will show how large l should be, depending on q0 but not on n. The conditions in theorem 2.3 are satisfied when q0 > 0 (and hence any q ≤ q0) is taken small enough. Let us explain why. First note that inequality 2.21 can be rewritten:

H(p) ≥ ln 9 − 2 H(q)/(1 − q).    (2.23)

When q goes to zero, then H(q) also goes to zero and so does 2 H(q)/(1 − q). But we have that H(p) is always less than or equal to ln 9, with equality exactly when all the p_ij are equal to 1/9. Hence, for q0 > 0 small enough, p gets as close as we want to the equiprobable distribution. On the other hand, when q goes to zero and all the p_ij converge to 1/9, the quantity

(p_{l−1,l} + p_{l−1,l+1}) / (p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}) · (1 − q) − p_{l+1,l+1} / (p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1}) · (1 − q) − q

converges to 2/3 − 1/3 = 1/3 > 0. This shows that by taking q0 > 0 small enough (and l large enough) we can guarantee that:
• F_n^c(q0) has exponentially small probability in n.
• The minimizing problem in theorem 2.3 has a strictly positive solution. Call this solution 2ε, where ε > 0.
By theorem 2.3, we then have that inequality 2.22 holds. But since P(F_n^c(q0)) is exponentially small in n, the right-hand side of inequality 2.22 is larger than 1 − exp(−nα) for some constant α > 0 not depending on n, for all n large enough. This implies that condition 2.1 in theorem 2.2 is satisfied. Then theorem 2.2 implies that:

VAR[L_n] = Θ(n).

3 Combinatorial properties of optimal alignments

In this section we describe some of the combinatorial properties that our block model exhibits when looking for optimal alignments. Let us begin with an example:
Example 3.1
Let X = 0011100 and Y = 0001100. The LCS is 001100. This corresponds to the following alignment:

X 0 0 − 1 1 1 0 0
Y 0 0 0 1 1 − 0 0

In this example, the first block of the LCS has length 2. It is obtained from the first block of X and the first block of Y. The first block of X has length 2 whilst the first block of Y has length 3. The length of the first block of the LCS is equal to the minimum of these two numbers. In this kind of situation we say that the first block of X is aligned with the first block of Y. Similarly, the length of the second block of the LCS is the minimum of the lengths of the second blocks of X and of Y. We say that in this alignment the second block of X gets aligned with the second block of Y. Finally, the third block of X gets aligned with the third block of Y to yield the third block of the LCS. In the present example no block of X or Y got left out completely: every block "contributed" some bits to the LCS. All the blocks are aligned one block of X with one block of Y. Each such pair of aligned blocks is responsible for one block in the LCS.
In some other cases, some blocks of X and Y are completely left out. Let us look at such a situation. Example 3.2
Consider X = 00100000111 and Y = 00000100011. The LCS is 000000011. The LCS corresponds to the alignment:

X   0 0 1 0 0 0 − 0 0 − 1 1 1
Y   0 0 − 0 0 0 1 0 0 0 − 1 1
LCS 0 0   0 0 0   0 0     1 1
    (3.2)

In the last example above we have that the second block of X and the second block of Y are totally left out and do not contribute to the LCS. We say that these blocks are left-out blocks. The last block of X and the last block of Y "get aligned" together to yield the last block of the LCS. We say that this is an aligned block pair, or also that these two blocks are aligned one block to one block. One way of thinking about the LCS defined by the alignment 3.2 above is as follows: we first decide which blocks we leave out in X and Y. Then, from the two obtained sequences, we align block by block without leaving out any blocks. So the alignment 3.2 can be seen as the alignment in which we leave out the second block of X and the second block of Y. This gives the modified sequences X* = 0000000111 and Y* = 0000000011. Then we align X* and Y* block by block. The common subsequence we obtain has its i-th block of length equal to the minimum of the lengths of block i of X* and of Y*. In this example we have that the first and the third block of X get aligned with the first and third block of Y. By this we mean that in both sequences the first and third blocks are merged into one block and these merged blocks are then matched. We will be able to prove that, in the case we study here, this is untypical: in an optimal alignment we will only have one block aligned with several at the same time, but not several with several. Let us look at one more example: Example 3.3
Let X = 001001111 and Y = 000011011. The LCS is 00001111. This corresponds to the alignment:

X 0 0 1 0 0 1 1 − 1 1
Y 0 0 − 0 0 1 1 0 1 1

Here the second block of X is left out. Hence the first and the third block of X get aligned with the first block of Y. Similarly, the fourth block of X gets aligned with the second and fourth block of Y. The third block of Y is left out. This situation will happen in optimal alignments: one block aligned with several blocks of the other sequence.
Assume that we know for an alignment a which blocks are left out. Assume that X*, resp. Y*, denotes the modified sequence X, resp. Y, where we left out the specified blocks. Let Z denote a common subsequence defined by the alignment a. The alignment must then align all the blocks of X* with the blocks of Y* one to one, otherwise there would be more left-out blocks. Hence, the first block of X* gets aligned, then the second block of X*, and so on. If the alignment is to stand a chance of being optimal (and hence Z of being an LCS), for each pair of aligned blocks from X* and Y* it needs to extract a maximum of bits. Hence, for every i = 1, 2, …, j we have that the length of block number i of Z must be equal to the minimum of the length of the i-th block of X* and the length of the i-th block of Y* (here j denotes the number of blocks in Z). Hence, since we are interested in LCSs (and hence in optimal alignments), we will only consider alignments defined in the following manner: first we specify exactly which blocks get left out; second, we align the resulting sequences X* and Y* one block with one block. The next lemma says that in our setting an optimal alignment cannot align several blocks with several blocks.
Another useful fact is that for optimal alignments we do not need to consider adjacent left-out blocks, except maybe at the end of the sequences.
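The LCS lengths appearing in examples 3.1 to 3.3 can be double-checked with the textbook dynamic program; a minimal sketch (a numerical sanity check, not part of the original argument):

```python
def lcs_length(x: str, y: str) -> int:
    """Classical O(|x||y|) dynamic program for the LCS length."""
    m, n = len(x), len(y)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[m][n]

# Example 3.1: blocks aligned one to one, no block left out.
assert lcs_length("0011100", "0001100") == len("001100")
# Example 3.2: one block of each sequence completely left out.
assert lcs_length("00100000111", "00000100011") == len("000000011")
# Example 3.3: one block of X aligned with two blocks of Y.
assert lcs_length("001001111", "000011011") == len("00001111")
```

In each case the score of the block-by-block alignment described above (sum of minima of the paired block lengths) attains the LCS length.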
But in section 4 we prove that only a small percentage of bits can be left out at the end of X and Y in an optimal alignment. Hence, the practical implication is that we only need to consider left-out blocks separated by at least one non-left-out block. Let us first explain what we mean by adjacent left-out blocks between aligned blocks:

Example 3.4 Take x = 11001100 and y = 00001100. Let us align as follows:

x − − − − 1 1 0 0 1 1 0 0
y 0 0 0 0 1 1 − − − − 0 0

The second and third blocks of x in the alignment above get left out (i.e. entirely aligned with gaps). These two blocks are adjacent and they are comprised between aligned blocks (i.e. in our example they are comprised between the first and fourth block of x, which are "aligned"; by aligned we mean aligned with another block, hence not entirely aligned with gaps). The next lemma below states that for our LCS problem (i.e. optimal alignments) the kind of situation we face in the current numerical example 3.4 can be discarded. The reason is as follows. In the current example the first block of y gets entirely aligned with gaps. Now instead align the first block of y with the second block of x (so that the first block of x is now the one left out) and keep all the rest of alignment 3.4 identical otherwise. By doing this you have not decreased the score, but you have destroyed the situation of two adjacent completely left-out blocks. The next lemma shows what we explained in our example in a rigorous way: Lemma 3.1
There exists an optimal alignment of X and Y having no adjacent left-out blocks between aligned blocks. Proof.
View an alignment as a finite set of points in N × N, so that if x_i gets aligned with y_j, then (i, j) is a point in the set representing the alignment. Introduce for two alignments a, b ⊂ N × N the order relation a ≤ b iff a contains the same number of points as b and, if we enumerate in both sets the points from bottom left to top right, the i-th point a_i = (a_i^x, a_i^y) of a and the i-th point b_i = (b_i^x, b_i^y) of b satisfy a_i^x ≤ b_i^x and a_i^y ≤ b_i^y for all i ≤ |a|. Here |a| designates the number of points in a. Take now an optimal alignment which is minimal according to the relation ≤. That optimal alignment satisfies the property of not having several adjacent left-out blocks between aligned blocks.
Next we show the relation between the left-out blocks at the end of each sequence and the total number of left-out blocks in each sequence: Lemma 3.2
Let x, y ∈ {0, 1}^n be two sequences of length n. Let the number of blocks of x, resp. y, be denoted by n*_1 = (n/l) + Δ1, resp. n*_2 = (n/l) + Δ2. Assume that |Δ1|, |Δ2| ≤ Δ. Assume also that a is an alignment of x and y which never leaves out adjacent blocks, except maybe a contiguous group at the very end of x and of y. Let δ1 ≥ 0, resp. δ2 ≥ 0, denote the proportion of blocks which are entirely left out at the end of x, resp. y, among all the blocks of x, resp. y. Let q1, resp. q2, denote the proportion of blocks left out in x, resp. in y. Then we find that:

|q1 − q2| ≤ 0.5 |δ1 − δ2| + 4lΔ/n.    (3.5)

Proof.
Let x*, resp. y*, denote the sequence we obtain after removing the blocks which are completely left out by a. Since there are no other completely left-out blocks, the number of blocks in x* must be equal to the number of blocks in y*. Note that every left-out block which has no adjacent left-out block reduces the number of blocks by 2 (the left-out block disappears and its two neighbours merge into one block); for the adjacent left-out blocks at the end, each left-out block reduces the number of blocks by 1. Since there are no adjacent left-out blocks except at the end, the number of blocks of x*, resp. y*, is equal to

n*_1 (1 − 2(q1 − δ1) − δ1),    (3.6)

resp.

n*_2 (1 − 2(q2 − δ2) − δ2).    (3.7)

Setting 3.6 equal to 3.7, substituting n*_i = (n/l) + Δ_i and dividing by 2n/l, we find

q1 − q2 = 0.5 (δ1 − δ2) + (l/(2n)) [ Δ1(1 − 2q1 + δ1) − Δ2(1 − 2q2 + δ2) ].    (3.8)

Since |1 − 2q_i + δ_i| ≤ 2, the bracket is at most 4Δ in absolute value, which gives |q1 − q2| ≤ 0.5 |δ1 − δ2| + 2lΔ/n and hence the bound 3.5, ending the proof. Lemma 3.3
For l > 4, no optimal alignment of X and Y aligns several blocks in X with several blocks in Y. Proof.
Let us explain the idea through an example. Let us take x = 0001111000111100000 and y = 0001111000001110000, two realizations of X and Y respectively, with l = 4. An alignment using all blocks of x and y in block representation becomes:

x   000 1111 000   1111 00000
y   000 1111 00000 111  0000
LCS 000 1111 000   111  0000
    (3.9)

Let us now suppose that we leave out the second block of x and the second block of y. The alignment in block representation then looks like:

x   000000   1111 00000
y   00000000 111  0000
LCS 000000   111  0000
    (3.10)

The score decreased, even though the neighbour blocks of the left-out block in y had all together at least as many symbols (8 zeros all together) as the neighbour blocks of the left-out block in x had all together (6 zeros). In general, we can gain at most 2 new symbols from the neighbour blocks of the left-out block, but we always lose at least l − 1 symbols, a net change of at most 2 − (l − 1) = 3 − l. Then, for blocks of typical length l > 4, aligning several blocks with several blocks decreases the LCS rather than increasing it.

3.1 Maximum number of left out blocks

The first key question is what percentage of blocks are at most left out in an optimal alignment. Since the blocks have length l − 1, l or l + 1 with equal probability 1/3, the average block length is l. Hence, the expected number of blocks in a sequence of length n is about n/l. Now let us define the limit:

γ_l = lim_{n→∞} E[L_n]/n.    (3.11)

Hence, the number of bits in the sequence X (and also in the sequence Y) which are not used for the LCS is about (1 − γ_l)n. Every block we leave out means at least l − 1 unused bits, so the number of left-out blocks is at most about (1 − γ_l)n/(l − 1). This represents typically a proportion of

( (1 − γ_l) n/(l − 1) ) / (n/l) = (1 − γ_l)/(1 − (1/l))

of the total number of blocks. Hence we find that the proportion of left-out blocks in the optimal alignment is typically close to or below the following bound:

(1 − γ_l)/(1 − (1/l)).    (3.12)

Let us next find a simple lower bound for γ_l which we can use in expression 3.12. Assume we choose an alignment which leaves out no blocks. The typical score of such an alignment gives a lower bound for γ_l. In this case the common subsequence defined by such an alignment has its i-th block of length

B_i := min{ B^X_i, B^Y_i },

where B^X_i (resp. B^Y_i) is the length of the i-th block of X (resp. Y). Recall that B^X_i (resp. B^Y_i) has uniform distribution on the set {l − 1, l, l + 1}. The distribution of the minimum above is as follows:

P(B_i = l − 1) = 5/9,  P(B_i = l) = 3/9,  P(B_i = l + 1) = 1/9.

The expected length is thus:

E[B_i] = (5/9)(l − 1) + (3/9) l + (1/9)(l + 1) = l − 4/9.    (3.13)

Since there are about n/l blocks, aligning all the blocks thus gives a score of about

(n/l) · E[B_i] = n (1 − 4/(9l)),

so that we obtain:

γ_l ≥ 1 − 4/(9l).

The last inequality together with the bound 3.12 implies that the proportion of left-out blocks should typically not be much above the following bound:

( 1 − (1 − 4/(9l)) ) / (1 − (1/l)) = (4/9)/(l − 1).

This bound can be improved by estimating γ_l by simulations. As a matter of fact, for any n, E[L_n]/n is a lower bound for γ_l. By Monte Carlo simulation we can find an estimate of E[L_n]/n and hence a very likely lower bound γ_lb. We then replace γ_l in inequality 3.12 by γ_lb.

4 Events holding with high probability

Let δ > 0 and n be given. We will define a number of events related to the combinatorial properties of the optimal alignments, called C_n, D_n(δ), G_n(δ) and J_n(δ). In the following we will prove that these events have high probability for n large. By high probability, we mean a quantity which is exponentially close to one in n. It will turn out that this is true for the above events for any parameter δ > 0 once n is large. Also, we will prove that F_n(q) has high probability for n large in the same sense as above, but restricted to some values of q. All the missing proofs (omitted for reasons of brevity) can be found in detail in [28].
A very useful tool we often use is the Azuma-Hoeffding theorem. The following is a version of it for martingales (for a proof see [29]):

Theorem 4.1 (Hoeffding's inequality) Let (V_n, F_n) be a martingale, and suppose that there exists a sequence a_1, a_2, … of real numbers such that P(|V_n − V_{n−1}| ≤ a_n) = 1 for all n. Then:

P(|V_n − V_0| ≥ v) ≤ 2 exp( −v² / (2 Σ_{i=1}^n a_i²) )    (4.1)

for every v > 0. We will also use a corollary of the above theorem, for some intermediate bounds:
Corollary 4.1
Let a > 0 be a constant and V_1, V_2, … an i.i.d. sequence of bounded random variables such that P(|V_i − E[V_i]| ≤ a) = 1 for every i = 1, 2, …. Then for every Δ > 0 we have that:

P( |(V_1 + ··· + V_n)/n − E[V_1]| ≥ Δ ) ≤ 2 exp( −(Δ²/(2a²)) · n ).    (4.2)

4.1 Number of blocks as renewal process

For k > 0, define the position of the end of the k-th block in X as:

S^X_k = B^X_1 + ··· + B^X_k.

Let us define the number of blocks used in an initial segment of length t of X as:

N^X_t = max{ k > 0 : S^X_k ≤ t }.    (4.3)

Note that there might be at the end of X a block which has length smaller than l − 1. Since this is at most one block, it plays little role and we will not mention it every time, only when it is relevant (the same will apply to Y in what follows). Due to the standard theory of renewal processes, for every k, t > 0:

N^X_t ≥ k ⇔ S^X_k ≤ t.    (4.4)

In the same way we define for Y the variables

S^Y_k = B^Y_1 + ··· + B^Y_k,  N^Y_t = max{ k > 0 : S^Y_k ≤ t },

where again the relation N^Y_t ≥ k ⇔ S^Y_k ≤ t holds for every k, t > 0. Let C_n be the event that the number of blocks in X and in Y lies in the interval

I_n := [ n/l − n^{0.6}, n/l + n^{0.6} ].

Lemma 4.1
There exists a constant b > 0 depending on l such that: P(C_n^c) ≤ e^{−b·n^{0.2}} for every n large enough. Proof.
It is easy to see that:

C_n = { N^X_n ∈ I_n } ∩ { N^Y_n ∈ I_n }.    (4.5)

By symmetry it is sufficient to bound P({N^X_n ∈ I_n}^c) directly:

P({N^X_n ∈ I_n}^c) ≤ P( N^X_n ≤ n/l − n^{0.6} ) + P( N^X_n ≥ n/l + n^{0.6} ).    (4.6)

Now let us bound each expression separately. Let m := ⌈n/l − n^{0.6}⌉ be an auxiliary variable. We have to begin with:

P( N^X_n ≤ n/l − n^{0.6} ) ≤ P( N^X_n ≤ m )
  = P( S^X_m ≥ n )    (by using N^X_t ≥ k ⇔ S^X_k ≤ t)
  = P( S^X_m/m − l ≥ n/m − l )
  ≤ 2 exp( −(m/2)(n/m − l)² )    (4.7)

by 4.2 with P(|B^X_1 − l| ≤ 1) = 1. Now we need a lower bound on n/m − l in order to get the right order for the moderate deviations. Using m ≤ n/l − n^{0.6} + 1:

n/m − l ≥ nl/(n − l n^{0.6} + l) − l = l²(n^{0.6} − 1)/(n − l n^{0.6} + l) ≥ (l²/n)(n^{0.6} − 1) = (l²/n^{0.4})(1 − n^{−0.6}).    (4.8)

We have lim_{n→∞}(1 − n^{−0.6}) = 1, so for n large enough the right-hand side of 4.8 is larger than l²/(2n^{0.4}), so that:

(n/m − l)² ≥ l⁴/(4 n^{0.8}).    (4.9)

Also, for n large enough:

m = ⌈n/l − n^{0.6}⌉ ≥ n/l − n^{0.6} ≥ n/(2l).    (4.10)

Finally we can use 4.9 and 4.10 in 4.7 to get:

P( N^X_n ≤ n/l − n^{0.6} ) ≤ 2 exp( −(m/2)(n/m − l)² ) ≤ 2 exp( −(n/(4l)) · (l⁴/(4 n^{0.8})) ) = 2 exp( −(l³/16) n^{0.2} )    (4.11)

for n large enough. For the other term, let m := ⌊n/l + n^{0.6}⌋ be an auxiliary variable and proceed as before. We have to begin with:

P( N^X_n ≥ n/l + n^{0.6} ) ≤ P( N^X_n ≥ m )
  = P( S^X_m ≤ n )    (by using N^X_t ≥ k ⇔ S^X_k ≤ t)
  = P( S^X_m/m − l ≤ n/m − l )
  ≤ 2 exp( −(m/2)(l − n/m)² ).    (4.12)

Now we bound l − n/m. Using m ≥ n/l + n^{0.6} − 1 and m ≤ n/l + n^{0.6}:

l − n/m = (lm − n)/m ≥ l(n^{0.6} − 1)/m ≥ l²(n^{0.6} − 1)/(n + l n^{0.6}) ≥ l²/(2 n^{0.4})    (4.13)

for n large enough, so that again (l − n/m)² ≥ l⁴/(4 n^{0.8}). Also:

m = ⌊n/l + n^{0.6}⌋ ≥ n/l.    (4.14)

Finally we can use 4.13 and 4.14 in 4.12 to get:

P( N^X_n ≥ n/l + n^{0.6} ) ≤ 2 exp( −(m/2)(l − n/m)² ) ≤ 2 exp( −(n/(2l)) · (l⁴/(4 n^{0.8})) ) = 2 exp( −(l³/8) n^{0.2} ).    (4.15)

Then, combining 4.6, 4.11 and 4.15, we obtain:

P({N^X_n ∈ I_n}^c) ≤ 4 exp( −(l³/16) n^{0.2} ),

and by symmetry we finally get:

P(C_n^c) ≤ 8 exp( −(l³/16) n^{0.2} ) ≤ e^{−b·n^{0.2}}

for every n large enough, with, for instance, b = l³/32. Let F_n(q) denote the already defined event that any optimal alignment of X and Y leaves out at most a proportion q of blocks in X as well as in Y. Lemma 4.2
For any q satisfying q > (4/9)/(l − 1), there exists β > 0 such that: P(F_n^c(q)) ≤ e^{−βn} for all n large enough. Note that here q does not depend on n, and β > 0 does not depend on n but on l and q. Lemma 4.3
For every sufficiently small δ > 0 there exists a constant b > 0, depending on δ and l but not on n, such that: P(K_n^c(δ)) ≤ e^{−b·n} for n large enough.

4.2 Proportions of block lengths in X and Y

Let X_m be the sequence X_∞ taken up to the m-th block. Similarly, let Y_m be the sequence Y_∞ taken up to the m-th block. Let D_m(δ) be the event that the proportions of blocks in X_m and Y_m of length l − 1, l and l + 1 are not further from 1/3 than δ. Let D_n(δ) be the event:

D_n(δ) = ∩_{m ∈ I_n} D_m(δ),

where we defined the interval I_n = [ n/l − n^{0.6}, n/l + n^{0.6} ]. Lemma 4.4
For every δ > 0 we have that:

P(D_n^c(δ)) ≤ n^{0.6} ( 1/(1 + 3δ) )^{n(1+3δ)/(2l)}

for n large enough. Recall that N^X_n (resp. N^Y_n) is the number of blocks in X (resp. in Y) having lengths in {l − 1, l, l + 1}, as in expression 4.3. Let G_n(δ) be the event that the following inequality holds:

| N^Y_n / N^X_n − 1 | ≤ δ.

Lemma 4.5
For every δ > 0 there exist two constants b_1, b_2 > 0, depending on l and on δ, such that: P(G_n^c(δ)) ≤ e^{−b_1·n^{0.2}} + 2e^{−b_2·n^{0.2}} for every n large enough.

Let J_n(δ) denote the event that the proportion of left-out blocks at the end of X or Y in any optimal alignment is at most a proportion δ of the total number of blocks in each of these sequences. As with all the events before, we want to prove that J_n(δ) has high probability for every δ > 0 once n is large enough. We need an extra definition and a preliminary lemma in order to show the high probability of J_n(δ). For an integer number s ∈ [1, n] we denote:

L_s := |LCS( X_1 X_2 ··· X_s, Y_1 Y_2 ··· Y_n )|.    (4.16)

Lemma 4.6
Given δ > 0, there exists a constant c* > 0 not depending on n but on δ such that:

E[L_n − L_{n−δn}] ≥ c* · n    (4.17)

for every n large enough.

Proof. Given n > 0 and t ∈ [−1, 1], let us define the number γ(t, n) > 0 as:

γ(t, n) := E[ |LCS( X_1 ··· X_{n+nt}, Y_1 ··· Y_{n−nt} )| ] / n.

This number γ(t, n) is a kind of extension of the Chvatal-Sankoff constant γ (see [1]), or, more precisely in the case of our paper, an extension of γ_l defined in expression 3.11. An extended motivation for this definition can be found in [30]. For any fixed t ∈ [−1, 1] it is known that γ(t, n) converges as n → ∞ (see [15] or [30]); let us denote that limit by

γ(t) := lim_{n→∞} γ(t, n).

The speed of convergence to that limit is also known, due to theorem 2.1 in [15]. This theorem says that there exists θ > 0 not depending on n such that:

|γ(t, n) − γ(t)| ≤ θ ln(n)/√n    (4.18)

for any fixed t ∈ [−1, 1], provided n is large enough. Also, the map t ∈ [−1, 1] ↦ γ(t) ∈ [0, 1] is concave and symmetric about the origin (see [30]). Hence, for every t ∈ [−1, 1] we have

γ(0) ≥ γ(t).    (4.19)

Let us set an auxiliary variable n* as follows:

n* := n (1 − δ/2),

and let t* := −δ/(2 − δ), so that n*(1 + t*) = n − δn and n*(1 − t*) = n. Note that with the last definition, the inequality

n ln(n)/√n = √n ln(n) ≥ √(n*) ln(n*) = n* ln(n*)/√(n*)    (4.20)

holds, √· and ln(·) being increasing functions. By using the previous definitions, the inequality 4.18 for the speed of convergence, the concavity inequality 4.19 and inequality 4.20 (in this order), we can write:

E[L_n − L_{n−δn}] = n γ(0, n) − n* γ(t*, n*)
  ≥ n ( γ(0) − θ ln(n)/√n ) − n* ( γ(t*) + θ ln(n*)/√(n*) )
  ≥ n ( γ(0) − θ ln(n)/√n ) − n* ( γ(0) + θ ln(n*)/√(n*) )
  = (n − n*) γ(0) − θ ( n ln(n)/√n + n* ln(n*)/√(n*) )
  ≥ (n − n*) γ(0) − 2θ n ln(n)/√n
  = n δ γ(0)/2 − 2θ n ln(n)/√n
  = ( δ γ(0)/2 − 2θ ln(n)/√n ) n
  ≥ ( δ γ(0)/4 ) n    (4.21)

where the very last inequality above holds for n large enough, since

lim_{n→∞} 2θ ln(n)/√n = 0 < δ γ(0)/4.

To finish the proof we take c* = δ γ(0)/4. Now comes the main result of this section, which establishes the high probability of the event J_n(δ): Proposition 4.1
For every δ > 0, there exists a constant θ > 0 not depending on n but on δ such that: P(J_n^c(δ)) ≤ e^{−θ·n} for every n large enough. Proof.
With the notation as in 4.16 we write:

P(J_n^c(δ)) ≤ P( |LCS( X_1 ··· X_{n−δn}, Y_1 ··· Y_n )| − L_n ≥ 0 )
  = P( L_{n−δn} − L_n ≥ 0 )
  = P( L_{n−δn} − L_n − E[L_{n−δn} − L_n] ≥ E[L_n − L_{n−δn}] ).    (4.22)

Let us define

M_n(δ) := L_{n−δn} − L_n − E[L_{n−δn} − L_n].

It is not difficult to see that M_n(δ) is a martingale with respect to the filtration F_n = σ{ (X_k, Y_k) : k ≤ n } and that M_0 = 0. The following inequality also holds: |M_n(δ) − M_{n−1}(δ)| ≤ 4. Given δ > 0, we can use Hoeffding's inequality 4.1 with a_i = 4 and v = E[L_n − L_{n−δn}] to estimate:

P( L_{n−δn} − L_n − E[L_{n−δn} − L_n] ≥ v ) ≤ 2 exp( −v²/(32n) ) = 2 exp( −n ( E[L_n − L_{n−δn}]/n )² / 32 ) ≤ 2 exp( −((c*)²/32) · n )

(by 4.17 and c* from lemma 4.6). Taking θ = (c*)²/32, and absorbing the factor 2 for n large by slightly decreasing θ, ends the proof.

4.6 Optimal events

In theorem 2.3, q represents the proportion of left-out blocks in X and in Y. In reality, typically, the proportion of left-out blocks in X will not be exactly equal to the proportion of left-out blocks in Y. Because of this, q1 will designate the proportion of left-out blocks in X and q2 the proportion of left-out blocks in Y. We will have that q1 can be made as close to q2 as we want by taking n large. Now we need to rewrite all the conditions of theorem 2.3 with q1 and q2 instead of q. Let us define the following events:

• Given any m1, m2, q1, q2, let E_{m1,m2,q1,q2}(ε) denote the event that there is no optimal alignment of X_{m1} with Y_{m2} leaving out a proportion of q1 blocks in X_{m1} and a proportion of q2 blocks in Y_{m2} and such that:

H(q1) + H(q2) + (1 − max{q1 + 3q2, q2 + 3q1}) (ln(1/9) + H(p)) ≤ −ε.    (4.23)

• Let E_n(ε) be the event:

E_n(ε) = ∩_{m1,m2 ∈ I_n, q1,q2} E_{m1,m2,q1,q2}(ε).    (4.24)

If δ designates the difference between q1 and q2, consider the system obtained from the minimizing problem of theorem 2.3 by replacing q with q1 and q2 and adding to the objective 2.18 correction terms of order δ, under the additional constraints

|q1 − q2| ≤ δ,  H(q1) + H(q2) + (1 − max{q1 + 3q2, q2 + 3q1}) (ln(1/9) + H(p)) ≥ 0.    (4.25)

This system approaches the system of theorem 2.3 as δ goes to zero (and q1 is as close to q2 as we want by taking n large). Note also that, replacing q1 and q2 by q and taking δ = 0, the minimized function and the last inequality of 4.25 become equal to 2.18 and 2.21, respectively. If the minimizing problem in theorem 2.3 has a strictly positive solution 2ε0 and if the perturbed version of expression 2.18 is less than ε0, this implies that the left-hand side of 2.21 must be smaller than −ε2 for some ε2 > 0, provided δ is small enough. Lemma 4.7
Assume there exist 0 < q0 < 1/3 and ε0 > 0 such that, for all {p_ij}_{i,j} and q ∈ [0, q0] satisfying all the conditions 2.19, 2.20 and 2.21 in theorem 2.3, expression 2.18 is larger than or equal to 2ε0 (in other words, the condition that the minimizing problem in theorem 2.3 has a strictly positive solution 2ε0 is satisfied). Then there exist ε1 > 0, ε2 > 0 and δ0 > 0 such that, for all {p_ij}_{i,j ∈ {l−1,l,l+1}}, q1, q2 ∈ [0, q0] and δ ∈ [0, δ0] satisfying 2.19 and 2.20, if |q1 − q2| ≤ δ and if

the perturbed version of expression 2.18 (with q replaced by q1 and q2 and the correction terms of order δ from 4.25) is at most ε1,    (4.26)

then

H(q1) + H(q2) + (1 − max{q1 + 3q2, q2 + 3q1}) (ln(1/9) + H(p)) ≤ −ε2.    (4.27)

Proof.
We are going to prove this by contradiction (reductio ad absurdum). Assume that for all (p_ij)_{i,j ∈ I} and q ∈ [0, q0] satisfying all the conditions 2.19, 2.20 and 2.21 in theorem 2.3, expression 2.18 is larger than or equal to 2ε0, but that the rest of the lemma does not hold. Then for every δ0 > 0 there exists a vector

p⃗ := ( p_{l−1,l−1}, p_{l−1,l}, …, p_{l+1,l+1}, q1, q2, δ )

such that the components satisfy |q1 − q2| ≤ δ, the components of p⃗ satisfy 2.19 and 2.20 whilst inequality 4.26 is satisfied, and we can take the expression

H(q1) + H(q2) + (1 − max{q1 + 3q2, q2 + 3q1}) (ln(1/9) + H(p))    (4.28)

as close to zero as we want. Hence there exists a sequence p⃗(1), p⃗(2), …, p⃗(t), … of such vectors, with notation

p⃗(t) := ( p_{l−1,l−1}(t), p_{l−1,l}(t), …, p_{l+1,l+1}(t), q1(t), q2(t), δ(t) ),

so that for each t ∈ N the vector p⃗(t) satisfies all the conditions 2.19, 2.20 and 4.26, whilst

lim_{t→∞} |q1(t) − q2(t)| = 0

and expression 4.28 converges to zero as t goes to infinity. The vectors p⃗(t) are contained in a bounded, hence compact, domain. This implies that there exists a converging subsequence: there is an increasing map π: N → N so that p⃗(π(t)) converges as t goes to infinity. Let the limit be denoted by

p⃗ := ( p_{l−1,l−1}, p_{l−1,l}, …, p_{l+1,l+1}, q1, q2, 0 ).

We have that q1 = q2, so let us denote q1 = q2 = q. We find that our limit satisfies the conditions 2.19 and 2.20. Furthermore, at the limit, expression 4.28 becomes equal to zero; replacing q1 and q2 in 4.28 by q, we find that condition 2.21 is satisfied. Finally, since for our sequence p⃗(π(t)) inequality 4.26 is satisfied, by continuity it must also be satisfied for the limit. Hence, noting that at the limit q1 = q2 = q and δ = 0, we get that expression 2.18 is less than or equal to ε0. This contradicts our assumption, since our limit vector satisfies all the conditions 2.19, 2.20 and 2.21 and should thus make expression 2.18 larger than or equal to 2ε0. Hence, when all the conditions 2.19 and 2.20 are satisfied and δ goes to zero, the quantity in 4.26 is bounded away from zero; this means that for δ0 > 0 and ε1 > 0 small enough, inequality 4.27 holds with some ε2 > 0.
Let us now show that the event E_{m1,m2,q1,q2}(ε) holds with high probability. Lemma 4.8
Assume that there exist 0 < q0 < 1/3 and ε0 > 0 such that, for all {p_ij}_{i,j} and q ∈ [0, q0] satisfying all the conditions 2.19, 2.20 and 2.21 in theorem 2.3, expression 2.18 is larger than or equal to 2ε0. Then, for every ε > 0 there exist a polynomial w(n) > 0 and a constant ϑ > 0, both depending only on l, such that: P(E_n^c(ε)) ≤ w(n) e^{−ϑ·n} for every n large enough. Proof.
Let a denote an alignment of X^{m_x} and Y^{m_y}. Hence a consists of two binary vectors a = (a_X, a_Y), the first one having length m_x and the second one having length m_y, so that a_X ∈ {0,1}^{m_x} and a_Y ∈ {0,1}^{m_y}. When the i-th entry a_{X,i} of a_X is a 1, the i-th block of X^{m_x} is discarded (entirely aligned with gaps) by the alignment a; otherwise the i-th block of X^{m_x} is not discarded. Similarly, when a_{Y,i} = 1 the i-th block of Y^{m_y} is discarded. Here we use the same way of defining an alignment as explained in the first section: we specify which blocks get entirely discarded and then align the remaining blocks block by block. Doing so, and assuming that the alignment a is not random, we get that the aligned block pairs are i.i.d. For the lengths of an aligned block pair there are nine possibilities, each having the same probability 1/9. Hence, given the alignment a, the empirical frequencies of the aligned block pair lengths follow a multinomial distribution. Let p = {p_ij}_{i,j ∈ {l-1,l,l+1}} be a (non-random) probability distribution. Let E_a(p) denote the event that the empirical distribution of the block pairs aligned by a is not p. From what we said, the probability P(E^c_a(p)) is equal to the probability that a 9-nomial variable with parameter m* and all cell probabilities equal to 1/9 produces exactly the counts m*p, where m* designates the number of block pairs aligned by a. Hence we get

P(E^c_a(p)) = (m*; m*p_{l-1,l-1}, m*p_{l-1,l}, ..., m*p_{l+1,l+1}) (1/9)^{m*},   (4.29)

where

(a; a_1, ..., a_k) = a! / (a_1! · · · a_k!)

is the multinomial coefficient. Let us define

B(p) := (m*; m*p_{l-1,l-1}, m*p_{l-1,l}, ..., m*p_{l+1,l+1}),
M(p) := Π_{i,j} p_ij^{p_ij},
H(p) := Σ_{i,j} p_ij ln(1/p_ij) = ln(1/M(p)).   (4.30)

Note that B(p)·(M(p))^{m*} is the probability that a multinomial random variable with parameters m* and probability vector (p_{l-1,l-1}, ..., p_{l+1,l+1}) takes exactly the value (m*p_{l-1,l-1}, ..., m*p_{l+1,l+1}). Hence

B(p)·(M(p))^{m*} ≤ 1.   (4.31)

Then, by using 4.31 we can bound expression 4.29 as follows:

B(p)(1/9)^{m*} = B(p)·(M(p))^{m*} · ((1/9)/M(p))^{m*} ≤ ((1/9)/M(p))^{m*} = exp([ln(1/9) + H(p)] m*).   (4.32)

On the other hand, we have at least (1 − max{q_x + 3q_y, q_y + 3q_x}) min{m_x, m_y} aligned block pairs. Let us give an intuition for this. First, when blocks are aligned one to one, each aligned pair contributes the minimum of the two block lengths to the LCS; if all blocks of X and Y were aligned one to one, we would get min{m_x, m_y} aligned block pairs. Second, a proportion q_x of the blocks of X is left out entirely; moreover, since we cannot align two adjacent blocks of X with the same block of Y, between two blocks of X aligned with the same block of Y there is at least one left-out block of X, so at most 2q_x·m_x blocks of X share a block of Y with another block of X. Third, a proportion q_y of the blocks of Y is left out. In total, in the worst case, looking first at the blocks of X we lose at most (3q_x + q_y) min{m_x, m_y} aligned pairs; similarly, looking first at Y, we lose at most (3q_y + q_x) min{m_x, m_y}. Hence there are at least (1 − max{q_x + 3q_y, q_y + 3q_x}) min{m_x, m_y} aligned block pairs. Since m_x, m_y ∈ I_n, this gives the lower bound

m* ≥ (1 − max{q_x + 3q_y, q_y + 3q_x}) · ((n/l) − n^{0.9}),   (4.33)

and hence, together with the bound 4.32, we obtain

P(E^c_a(p)) ≤ exp( (ln(1/9) + H(p))(1 − max{q_x + 3q_y, q_y + 3q_x})((n/l) − n^{0.9}) ).   (4.34)

Let A_{m_x,m_y,q_x,q_y} denote the set of all alignments aligning X^{m_x} with Y^{m_y} and leaving out a proportion q_x of the blocks of X^{m_x} and a proportion q_y of the blocks of Y^{m_y}. In other words, A_{m_x,m_y,q_x,q_y} is the set of all elements a = (a_X, a_Y) of {0,1}^{m_x} × {0,1}^{m_y} for which |a_X| = q_x m_x and |a_Y| = q_y m_y. Let P_{ε,q_x,q_y} denote the set of those distributions p (for aligned block pairs, hence on the space Ω = {(l-1,l-1), (l-1,l), ..., (l+1,l+1)}) for which inequality 4.23 is satisfied and which are possible in our case. Before we continue with the proof, let us look at an example:

Example 4.1
Assume we look at binary strings of length 5. Then there can be 0, 1, 2, 3, 4 or 5 ones. Hence, the empirical distribution of the symbol one, when we flip a coin exactly five times, can only be 0, 1/5, 2/5, 3/5, 4/5 or 1. In general, for a string of length n over k symbols there are no more than (n+1)^{k−1} possible empirical distributions (see [?], Lemma 2.1.2 (a)).

In our case we have an empirical distribution for m* aligned block pairs. For each block pair there are 9 possibilities. Hence there are no more than (m* + 1)^8 possible empirical distributions. However, m* is not known; it could potentially take any value between 0 and (n/l) + n^{0.9}. Hence, for the number of empirical distributions we need to consider, we get the upper bound

((n/l) + n^{0.9}) · ((n/l) + n^{0.9} + 1)^8 ≤ ((n/l) + n^{0.9} + 1)^9.

Let us continue with the proof. We have that

∩_{a ∈ A_{m_x,m_y,q_x,q_y}, p ∈ P_{ε,q_x,q_y}} E_a(p) = E_{m_x,m_y,q_x,q_y}(ε)

and hence

P(E^c_{m_x,m_y,q_x,q_y}(ε)) ≤ Σ_{a ∈ A_{m_x,m_y,q_x,q_y}, p ∈ P_{ε,q_x,q_y}} P(E^c_a(p)).   (4.35)

By using 4.34, inequality 4.35 becomes

P(E^c_{m_x,m_y,q_x,q_y}(ε)) ≤ Σ_{a, p} exp( (ln(1/9) + H(p))(1 − max{q_x + 3q_y, q_y + 3q_x})((n/l) − n^{0.9}) ).

Note that the number of alignments considered in the sum on the right-hand side can be bounded as follows:

|A_{m_x,m_y,q_x,q_y}| = C(m_x, q_x m_x) · C(m_y, q_y m_y) ≤ (q_x^{−q_x}(1−q_x)^{−(1−q_x)})^{m_x} (q_y^{−q_y}(1−q_y)^{−(1−q_y)})^{m_y} = exp(H(q_x) m_x + H(q_y) m_y) ≤ exp((H(q_x) + H(q_y)) max{m_x, m_y}) ≤ exp( (H(q_x) + H(q_y))((n/l) + n^{0.9}) ),   (4.36)

where for i = x, y:

H(q_i) := q_i ln(1/q_i) + (1 − q_i) ln(1/(1 − q_i)).

The number of distributions in P_{ε,q_x,q_y} we need to consider is (as explained above) less than or equal to ((n/l) + n^{0.9} + 1)^9. Combining all of this, we find that P(E^c_{m_x,m_y,q_x,q_y}(ε)) is less than or equal to

exp((H(q_x) + H(q_y))((n/l) + n^{0.9})) · b · exp( (ln(1/9) + H(p))(1 − max{q_x + 3q_y, q_y + 3q_x})((n/l) − n^{0.9}) ),

where b := ((n/l) + n^{0.9} + 1)^9. In other words, we found that

P(E^c_{m_x,m_y,q_x,q_y}(ε)) ≤ b exp( (n/l)[H(q_x) + H(q_y) + (ln(1/9) + H(p))(1 − max{q_x + 3q_y, q_y + 3q_x}) + r] ),   (4.37)

where the rest term r is equal to

r = l n^{−0.1} (H(q_x) + H(q_y) − (ln(1/9) + H(p))(1 − max{q_x + 3q_y, q_y + 3q_x})),

being bounded as follows:

|r| ≤ l n^{−0.1} (|H(q_x)| + |H(q_y)| + |ln(1/9)| + |H(p)|) ≤ l n^{−0.1} (2 ln 2 + 2 ln 9),

since H(q_i) ≤ ln 2 and H(p) ≤ ln 9. Note that r is thus bounded from above by a constant times l n^{−0.1}, where the constant does not depend on q_x, q_y, p. Hence, for n large enough,

r ≤ ε/2.   (4.38)

We have p ∈ P_{ε,q_x,q_y}, hence p satisfies inequality 4.23. This implies that in the bound 4.37 we can assume that inequality 4.23 holds. This then implies

P(E^c_{m_x,m_y,q_x,q_y}(ε)) ≤ b exp( (n/l)(−ε + r) ).   (4.39)

Assuming now that 4.38 holds, we obtain

P(E^c_{m_x,m_y,q_x,q_y}(ε)) ≤ b exp( −(n/l)(ε/2) ).   (4.40)

Note that the bound on the right side of the last inequality above is negatively exponentially small in n, since b is only polynomial in n. Using equation 4.24, we obtain

P(E^c_n(ε)) ≤ Σ_{m_x,m_y ∈ I_n, q_x,q_y} P(E^c_{m_x,m_y,q_x,q_y}(ε)).

Applying inequality 4.40 to the last inequality above, we obtain

P(E^c_n(ε)) ≤ Σ_{m_x,m_y ∈ I_n, q_x,q_y} b exp( −(n/l)(ε/2) ).   (4.41)

Note that when m_x is given, the number of possibilities for the number of left-out blocks of X^{m_x} is at most m_x; hence, for given m_x, the proportion q_x can take at most m_x values, and similarly q_y can take at most m_y values for given m_y. But m_x and m_y are less than (n/l) + n^{0.9}. Also, both m_x and m_y lie in I_n, hence each can take at most 2n^{0.9} values. This implies that the number of terms in the sum 4.41 is bounded above by

((n/l) + n^{0.9})^2 · (2n^{0.9})^2.

This upper bound applied to inequality 4.41 yields

P(E^c_n(ε)) ≤ b ((n/l) + n^{0.9})^2 (2n^{0.9})^2 exp( −(n/l)(ε/2) ),   (4.42)

which is the negative exponential upper bound we were looking for: it has the form w(n)e^{−ϑn} with the polynomial w(n) := b((n/l) + n^{0.9})^2(2n^{0.9})^2 and ϑ := ε/(2l).
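The two counting estimates driving this proof, the multinomial bound 4.31–4.32 and the entropy bound on binomial coefficients used in 4.36, can be checked numerically. The sketch below is illustrative only: the nine cell counts and the pair (m, k) are arbitrary choices, not values coming from the model.

```python
import math

def entropy(ps):
    # H(p) = sum_i p_i ln(1/p_i), with the convention 0 * ln(1/0) = 0
    return sum(p * math.log(1.0 / p) for p in ps if p > 0)

def multinomial_coef(counts):
    # B = m! / (a_1! ... a_k!) with m = a_1 + ... + a_k
    coef = math.factorial(sum(counts))
    for c in counts:
        coef //= math.factorial(c)
    return coef

# Nine cells, as for the nine possible aligned block pair lengths;
# the counts below are an arbitrary illustration with m* = 27.
counts = [4, 2, 3, 1, 5, 2, 2, 4, 4]
m_star = sum(counts)
ps = [c / m_star for c in counts]      # empirical distribution p
B = multinomial_coef(counts)

# Inequality (4.31): B(p) * M(p)^{m*} <= 1, since it is a multinomial
# probability (note M(p)^{m*} = exp(-H(p) m*)).
assert B * math.exp(-entropy(ps) * m_star) <= 1.0 + 1e-12

# Bound (4.32): B(p) (1/9)^{m*} <= exp((ln(1/9) + H(p)) m*).
prob = B * (1.0 / 9.0) ** m_star
bound = math.exp((math.log(1.0 / 9.0) + entropy(ps)) * m_star)
assert prob <= bound * (1.0 + 1e-12)

# Bound used in (4.36): C(m, qm) <= exp(H(q) m) with q = k/m.
m, k = 50, 13
assert math.comb(m, k) <= math.exp(entropy([k / m, 1 - k / m]) * m)
```

Both inequalities are exact consequences of the algebra in 4.31–4.32 and 4.36, so the assertions hold for any choice of counts.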
4.7 Positive expected change in the score

Let us recall the events that we have proven to hold with high probability:

• C_n is the event that the number of blocks in X and in Y lies in the interval I_n = [n/l − n^{0.9}, n/l + n^{0.9}].

• D_n(δ) is the intersection D_n(δ) = ∩_{m ∈ I_n} D_m(δ), where D_m(δ) is the event that the proportions of blocks of X^m and of Y^m of length l−1, l and l+1 are not further than δ from 1/3. Here X^m (resp. Y^m) denotes the sequence X^∞ (resp. Y^∞) taken up to the m-th block.

• F_n(q_0) is the event that any optimal alignment of X and Y leaves out at most a proportion q_0 of blocks in X as well as in Y.

• G_n(δ) is the event that the following inequality holds:

N^Y_n / N^X_n ≤ 1 + δ,

where N^X_n (resp. N^Y_n) is the number of blocks in X (resp. in Y) having length in {l−1, l, l+1}.

• E_n(ε) is the intersection

E_n(ε) = ∩_{m_x,m_y ∈ I_n; q_x,q_y ∈ [0,1]} E_{m_x,m_y,q_x,q_y}(ε),

where E_{m_x,m_y,q_x,q_y}(ε) is the event that there is no optimal alignment of X^{m_x} with Y^{m_y} leaving out a proportion q_x of blocks in X^{m_x} and a proportion q_y of blocks in Y^{m_y} and such that

H(q_x) + H(q_y) + (1 − max{q_x + 3q_y, q_y + 3q_x})(ln(1/9) + H(p)) ≤ −ε,

where ε > 0 and the constants δ and q_0 come from lemma 4.7, X^{m_x} (resp. Y^{m_y}) denotes the sequence X^∞ (resp. Y^∞) taken up to the m_x-th (resp. m_y-th) block, and H(p) denotes the entropy, as in 4.30, of the empirical distribution of the block pairs aligned by an alignment.

We can now formulate our combinatorial lemma based on those events:

Lemma 4.9
Let us consider the constants q_0, δ and ε from lemma 4.7. Assume that C_n, D_n(δ), F_n(q_0), G_n(δ) and E_n(ε) all hold. Then we have that

E[L̃_n − L_n | X, Y] ≥ ε.

Proof. For any x, y ∈ {0,1}^n, let L(x,y) denote the length of the LCS of x and y. Let now x, y ∈ {0,1}^n be any two realizations such that, if X = x and Y = y, the events C_n, D_n(δ), F_n(q_0) and E_n(ε) all hold. Let a be a left-most optimal alignment of x and y. Let x̃ denote the sequence x on which we performed our random changes. That is, x̃ is obtained by selecting at random a block of length l−1 and increasing it to length l, and also selecting at random a block of length l+1 and reducing it to length l. Let x* be the sequence we obtain by applying to x only the first of the two random changes; that is, x* is obtained by increasing to length l a randomly chosen block of x of length l−1. So we start with x, apply the first change to obtain x*, and then in x* choose a block of length l+1 at random and decrease it by one unit to obtain x̃.

For all i, j ∈ {l−1, l, l+1}, let p_ij denote the proportion of aligned block pairs with lengths (i,j) in the alignment a of x and y. Let q_x, resp. q_y, denote the proportion of blocks not aligned by a in x, resp. in y. Let p^I_{l−1} denote the proportion, among all blocks of x of length l−1, of blocks which get aligned by a one block to one block. Let p^II_{l−1} denote the proportion, among all blocks of x of length l−1, of blocks which are aligned with several blocks of y. Finally, let p^III_{l−1} denote the proportion, among all blocks of x of length l−1, of blocks which are left out or which are, together with other blocks of x, aligned with the same block of y. Note that when we increase by one unit a block in this third category, then in general the score does not increase. On the other hand, if the block of x of length l−1 is aligned one to one with a block of length l or l+1, the score is going to increase. Let G_{l−1,I} be the event that the block of length l−1 chosen at random is aligned one block to one block. Then

P(L(x*, y) − L(x, y) = 1 | G_{l−1,I}) ≥ (p_{l−1,l} + p_{l−1,l+1}) / (p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}).

Note that by only adding a bit the score cannot decrease, so that the last inequality above means

E[L(x*, y) − L(x, y) | G_{l−1,I}] ≥ (p_{l−1,l} + p_{l−1,l+1}) / (p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}).   (4.43)

When the block of length l−1 is aligned with several blocks of y at the same time, we always observe an increase of one unit. This yields

E[L(x*, y) − L(x, y) | G_{l−1,II}] = 1,   (4.44)

where G_{l−1,II} denotes the event that the chosen block of length l−1 is aligned with several blocks of y. By the law of total probability we thus find

E[L(x*, y) − L(x, y)] ≥ P(G_{l−1,I}) (p_{l−1,l} + p_{l−1,l+1})/(p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}) + P(G_{l−1,II}) ≥ (1 − P(G_{l−1,III})) (p_{l−1,l} + p_{l−1,l+1})/(p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}),

where G_{l−1,III} denotes the event that the chosen block of length l−1 is left out or is aligned with a block of y at the same time as other blocks of x. The last inequality above yields

E[L(x*, y) − L(x, y)] ≥ (1 − p^III_{l−1}) (p_{l−1,l} + p_{l−1,l+1})/(p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}).   (4.45)

Note that the proportion of left-out blocks of x is q_x. There cannot be two adjacent blocks of x aligned with the same block of y (this is so because a is an optimal left-most alignment, see lemma 3.1). So between blocks of x aligned with the same block of y, there is at least one left-out block of x.
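The gain in the one-block-to-one-block case can be seen on a tiny instance (a sketch with made-up binary strings, taking l = 3, so block lengths lie in {2, 3, 4}): growing a block of length l−1 that is matched one to one with a strictly longer block raises the LCS by exactly one unit.

```python
def lcs_len(x, y):
    # standard O(|x||y|) dynamic program for the length of the LCS
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

# l = 3: x has blocks of lengths (2, 3), y has blocks of lengths (3, 3).
x, y = "00111", "000111"
assert lcs_len(x, y) == 5          # the length-2 block is matched to a length-3 block

# Growing the length-2 block of x by one unit raises the score by one,
# as in the one-block-to-one-block case of the proof:
x_star = "000111"
assert lcs_len(x_star, y) - lcs_len(x, y) == 1
```

The same dynamic program also shows that shrinking a block matched to an equally long block loses one unit, which is the loss term handled below.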
Hence the proportion of blocks of x which are aligned, at the same time as other blocks of x, with the same block of y cannot exceed twice the proportion of left-out blocks of x. This yields an upper bound equal to 2q_x. This is a proportion among all blocks of x, but we are interested in the proportion among the blocks of length l−1 of x; so we get the upper bound 2q_x/p_{l−1}, where p_{l−1} is the proportion of blocks of x which have length l−1. Adding the blocks of x which are left out, we get

p^III_{l−1} ≤ 3q_x / p_{l−1}.   (4.46)

By D_n(δ), we have p_{l−1} ≥ (1/3) − δ, so that together with 4.46 we obtain

p^III_{l−1} ≤ 3q_x / ((1/3) − δ).

Using the above inequality in 4.45 we obtain

E[L(x*, y) − L(x, y)] ≥ (p_{l−1,l} + p_{l−1,l+1})/(p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}) · (1 − 3q_x/((1/3) − δ)).   (4.47)

Next we investigate the effect of decreasing a randomly chosen block of length l+1 by one unit. The score can decrease when the selected block of x of length l+1 is aligned with a block of length l+1 of y. If it is aligned with one block and that block has length l or l−1, then there is no decrease. This leads to

E[L(x̃, y) − L(x*, y) | G_{l+1,I}] ≥ − p_{l+1,l+1}/(p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1}),

where G_{l+1,I} denotes the event that the chosen block of length l+1 is aligned one block with one block. When the selected block of x of length l+1 is aligned with several blocks of y, then the score decreases by one unit. When the selected block of length l+1 of x is left out, or is aligned at the same time as other blocks of x with the same block of y, then there is no decrease. This leads to

E[L(x̃, y) − L(x*, y)] ≥ −P(G_{l+1,I}) p_{l+1,l+1}/(p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1}) − P(G_{l+1,II}),   (4.48)

where G_{l+1,II} denotes the event that the selected block of length l+1 is aligned with several blocks of y at the same time. Let p_{l+1} denote the total proportion of blocks of length l+1 among all blocks of x, and let p_{l+1,I} denote the proportion, among all blocks of x of length l+1, of blocks which are aligned one to one. There is a proportion q_x of totally left-out blocks in x, and at most a proportion δ of them lie, at the end of the alignment a, in a contiguous group of left-out blocks. That means (assuming q_x ≥ δ) that the proportion of left-out blocks of x which are not adjacent to another left-out block of x is at least q_x − δ. Going with each left-out block which is not adjacent to another left-out block, there is at least one adjacent block which is aligned, together with several other blocks of x, with the same block of y. This gives a lower bound of δ + 2(q_x − δ) for the proportion of blocks of x which are not aligned one block to one block, taken as a proportion among all blocks of x. Among the blocks of length l+1 this gives a proportion of at least

(δ + 2(q_x − δ)) / ((1/3) + δ),

since by the event D_n(δ) we know that, among all blocks of x, the proportion of blocks of length l+1 is less than (1/3) + δ. Hence

P(G_{l+1,I}) ≤ 1 − (δ + 2(q_x − δ))/((1/3) + δ).   (4.49)

Next, let us note that we can give an upper bound for the number of blocks of x aligned with several blocks of y. Since we never have several blocks aligned with several blocks, the number of blocks of x aligned with several blocks of y is not more than the total number of left-out blocks of y; this is so because between two blocks aligned with the same block there is always at least one left-out block. The proportion of left-out blocks of y is q_y, but this is taken as a proportion among all the blocks of y. Since the total numbers of blocks of x and of y need not be exactly the same, this number gets slightly changed when we report it as a proportion of the total number of blocks of x. The probability of selecting a block of length l+1 of x which is aligned with several blocks of y is thus less than or equal to

P(G_{l+1,II}) ≤ (q_y / p_{l+1}) · (N^Y_n / N^X_n).   (4.50)

By the event D_n(δ) we have

p_{l+1} ≥ (1/3) − δ   (4.51)

and by the event G_n(δ) we have

N^Y_n / N^X_n ≤ 1 + δ.   (4.52)

Applying now 4.51 and 4.52 to 4.50, we find

P(G_{l+1,II}) ≤ q_y (1 + δ) / ((1/3) − δ).   (4.53)

Finally, using inequalities 4.53 and 4.49 in 4.48 we get

E[L(x̃, y) − L(x*, y)] ≥ −(1 − (δ + 2(q_x − δ))/((1/3) + δ)) · p_{l+1,l+1}/(p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1}) − q_y(1 + δ)/((1/3) − δ).   (4.54)

Using inequalities 4.47 and 4.54 together we find

E[L(x̃, y) − L(x, y)] = E[L(x̃, y) − L(x*, y)] + E[L(x*, y) − L(x, y)]
≥ (p_{l−1,l} + p_{l−1,l+1})/(p_{l−1,l−1} + p_{l−1,l} + p_{l−1,l+1}) (1 − 3q_x/((1/3) − δ))
− (1 − (δ + 2(q_x − δ))/((1/3) + δ)) p_{l+1,l+1}/(p_{l+1,l−1} + p_{l+1,l} + p_{l+1,l+1})
− q_y(1 + δ)/((1/3) − δ).   (4.55)

Note next that we can apply lemma 3.2 with ∆ = n^{0.9} because of C_n. Hence we find that

|q_x − q_y| ≤ 2|δ| + 4 n^{−0.1} ln n.

We assume that n is large enough so that

|q_x − q_y| ≤ 3|δ|.
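The quantity bounded above, the expected change of the score under the grow-one/shrink-one block modification, can also be probed by direct simulation on small instances. The following is a toy sketch: the block count m, the block length l, the random seed and the O(n²) LCS dynamic program standing in for the left-most optimal alignment are all arbitrary modelling choices, not part of the proof.

```python
import random

def lcs_len(x, y):
    # standard O(|x||y|) dynamic program for the length of the LCS
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def random_blocks(m, l, rng):
    # m independent blocks, lengths uniform on {l-1, l, l+1}
    return [rng.choice((l - 1, l, l + 1)) for _ in range(m)]

def to_string(blocks):
    # alternate the two symbols so that consecutive blocks differ
    return "".join(str(i % 2) * b for i, b in enumerate(blocks))

def modified(blocks, l, rng):
    # grow one random block of length l-1 and shrink one of length l+1
    b = list(blocks)
    short = [i for i, v in enumerate(b) if v == l - 1]
    long_ = [i for i, v in enumerate(b) if v == l + 1]
    if short:
        b[rng.choice(short)] += 1
    if long_:
        b[rng.choice(long_)] -= 1
    return b

rng = random.Random(0)
l, m, trials = 4, 40, 200
gain = 0.0
for _ in range(trials):
    bx, by = random_blocks(m, l, rng), random_blocks(m, l, rng)
    x, y = to_string(bx), to_string(by)
    gain += lcs_len(to_string(modified(bx, l, rng)), y) - lcs_len(x, y)
print("average change in LCS length:", gain / trials)
```

The printed average is a Monte Carlo estimate of E[L(x̃, y) − L(x, y)] for this toy instance; individual samples can be negative, which is why the proof needs the high-probability events above.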
With the last inequality holding, we get from lemma 4.7 that if inequality 4.26 holds, then 4.27 must be satisfied. By the event E_n(ε), inequality 4.27 cannot be satisfied. Hence inequality 4.26 cannot hold, which implies that the expression on the left side of 4.26 is larger than or equal to ε. Together with inequality 4.55, this implies

E[L(x̃, y) − L(x, y)] ≥ ε.

Acknowledgments
The authors would like to thank the support of the German Science Foundation (DFG) through the International Graduate College "Stochastics and Real World Models" (IRTG 1132) at Bielefeld University and through the Collaborative Research Center 701 "Spectral Structures and Topological Methods in Mathematics" (CRC 701) at Bielefeld University.

References

[1] V. Chvatal and D. Sankoff. Longest common subsequences of two random sequences. J. Appl. Probability, 12:306–315, 1975.
[2] M. S. Waterman. Estimating statistical significance of sequence alignments. Phil. Trans. R. Soc. Lond. B, 344:383–390, 1994.
[3] J. Lember and H. Matzinger. Standard deviation of the longest common subsequence. Ann. Probab., 37(3):1192–1235, 2009.
[4] M. S. Waterman. Introduction to Computational Biology. Chapman & Hall, 1995.
[5] P. Pevzner. Computational Molecular Biology. MIT Press, Cambridge, MA, 2000. An algorithmic approach, A Bradford Book.
[6] M. S. Waterman and M. Vingron. Sequence comparison significance and Poisson approximation. Statistical Science, 9(3):367–381, 1994.
[7] R. Arratia and M. S. Waterman. A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Probab., 4(1):200–225, 1994.
[8] R. A. Baeza-Yates, R. Gavaldà, G. Navarro, and R. Scheihing. Bounding the expected length of longest common subsequences and forests. Theory Comput. Syst., 32(4):435–452, 1999.
[9] J. G. Deken. Some limit results for longest common subsequences. Discrete Math., 26(1):17–31, 1979.
[10] V. Dancik and M. Paterson.
Upper bounds for the expected length of a longest common subsequence of two binary sequences. Random Structures Algorithms, 6(4):449–458, 1995.
[11] V. Dancik and M. Paterson. Longest common subsequences. Lecture Notes in Comput. Sci., volume 841, pages 127–142. Springer, 1994.
[12] R. Hauser, S. Martinez and H. Matzinger. Large deviation based upper bounds for the LCS-problem. Advances in Applied Probability, 38:827–852, 2006.
[13] R. Hauser, H. Matzinger and C. Durringer. Approximation to the mean curve in the LCS problem. Stochastic Processes and their Applications, 118(1):629–648, 2008.
[14] M. Kiwi, M. Loebl, and J. Matousek. Expected length of the longest common subsequence for large alphabets. Preprint, 2003.
[15] K. Alexander. The rate of convergence of the mean length of the longest common subsequence. Ann. Appl. Probab., 4(4):1074–1082, 1994.
[16] C. Durringer, J. Lember and H. Matzinger. Deviation from the mean in sequence comparison with a periodic sequence. ALEA, 3:1–29, 2007.
[17] F. Bonetto and H. Matzinger. Fluctuations of the longest common subsequence in the case of 2- and 3-letter alphabets. Latin American Journal of Probability and Mathematics, 2:195–216, 2006.
[18] C. Houdre and H. Matzinger. Fluctuations of the optimal alignment score with an asymmetric scoring function. arXiv:math/0702036.
[19] J. M. Steele. An Efron–Stein inequality for non-symmetric statistics. Annals of Statistics, 14:753–758, 1986.
[20] J. Baik, P. Deift and K. Johansson. On the distribution of the length of the longest increasing subsequence of random permutations. J. Amer. Math. Soc., 12(4):1119–1178, 1999.
[21] C. A. Tracy and H. Widom. Level-spacing distributions and the Airy kernel. Comm. Math. Phys., 159:151–174, 1994.
[22] C. A. Tracy and H. Widom. On the distributions of the lengths of the longest monotone subsequences in random words. Comm. Math. Phys., 207:665–685, 1999.
[23] S. M. Ulam. Monte Carlo calculations in problems of mathematical physics. Modern Mathematics for the Engineer: Second Series, pages 261–281. McGraw-Hill, 1961.
[24] D. Aldous and P. Diaconis. Longest increasing subsequences: from patience sorting to the Baik–Deift–Johansson theorem. Bull. Amer. Math. Soc. (N.S.), 36(4):413–432, 1999.
[25] J. M. Hammersley. A few seedlings of research. Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, 1970/1971), Vol. I: Theory of Statistics, pages 345–394, 1972.
[26] B. F. Logan and L. A. Shepp. A variational problem for random Young tableaux. Advances in Math., 26(2):206–222, 1977.
[27] H. Matzinger and F. Torres. Random modification effect in the size of the fluctuation of the LCS of two sequences of i.i.d. blocks. Submitted, 2010.
[28] F. Torres. On the probabilistic longest common subsequence problem for sequences of independent blocks. Ph.D. thesis, University of Bielefeld, March 2009. Online: http://bieson.ub.uni-bielefeld.de/volltexte/2009/1473/
[29] G. Grimmett and D. Stirzaker. Probability and Random Processes. Oxford University Press, third edition, 2001.
[30] S. Amsalu, H. Matzinger and M. Vachkovskaia. Thermodynamical approach to the longest common subsequence problem.