An Improvement to Levenshtein's Upper Bound on the Cardinality of Deletion Correcting Codes
Daniel Cullina, Student Member, IEEE, and Negar Kiyavash, Member, IEEE
Abstract: We consider deletion correcting codes over a $q$-ary alphabet. It is well known that any code capable of correcting $s$ deletions can also correct any combination of $s$ total insertions and deletions. To obtain asymptotic upper bounds on code size, we apply a packing argument to channels that perform different mixtures of insertions and deletions. Even though the set of codes is identical for all of these channels, the bounds that we obtain vary. Prior to this work, only the bounds corresponding to the all insertion case and the all deletion case were known. We recover these as special cases. The bound from the all deletion case, due to Levenshtein, has been the best known for more than forty five years. Our generalized bound is better than Levenshtein's bound whenever the number of deletions to be corrected is larger than the alphabet size.

The material in this paper was presented (in part) at the International Symposium on Information Theory, Istanbul, Turkey, July 2013 [2]. This work was supported in part by NSF grants CCF 10-54937 CAR and CCF 10-65022 Kiyavash.

Daniel Cullina is with the Department of Electrical and Computer Engineering and the Coordinated Science Laboratory, University of Illinois at Champaign-Urbana, Urbana, Illinois 61801 (email: [email protected]).

Negar Kiyavash is with the Department of Industrial and Enterprise Systems Engineering and the Coordinated Science Laboratory, University of Illinois at Champaign-Urbana, Urbana, Illinois 61801 (email: [email protected]).

I. INTRODUCTION

Deletion channels output only a subsequence of their input while preserving the order of the transmitted symbols. Deletion channels are related to synchronization problems, a wide variety of problems in bioinformatics, and the communication of information over packet networks. This paper concerns channels that take a fixed length input string of symbols drawn from a $q$-ary alphabet and delete a fixed number of symbols. In particular, we are interested in upper bounds on the cardinality of the largest possible $s$-deletion correcting codebook.

Levenshtein derived asymptotic upper and lower bounds on the sizes of binary codes for any number of deletions [8]. These bounds easily generalize to the $q$-ary case [14]. He showed that the Varshamov-Tenengolts (VT) codes, which had been designed to correct a single asymmetric error [15], [16], could be used to correct a single deletion. The VT codes establish the asymptotic tightness of the upper bound in the case of a binary alphabet and a single deletion.

Since then, a wide variety of code constructions, which provide lower bounds, have been proposed for the deletion channel and other closely related channels. One recent construction uses constant Hamming weight deletion correcting codes [3]. In contrast, progress on upper bounds has been rare. Levenshtein eventually refined his original asymptotic bound (and the parallel nonbinary bound of Tenengolts) into a nonasymptotic version [10]. Kulkarni and Kiyavash recently proved a better upper bound for an arbitrary number of deletions and any alphabet size [7].

Another line of work has attacked some related combinatorial problems. These include characterization of the sets of superstrings and substrings of any string. Levenshtein showed that the number of superstrings does not depend on the starting string [9]. He also gave upper and lower bounds on the number of substrings using the number of runs in the starting string [8]. Calabi and Hartnett gave a tight bound on the number of substrings of each length [1]. Hirschberg extended the bound to larger alphabets [5]. Swart and Ferreira gave a formula for the number of distinct substrings produced by two deletions for any starting string [13]. Mercier et al. showed how to generate corresponding formulas for more deletions and gave an efficient algorithm to count the distinct substrings of any length of a string [12]. Liron and Langberg improved and unified existing bounds and constructed tightness examples [11]. Some of our intermediate results contribute to this area.
A. Upper bound technique
To derive our upper bounds, we use a packing argument that can be applied to any combinatorial channel. Any combinatorial channel can be represented by a bipartite graph. Channel inputs correspond to left vertices, channel outputs correspond to right vertices, and each edge connects an input to an output that can be produced from it. If two channel inputs share a common output, they cannot both appear in the same code. The degree of an input vertex in the graph is the number of possible channel outputs for that input. If the degree of each input is at least $r$ and there are $N$ possible outputs, any code contains at most $N/r$ codewords. For a channel that makes at most $s$ substitution errors, this argument leads to the well known Hamming bound.

Any code capable of correcting $s$ deletions is also capable of correcting any combination of $s$ total insertions and deletions (see Lemma 3). Despite this equivalence, this packing argument produces different upper bounds for channels that perform different mixtures of insertions and deletions. Let $C_{q,s,n}$ be the size of the largest $q$-ary $n$-symbol $s$-deletion correcting code. Prior to this work, the bounds on $C_{q,s,n}$ coming from the $s$-insertion channel and the $s$-deletion channel were known.

For the $s$-insertion channel, each $q$-ary $n$-symbol input has the same degree. For fixed $q$ and $s$, the degree is asymptotic to $\binom{n}{s}(q-1)^s$ (see (3)). There are $q^{n+s}$ possible outputs, so

$$C_{q,s,n} \lesssim \frac{q^{n+s}}{\binom{n}{s}(q-1)^s}. \tag{1}$$

The $s$-deletion case is slightly more complicated because different inputs have different degrees. For instance, the input strings consisting of a single symbol repeated $n$ times have only a single possible output: the string with that symbol repeated $n-s$ times. Consequently, using the minimum degree over all of the inputs yields a worthless bound. Using the following argument [8], Levenshtein showed that

$$C_{q,s,n} \lesssim \frac{q^{n}}{\binom{n}{s}(q-1)^s}. \tag{2}$$

The average degree of an input is asymptotic to $\left(\frac{q-1}{q}\right)^s \binom{n}{s}$ and most inputs have a degree close to that. The inputs can be divided into two classes: those with degree at least $(1-\epsilon)$ times the average degree and those with smaller degree. For an appropriately chosen $\epsilon$ that goes to zero as $n$ goes to infinity, the vast majority of inputs fall into the former class. Call members of the former class the typical inputs. The minimum degree argument can be applied to bound the number of typical inputs that can appear in a code. There are $q^{n-s}$ possible outputs, so the number of typical inputs in a code is asymptotically at most (2). We have no information about what fraction of the atypical inputs can appear in a code, but the total number of atypical inputs is small enough to not affect the asymptotics of the upper bound.

The bounds (1) and (2) have the same growth rates, but the bound on deletion correcting codes is a factor of $q^s$ better than the bound on insertion correcting codes, despite the fact that any $s$-deletion correcting code is an $s$-insertion correcting code and vice versa. Note that there is no possible improvement to the insertion channel bound from dividing the inputs into typical and atypical classes.

We extend this bounding strategy to channels that perform both deletions and insertions. We obtain a generalized upper bound that includes Levenshtein's bound as a special case. Recall that Levenshtein's bound is known to be tight for one deletion and alphabet size two. The new bound improves upon Levenshtein's bound whenever the number of deletions is greater than the alphabet size.

The rest of the paper is organized as follows. In Section II, we present some notation and basic results on deletion and insertion channels. In Section III, we construct a class of well-behaved edges in the channel graph. Together with an upper bound on the number of edges in the channel graph, the size of this class establishes the asymptotics of the average input degree. In Section IV, we prove a lower bound on the degree of each input vertex and use it to establish our main result: an upper bound on the size of a $q$-ary $s$-deletion correcting code.

II. PRELIMINARIES
A. Notation
Let $\mathbb{N}$ be the set of nonnegative integers. Let $[n]$ be the set of nonnegative integers less than $n$, $\{0, 1, \ldots, n-1\}$. Let $[q]^n$ be the set of $q$-ary strings of length $n$. Let $[q]^*$ be the set of $q$-ary strings of all lengths. More generally, for a set $S$, let $S^n$ be the set of lists of elements of $S$ of length $n$ and let $S^*$ be the set of lists of elements of $S$ of any length.

We will need the following asymptotic notation: let $a(n) \sim b(n)$ denote that $\lim_{n \to \infty} \frac{a(n)}{b(n)} = 1$ and $a(n) \lesssim b(n)$ denote that $\lim_{n \to \infty} \frac{a(n)}{b(n)} \le 1$. We will use the following asymptotic equality frequently: for fixed $c$, $\binom{n}{c} \sim \frac{n^c}{c!}$.

B. Deletion distance
The substring relation is a partial ordering of $[q]^*$. Consequently, for strings $x$ and $y$, we write $x \preceq y$ if $x$ is a substring of $y$. Throughout, a substring of $y$ is any string that can be obtained from $y$ by deleting symbols, so its symbols need not be contiguous in $y$.

Definition 1. For $x \in [q]^n$ and $y \in [q]^m$, define the deletion distance between them to be $d_L(x, y) = n + m - 2l$, where $l$ is the length of their longest common substring.

It is well known that deletion distance is a metric. We will need a slightly stronger property. Lemma 1 below is the source of the nice properties of the deletion distance.
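As a small illustration of Definition 1, the deletion distance can be computed from the length of a longest common substring (in the sense used here, a longest common subsequence) with the standard dynamic program. The sketch below is not part of the original paper; the function names `lcs_length` and `deletion_distance` are hypothetical, and strings are represented as Python strings or sequences of symbols.

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of x and y (standard DP)."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        cur = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def deletion_distance(x, y):
    """d_L(x, y) = n + m - 2l, with l the longest common substring length."""
    return len(x) + len(y) - 2 * lcs_length(x, y)
```

For example, `deletion_distance("01", "10")` evaluates to 2, since the two strings share only a single-symbol common substring.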
Lemma 1.
For $l, m, n \in \mathbb{N}$ with $l \le m$ and $l \le n$, let $x \in [q]^n$ and $y \in [q]^m$. Then there exists $z \in [q]^l$ such that $x \succeq z$ and $y \succeq z$ if and only if there exists $w \in [q]^{m+n-l}$ such that $w \succeq x$ and $w \succeq y$.

Proof: Given $x$, $y$, and $w$, a canonical $z$ can be constructed by a simple greedy algorithm. Given $x$, $y$, and $z$, at least one $w$ can be constructed by a similar algorithm.

The next lemma is a strengthening of the triangle inequality.

Lemma 2. For $l, m, n \in \mathbb{N}$ with $l \le m$ and $l \le n$, let $a = n - l$ and $b = m - l$. For $x \in [q]^n$ and $y \in [q]^m$, the following are equivalent:

A. There exists $z \in [q]^*$ such that $d_L(x, z) \le a$ and $d_L(y, z) \le b$.
B. $d_L(x, y) \le a + b$.
C. For all $0 \le i \le a + b$, there exists $z_i \in [q]^{l+2i}$ such that $d_L(x, z_i) \le a$ and $d_L(y, z_i) \le b$.

Proof: (A ⇒ B) Let the length of $z$ be $k$. Because $d_L(x, z) = a$, $x$ and $z$ have a common substring $u$ of length $(n + k - a)/2$. Similarly, $y$ and $z$ have a common substring $v$ of length $(m + k - b)/2$. By Lemma 1, $u$ and $v$ have a common substring $w$ of length $(n + k - a)/2 + (m + k - b)/2 - k = (m + n - a - b)/2 = l$. Because $w$ is a substring of both $x$ and $y$, $d_L(x, y) \le a + b$.

(B ⇒ C) Let $z$ be a common substring of $x$ and $y$ of length $l$. There are $u_i, v_i \in [q]^{l+i}$ such that $x \succeq u_i \succeq z$ and $y \succeq v_i \succeq z$. By Lemma 1, $u_i$ and $v_i$ have a common superstring $z_i$ of length $2(l + i) - l = l + 2i$. Because $u_i$ is a common substring of $x$ and $z_i$, $d_L(x, z_i) \le a$. Similarly, $d_L(y, z_i) \le b$.

(C ⇒ A) Trivial.

Corollary 1. Deletion distance is a metric.

Proof: Deletion distance is symmetric. Because $x$ is a substring of itself, $d_L(x, x) = 0$. Because the only substring of $x$ with the same length is $x$, $d_L(x, y) = 0$ implies $x = y$. From Lemma 2, deletion distance satisfies the triangle inequality.

C. Deletion and insertion channels
We formalize the problem of correcting deletions and insertions by defining the following sets.

Definition 2. For $x \in [q]^n$, define $S_{a,0}(x) = \{z \in [q]^{n-a} : z \preceq x\}$, the set of substrings of $x$ that can be produced by $a$ deletions. Define $S_{0,b}(x) = \{w \in [q]^{n+b} : w \succeq x\}$, the set of superstrings of $x$ that can be produced by $b$ insertions. Define $S_{a,b}(x) = \bigcup_{z \in S_{a,0}(x)} S_{0,b}(z)$.

The $a$-deletion $b$-insertion channel takes a string of length $n$, finds a substring of length $n - a$, and outputs a superstring of that substring of length $n - a + b$. Consequently, for each input $x$ to an $n$-symbol $a$-deletion $b$-insertion channel, $S_{a,b}(x)$ is the set of possible outputs.

The following graph completely describes the behavior of the $(l + a)$-symbol $a$-deletion $b$-insertion channel.

Definition 3. Let $B_{q,l,a,b}$ be a bipartite graph with left vertex set $[q]^{l+a}$ and right vertex set $[q]^{l+b}$. Vertices are adjacent if they have a common substring of length $l$.

If $x$ is a left vertex of $B_{q,l,a,b}$, then its neighborhood is $S_{a,b}(x)$. When two inputs share common outputs they can potentially be confused by the receiver.

Definition 4. A $q$-ary $n$-symbol $a$-deletion $b$-insertion correcting code is a set $C \subset [q]^n$ such that for any two distinct strings $x, y \in C$, $S_{a,b}(x) \cap S_{a,b}(y)$ is empty.

Lemma 3. For $a, b, n \in \mathbb{N}$ and $x, y \in [q]^n$, $S_{a,b}(x) \cap S_{a,b}(y) = \emptyset$ if and only if $d_L(x, y) > 2(a + b)$. Consequently, a set $C \subset [q]^n$ is a $q$-ary $n$-symbol $a$-deletion $b$-insertion correcting code if and only if for all distinct $x, y \in C$, $d_L(x, y) > 2(a + b)$.

Proof: Let $s = a + b$. Suppose there is some $z \in S_{a,b}(x) \cap S_{a,b}(y)$. Then $d_L(x, z) \le s$ and $d_L(y, z) \le s$, so $d_L(x, y) \le 2s$. If $d_L(x, y) \le 2s$, then by Lemma 2 there is some $w \in [q]^{n-a+b}$ such that $d_L(x, w) \le s$ and $d_L(y, w) \le s$.

III. CONSTRUCTING EDGES
To execute the strategy described in Section I-A, we need a lower bound on the degree of each channel input. This is a lower bound on the degree of each left vertex of $B_{q,l,a,b}$. To obtain this bound, we first construct a subset of the edges of $B_{q,l,a,b}$ that is easier to work with than the complete edge set. Our ultimate lower bound on the degree of an input will actually be a lower bound on the number of edges from this subset incident to the input vertex.

One way to get information about the size of a target set $T$ is to find a construction function $f : P \to T$, where $P$ is an easily counted parameter set. If $f$ is injective, then $|P| = |f(P)|$ and $|P| \le |T|$. We can demonstrate the injectivity of $f$ with a deconstruction function $g : T \to P$ that is a left inverse of $f$. This means that $g(f(p)) = p$ for all $p \in P$. If the function $g$ is given a constructible member of $T$, $g$ recovers the construction parameters that produce it. Similarly, if $f$ is surjective, then we can find an injective $g : T \to P$ that is a right inverse of $f$, so $|T| = |g(T)|$ and $|P| \ge |T|$. If $f$ is both injective and surjective, then $|P| = |T|$.

In this section we apply this method to the edge set of $B_{q,l,a,b}$. We give an upper bound on the number of edges and briefly discuss why it is difficult to count the edges exactly. We explain our construction of a subset of the edges and prove a lower bound on the size of this subset. Finally, we show that the upper and lower bounds match asymptotically.

A. An upper bound
By definition, two vertices in $B_{q,l,a,b}$ are adjacent if they share a substring of length $l$. This makes the common substring a natural construction parameter for the edge. We can construct an edge by starting with a string of length $l$, performing $a$ arbitrary insertions to obtain the left vertex, and performing $b$ arbitrary insertions to obtain the right vertex. Our upper bound will use the following fact about insertions due to Levenshtein [9]. Each $x \in [q]^{n-s}$ has the same number of superstrings of length $n$:

$$|S_{0,s}(x)| = I_{q,s,n}, \tag{3}$$

where

$$I_{q,s,n} = \sum_{i=0}^{s} \binom{n}{i} (q-1)^i.$$

For fixed $q$ and $s$, $I_{q,s,n} \sim \binom{n}{s}(q-1)^s$.

Lemma 4. For all $q, l, a, b \in \mathbb{N}$ with $s = a + b$, the number of edges in $B_{q,l,a,b}$ satisfies

$$|E(B_{q,l,a,b})| \le q^l I_{q,a,l+a} I_{q,b,l+b} \sim q^l \binom{l}{a}(q-1)^a \binom{l}{b}(q-1)^b \sim q^l \binom{l}{s}\binom{s}{a}(q-1)^s.$$

Proof: There are $q^l I_{q,a,l+a} I_{q,b,l+b}$ triples $(z, x, y) \in [q]^l \times [q]^{l+a} \times [q]^{l+b}$ such that $z \preceq x$ and $z \preceq y$. If $x \in [q]^{l+a}$ and $y \in [q]^{l+b}$ are adjacent in $B_{q,l,a,b}$, then they have at least one common substring of length $l$ and appear in at least one triple.

This upper bound is not an equality because many pairs of strings $(x, y) \in [q]^{l+a} \times [q]^{l+b}$ have multiple common substrings $z \in [q]^l$. Pairs of strings with multiple common substrings of length $l$ fall into two classes. Pairs in the first class have a common substring of length more than $l$. Call this string $w$. In this case, every substring of length $l$ of $w$ is a common substring of the pair. Pairs in the second class have multiple maximum length common substrings. For example, the strings 01 and 10 have both 0 and 1 as substrings.

To determine the exact number of edges in $B_{q,l,a,b}$, it is necessary to determine the sizes of both classes. The size of the first class can be found easily if the number of edges in $B_{q,l+i,a-i,b-i}$ is known for all $i$ up to $\min(a, b)$. It is more difficult to characterize the vertex pairs of the second class. Consequently, our lower bound will also not be tight.

Fig. 1. An example of an edge $(x, y)$ of the channel graph constructed from a common substring $z$ over the alphabet $[3]$.

B. Constructing edges at most once each
Our lower bound uses a different construction. To construct an edge $(x, y) \in E(B_{q,l,a,b})$, start with a string $z \in [q]^l$. As before, $z$ will be a substring of both endpoints of the edge. Let $s = a + b$. Partition $z$ into $s + 1$ nonempty intervals. To produce $x$, select $a$ of the $s$ boundaries between intervals and insert one new symbol into $z$ at each. To produce $y$, insert one new symbol into $z$ at each of the other $b$ boundaries. Figure 1 gives an example.

Each way to partition $z$ corresponds to a composition of $l$ with $s + 1$ parts.

Definition 5. A composition of $l$ with $t$ parts is a list of $t$ nonnegative integers with sum $l$. Let $M(t, l, k)$ be the family of compositions of $l$ with $t$ parts and each part of size at least $k$:

$$M(t, l, k) = \left\{ \lambda \in (\mathbb{N} \setminus [k])^t \,\middle|\, \sum_{i \in [t]} \lambda_i = l \right\}.$$

A standard argument shows that $|M(t, l, k)| = \binom{l - kt + t - 1}{t - 1}$. Thus the parameter set for this construction is

$$[q]^l \times M(s + 1, l, 1) \times \binom{[s]}{a} \times [q]^s,$$

where $\binom{[s]}{a}$ is the family of $a$-element subsets of $[s]$. The size of this set is $\binom{l-1}{s}\binom{s}{a} q^{l+s}$.

It is clear that there are many edges that this construction produces multiple times. We will show that if the following two restrictions are added to the construction procedure, each edge will be produced at most once:

• Each inserted symbol must differ from the leftmost symbol in the interval to its right.
• Each interval of $z$ must be nonalternating.

The first restriction is well posed because the intervals are nonempty. This restriction is needed because inserting a new symbol anywhere within a run of that same symbol has the same effect. Under the restriction, a run in $z$ can only be extended by inserting a matching symbol at the right end. To implement this restriction, for each insertion point we pick $\delta \in [q] \setminus \{0\}$ and make the inserted symbol equal to $\delta$ plus its successor.

The size of the parameter set for the construction under the first restriction is $q^l \binom{l-1}{s}\binom{s}{a}(q-1)^s$, which is very similar to the asymptotic upper bound of Lemma 4.

Fig. 2. An example of the construction procedure for a pair of strings. The INSERT function is applied to each triple $(\mathrm{LR} \times ([q] \setminus \{0\}) \times [q]^*)$ to produce a pair of string segments. CONSTRUCT concatenates these to produce the final pair.

Definition 6. A string is alternating if some $u \in [q]$ appears at all even indices, some $v \in [q]$ appears at all odd indices, and $u \ne v$. Let $A_{q,n}$ be the set of nonalternating $q$-ary strings of length $n$.

The empty string and all strings of length one are trivially alternating, so the shortest nonalternating strings have length two. For each length $n \ge 2$, each of the $q$ choices for $u$ and $q - 1$ choices for $v$ results in a unique string, so $|A_{q,n}| = q^n - q(q-1)$.

To explain the purpose of the second restriction, we must first describe the deconstruction procedure. Start with an edge $(x, y)$. Beginning at the left, find the longest matching prefix of $x$ and $y$ and delete it from both. This prefix is the first interval of $z$. Now the first symbols of $x$ and $y$ differ. One of these symbols is part of the next interval of $z$ and the other was an insertion, but we do not know which is which.

To resolve this situation, apply the following heuristic. Delete the first symbol of $x$ and determine the length of the longest common prefix of $y$ and the rest of $x$. Then do the same with the roles of $x$ and $y$ reversed. Assume that the deleted symbol that resulted in the longer common prefix was the insertion and that the longer prefix was the next interval of $z$. After removing this prefix, either the first symbols of $x$ and $y$ again differ or $x$ and $y$ are both the empty string. Apply this heuristic until the latter case is achieved.

We will show that this heuristic is always correct when applied to edges produced under the second restriction.

C. Formalization of the construction and deconstruction functions
Our construction function, CONSTRUCT, is specified in Algorithm 1 and our deconstruction function, DECONSTRUCT, is specified in Algorithm 2. Examples of the construction and deconstruction algorithms are provided in Figures 2 and 3. The functions treat strings as lists of symbols. We represent the empty list as $\epsilon$. We write the concatenation of $x$ and $y$ as $x : y$. The function HEAD returns the first symbol of a nonempty list and the function TAIL returns everything except the head. The function LENGTH returns the number of symbols in the string.

Fig. 3. An example of the deconstruction process. First, MATCH strips off the common prefix. The DELETE function tests whether a longer common prefix is achieved by deleting the head of the first string or the second string. The check marks indicate the longer match. It produces a triple specifying that deletion and prefix.

The CONSTRUCT function produces a pair of strings. As its input, CONSTRUCT takes $s + 1$ intervals of arbitrary lengths, a subset of $[s]$, and $s$ nonzero $q$-ary symbols. Let $\mathrm{LR} = \{\mathrm{LEFT}, \mathrm{RIGHT}\}$. We represent the subset $T \subseteq [s]$ as a string $t \in \mathrm{LR}^s$, where $t_i = \mathrm{LEFT}$ if $i \in T$ and $t_i = \mathrm{RIGHT}$ if $i \notin T$. Thus the input to CONSTRUCT is an element of

$$([q]^*)^{s+1} \times \mathrm{LR}^s \times ([q] \setminus \{0\})^s = [q]^* \times (\mathrm{LR} \times ([q] \setminus \{0\}) \times [q]^*)^s.$$

The INSERT function takes one of the triples $(\mathrm{LR} \times ([q] \setminus \{0\}) \times [q]^*)$ as an argument and outputs two strings. Let $w$ be the string from the triple. One of the output strings is $w$ and the other is $w$ with a single symbol inserted at the head. CONSTRUCT applies INSERT to each triple, concatenates the results, and prepends the remaining input string to each output.
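The description of INSERT and CONSTRUCT above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' code: symbols are assumed to be integers in `range(q)`, the sum of $\delta$ and the successor symbol is assumed to be taken modulo $q$, and the names `insert`, `construct`, `LEFT`, and `RIGHT` are hypothetical stand-ins for the paper's functions and labels. The sketch does not check that the intervals are nonalternating.

```python
LEFT, RIGHT = "L", "R"


def insert(q, lr, delta, w):
    """INSERT: return (w', w) or (w, w'), where w' is w with the symbol
    (delta + w[0]) mod q prepended. Assumes w is a nonempty list."""
    w_prime = [(delta + w[0]) % q] + w
    return (w_prime, w) if lr == LEFT else (w, w_prime)


def construct(q, w0, triples):
    """CONSTRUCT: apply INSERT to each (lr, delta, w) triple, concatenate
    the results, and prepend the first interval w0 to both outputs."""
    x, y = list(w0), list(w0)
    for lr, delta, w in triples:
        u, v = insert(q, lr, delta, list(w))
        x += u
        y += v
    return x, y
```

For instance, with $q = 3$, the call `construct(3, [1, 1], [(LEFT, 2, [0, 0, 1]), (RIGHT, 1, [2, 2])])` yields $x = 11200122$ and $y = 11001022$, so here $l = 7$ and $a = b = 1$.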
Algorithm 1 Construct an edge

    CONSTRUCT : [q]* × (LR × ([q] \ {0}) × [q]*)^s → [q]* × [q]*
    CONSTRUCT(w_0, t)
        (x, y) ← C(t)
        return (w_0 : x, w_0 : y)

    C : (LR × ([q] \ {0}) × [q]*)^s → [q]* × [q]*
    C(t)
        if t = ε then
            return (ε, ε)
        else
            (u, v) ← INSERT(HEAD(t))
            (x, y) ← C(TAIL(t))
            return (u : x, v : y)
        end if

    INSERT : LR × ([q] \ {0}) × [q]* → [q]* × [q]*
    INSERT(lr, δ, w)
        w′ ← (δ + HEAD(w)) : w
        if lr = LEFT then
            return (w′, w)
        else
            return (w, w′)
        end if

The MATCH function takes two strings $x$ and $y$, finds their longest common prefix, and outputs the prefix and the two corresponding suffixes. DECONSTRUCT uses MATCH to remove the common prefix of the input strings, then repeatedly calls DELETE. DELETE takes a pair of strings $x$ and $y$ that differ in their first symbol, and each application of DELETE undoes the effect of an INSERT. DELETE calls MATCH on $(\mathrm{TAIL}(x), y)$ and on $(x, \mathrm{TAIL}(y))$ and then performs the deletion that resulted in a longer common prefix. The information about the deletion and prefix becomes a triple $(\mathrm{LR} \times ([q] \setminus \{0\}) \times [q]^*)$. DELETE returns this triple along with the two suffixes from the match.
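The prose description of MATCH, DELETE, and DECONSTRUCT can be sketched in Python as follows. As before, this is a hedged illustration and not the authors' implementation: symbols are assumed to be integers in `range(q)`, differences are assumed to be taken modulo $q$, the recursion is written as a loop, and the names `match`, `delete`, and `deconstruct` are hypothetical.

```python
LEFT, RIGHT = "L", "R"


def match(x, y):
    """MATCH: return (longest common prefix, rest of x, rest of y)."""
    k = 0
    while k < len(x) and k < len(y) and x[k] == y[k]:
        k += 1
    return x[:k], x[k:], y[k:]


def delete(q, x, y):
    """DELETE: undo one INSERT. Assumes x and y are nonempty lists that
    differ in their first symbol."""
    g = (x[0] - y[0]) % q
    a, b, c = match(x[1:], y)  # treat the head of x as the inserted symbol
    d, e, f = match(x, y[1:])  # treat the head of y as the inserted symbol
    assert len(a) != len(d), "ambiguous: not a restricted-construction edge"
    if len(a) > len(d):
        return (LEFT, g, a), b, c
    return (RIGHT, (-g) % q, d), e, f


def deconstruct(q, x, y):
    """DECONSTRUCT: strip the common prefix, then call DELETE repeatedly."""
    w0, x, y = match(list(x), list(y))
    triples = []
    while x or y:
        assert x and y, "the two strings must be exhausted together"
        triple, x, y = delete(q, x, y)
        triples.append(triple)
    return w0, triples
```

As a usage example with $q = 3$, applying `deconstruct` to $x = 11200122$ and $y = 11001022$ returns the first interval `[1, 1]` followed by the triples `(LEFT, 2, [0, 0, 1])` and `(RIGHT, 1, [2, 2])`.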
D. Deconstruction
Now we will show that DECONSTRUCT is a left inverse of CONSTRUCT. The first step is to look at the inner functions: INSERT and DELETE.

Algorithm 2 Deconstruct an edge

    DECONSTRUCT : [q]* × [q]* → [q]* × (LR × ([q] \ {0}) × [q]*)^s
    DECONSTRUCT(x, y)
        (w_0, x, y) ← MATCH(x, y)
        return (w_0, D(x, y))

    D : [q]* × [q]* → (LR × ([q] \ {0}) × [q]*)^s
    D(x, y)
        if x = ε ∨ y = ε then
            assert x = ε ∧ y = ε
            return ε
        else
            (w, x, y) ← DELETE(x, y)
            return (w : D(x, y))
        end if

    DELETE : [q]* × [q]* → (LR × ([q] \ {0}) × [q]*) × [q]* × [q]*
    DELETE(x, y)
        g ← HEAD(x) − HEAD(y)
        (a, b, c) ← MATCH(TAIL(x), y)
        (d, e, f) ← MATCH(x, TAIL(y))
        assert LENGTH(a) ≠ LENGTH(d)
        if LENGTH(a) > LENGTH(d) then
            return ((LEFT, g, a), b, c)
        else
            return ((RIGHT, (−g), d), e, f)
        end if

    MATCH : [q]^i × [q]^j → [q]^k × [q]^(i−k) × [q]^(j−k)
    MATCH(x, y)
        w ← ε
        while x ≠ ε ∧ y ≠ ε ∧ HEAD(x) = HEAD(y) do
            w ← w : HEAD(x)
            x ← TAIL(x)
            y ← TAIL(y)
        end while
        return (w, x, y)

Lemma 5. For $lr \in \mathrm{LR}$, $\delta \in [q] \setminus \{0\}$, and $w \in A_{q,m}$, let $(x, y) = \mathrm{INSERT}(lr, \delta, w)$. Let $u, v \in [q]^*$ such that if both are nonempty, they have different first symbols. Then $\mathrm{DELETE}(x : u, y : v) = ((lr, \delta, w), u, v)$.

Proof: Let $w = w_0^{m-1} = (w_0, w_1, \ldots, w_{m-1})$. Without loss of generality let $lr = \mathrm{LEFT}$, so $x = (w_0 + \delta) : w$ and $y = w$. First, $\mathrm{DELETE}(x : u, y : v)$ computes $g = (w_0 + \delta) - w_0 = \delta$. Next, it evaluates $\mathrm{MATCH}(w : u, w : v) = (w, u, v)$ because either $u_0 \ne v_0$ or one of $u$ and $v$ is the empty string. Thus the length of the first match is $\mathrm{LENGTH}(w) = m$. Second, it evaluates $\mathrm{MATCH}((w_0 + \delta) : w : u, w_1^{m-1} : v)$. If the length of this match is at least $m - 1$, then $w_0 + \delta = w_1$ and $w_i = w_{i+2}$ for $0 \le i \le m - 3$. This would make $w$ alternating, so the length of the second match is at most $m - 2$. The first match is longer than the second, so the first branch of the if statement is taken and the function returns $((\mathrm{LEFT}, \delta, w), u, v)$.

Definition 7.
For all $q, l, a, b \in \mathbb{N}$, let $s = a + b$. Let $P_{q,l,s}$ be the set

$$\bigcup_{c \in M(s+1, l, 2)} A_{q,c_0} \times \prod_{i=1}^{s} \left( \mathrm{LR} \times ([q] \setminus \{0\}) \times A_{q,c_i} \right)$$

and let $P_{q,l,a,b}$ be the subset of $P_{q,l,s}$ with exactly $a$ appearances of LEFT.

Lemma 6. For all $q, l, s \in \mathbb{N}$ and $p \in P_{q,l,s}$, $\mathrm{DECONSTRUCT}(\mathrm{CONSTRUCT}(p)) = p$.

Proof: Let $p = (w_0, t_s, \ldots, t_1)$ where $t_i = (lr_i, \delta_i, w_i)$. The initial call to MATCH in DECONSTRUCT finds $w_0$, so $\mathrm{DECONSTRUCT}(\mathrm{CONSTRUCT}(p)) = (w_0, D(C(t_s, \ldots, t_1)))$. We show that $D(C(t_s, \ldots, t_1)) = (t_s, \ldots, t_1)$ by induction. For the base case, $D(C(\epsilon)) = D(\epsilon, \epsilon) = \epsilon$. For the induction step, note that $(u, v) = C(t_i, \ldots, t_1)$ can be taken to be the $u$ and $v$ in the statement of Lemma 5 because they are either both empty or they have different first symbols. Then Lemma 5 gives $D(C(t_{i+1}, \ldots, t_1)) = t_{i+1} : D(C(t_i, \ldots, t_1))$.

Lemma 7. For all $q, l, a, b \in \mathbb{N}$, $s = a + b$, and $p \in P_{q,l,a,b}$, $\mathrm{CONSTRUCT}(p) \in E(B_{q,l,a,b})$.

Proof: Let $(x, y) = \mathrm{CONSTRUCT}(p)$. Let $p = (w_0, t_1, \ldots, t_s)$ where $t_i = (lr_i, \delta_i, w_i)$. One output of $\mathrm{INSERT}(lr_i, \delta_i, w_i)$ is a strict superstring of $w_i$ and the other is $w_i$. Thus both $x$ and $y$ are superstrings of $w_0 : w_1 : \ldots : w_s$. The longer output of INSERT becomes part of $x$ exactly $a$ times, so the length of $x$ is $l + a$. Similarly, the length of $y$ is $l + b$.

E. The lower bound
Lemma 8.
For fixed $q, a, b \in \mathbb{N}$, $|P_{q,l,a,b}| \gtrsim q^l \binom{l}{s}\binom{s}{a}(q-1)^s$.

Proof: Refactor $P_{q,l,s}$ as

$$\mathrm{LR}^s \times ([q] \setminus \{0\})^s \times \bigcup_{\lambda \in M(s+1, l, 2)} \prod_{i=0}^{s} A_{q,\lambda_i}.$$

In $P_{q,l,a,b}$, the element of $\mathrm{LR}^s$ is one of the $\binom{s}{a}$ strings with exactly $a$ appearances of LEFT. There are $(q-1)^s$ possibilities for $([q] \setminus \{0\})^s$. For $\lambda_i \ge 2$, $|A_{q,\lambda_i}| = q^{\lambda_i} - q(q-1)$, so the size of the union is

$$\begin{aligned}
\sum_{\lambda \in M(s+1, l, 2)} \prod_{i=0}^{s} \left( q^{\lambda_i} - q(q-1) \right)
&\ge \sum_{\lambda \in M(s+1, l, 2)} \prod_{i=0}^{s} \left( q^{\lambda_i} - q^2 \right) \\
&= q^l \sum_{\lambda \in M(s+1, l, 2)} \prod_{i=0}^{s} \left( 1 - q^{2 - \lambda_i} \right) \\
&\ge q^l \sum_{\lambda \in M(s+1, l, 2 + \log_q l)} \prod_{i=0}^{s} \left( 1 - q^{2 - \lambda_i} \right) \\
&\ge q^l \binom{l - (1 + \log_q l)(s + 1) - 1}{s} \left( 1 - l^{-1} \right)^{s+1} \\
&\sim q^l \binom{l}{s}.
\end{aligned}$$

Thus $|P_{q,l,a,b}| \gtrsim q^l \binom{l}{s}\binom{s}{a}(q-1)^s$.

Our bounds establish the asymptotic growth of the number of edges.

Theorem 1. For fixed $q, a, b \in \mathbb{N}$, the number of edges in $B_{q,l,a,b}$ satisfies $|E(B_{q,l,a,b})| \sim q^l \binom{l}{s}\binom{s}{a}(q-1)^s$. The average of $|S_{a,b}(x)|$ over all $x \in [q]^n$ is asymptotic to $\binom{n}{s}\binom{s}{a}(q-1)^s q^{-a}$.

Proof: From Lemma 6 and Lemma 7, $|E(B_{q,l,a,b})| \ge |P_{q,l,a,b}|$. Lemma 4 provides the asymptotic upper bound and Lemma 8 provides the asymptotic lower bound.

For $x \in [q]^n$, the set $S_{a,b}(x)$ is the neighborhood of $x$ in $B_{q,n-a,a,b}$. Each edge involves exactly one of the $q^n$ left vertices, so the average degree is $|E(B_{q,n-a,a,b})| / q^n$, and $\binom{n-a}{s} \sim \binom{n}{s}$.

Now we can conclude that most edges are constructible by our method. This is a necessary condition for the asymptotic tightness of our ultimate lower bound on input degree.

IV. BOUNDS ON INPUT DEGREE AND CODE SIZE
Lemma 9.
Let x ∈ [ q ] n be a string with r runs. Let c be thelength of the longest alternating interval of x . Then | S a,b ( x ) | ,the number of unique strings that can be produced from x by a deletions and b insertions, is at least (cid:18) r − ( a + 1)( c + 1) a (cid:19)(cid:18) n − − a ( c + 1) − ( b + 1) cb (cid:19) ( q − b . Proof:
For each x ∈ [ q ] n , we identify a subset P x ⊆ P q,n − a,a,b such that for all p ∈ P x , C ONSTRUCT ( p ) = ( x, y ) .From Lemma 7, all y produced this way are in S a,b ( x ) . FromLemma 6, | S a,b ( x ) | ≥ | P x | .To produce an element of P x , we select a symbols of x fordeletion, select b spaces in x for insertion, and specify the b new symbols. The symbols selected of deletion and the spacesselected for insertion partition x into s + 1 intervals. To ensurethat none of these intervals are alternating, we will require thatall of the intervals contain at least c + 1 symbols.There are many equivalent ways to extends a run byinserting a matching symbol. C ONSTRUCT extends runs byadding a symbol at the right end, so we only select symbolsfor deletion from those at the right end of a run. We needthere to be at least c + 1 symbols between consecutive deletedsymbols. It is easier to enforce the stronger condition that thereare c + 1 end of run symbols between consecutive deletedsymbols. There are (cid:0) r − ( a +1)( c +1) a (cid:1) ways to pick the symbolsfor deletion that satisfy this condition.There are n − potential spaces in which an insertion canbe made. Insertions cannot be performed in the c + 1 spacesbefore and after a deleted symbol. In the worst case, all ofthese forbidden spaces are distinct, leaving n − − a ( c + 1) spaces to choose from. There must be c + 1 symbols betweenany two consecutive chosen spaces, before the first chosenspace, and after the last chosen space. Thus there must be atleast c spaces in each of these b + 1 intervals. Again, it iseasier to enforce the stronger condition that there are at least c spaces not near a deletion in each interval. Thus there arealways at least (cid:0) n − − a ( c +1) − ( b +1) cb (cid:1) ways to pick the spaces.Finally, for each of the b insertion points, we must specifythe difference inserted symbol and its successor. 
Thus, there are $(q-1)^b$ choices for this step.

The following argument, very similar to Lemma 4, shows that this degree lower bound is asymptotically tight. This is a generalization of a lemma of Levenshtein [8].

Lemma 10.
For all $q, n, r, a, b \in \mathbb{N}$ with $s = a + b$, if $x \in [q]^n$ has $r$ runs, then
$$|S_{a,b}(x)| \le \binom{r+a-1}{a} I_{q,b,n-a+b}.$$
Proof:
Any substring of $x$ can be specified by the number of symbols deleted from each run. This is a composition of $a$ with $r$ parts, so
$$|S_{a,0}(x)| \le |M(a, r, 0)| = \binom{r+a-1}{r-1} = \binom{r+a-1}{a}.$$
Each string in $S_{a,b}(x)$ is a superstring of one of these substrings. Each substring has exactly $I_{q,b,n-a+b}$ superstrings of length $n-a+b$.

If $r = pn$ for fixed $p$, both bounds are asymptotic to
$$\binom{r}{a}\binom{n}{b}(q-1)^b.$$

To apply Lemma 9 to a string, we need two statistics of that string: the number of runs and the length of the longest alternating interval. The next two lemmas concern the distributions of these statistics.
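The upper bound of Lemma 10 can be checked by brute force on short strings. The following Python sketch is only an illustration, not part of the original argument. It assumes that $S_{a,b}(x)$ may be enumerated by deleting $a$ symbols and then inserting $b$ symbols, and that $I_{q,b,m}$ has the closed form $\sum_{i=0}^{b}\binom{m}{i}(q-1)^i$ (Levenshtein's count of length-$m$ superstrings of a string of length $m-b$). All function names are hypothetical.

```python
from itertools import combinations, product
from math import comb


def delete_exactly(x, a):
    """Distinct strings obtained from the tuple x by deleting exactly a symbols."""
    n = len(x)
    return {tuple(x[i] for i in kept) for kept in combinations(range(n), n - a)}


def insert_exactly(y, b, q):
    """Distinct strings obtained from the tuple y by inserting exactly b symbols of range(q)."""
    current = {tuple(y)}
    for _ in range(b):
        current = {
            z[:i] + (symbol,) + z[i:]
            for z in current
            for i in range(len(z) + 1)
            for symbol in range(q)
        }
    return current


def s_ab(x, a, b, q):
    """Brute-force S_{a,b}(x): delete a symbols, then insert b symbols (assumed order)."""
    result = set()
    for y in delete_exactly(x, a):
        result |= insert_exactly(y, b, q)
    return result


def count_runs(x):
    """Number of maximal blocks of equal consecutive symbols in a nonempty string."""
    return 1 + sum(x[i] != x[i + 1] for i in range(len(x) - 1))


def superstring_count(q, b, m):
    """Assumed closed form for I_{q,b,m}: sum over i <= b of C(m, i) * (q - 1)^i."""
    return sum(comb(m, i) * (q - 1) ** i for i in range(b + 1))


def lemma10_bound(x, a, b, q):
    """Right-hand side of Lemma 10 for the string x."""
    r, n = count_runs(x), len(x)
    return comb(r + a - 1, a) * superstring_count(q, b, n - a + b)


def check_lemma10(q, n, a, b):
    """True if the Lemma 10 bound holds for every string in [q]^n."""
    return all(
        len(s_ab(x, a, b, q)) <= lemma10_bound(x, a, b, q)
        for x in product(range(q), repeat=n)
    )
```

For instance, `check_lemma10(2, 5, 1, 1)` enumerates all binary strings of length 5. The enumeration grows exponentially in $n$, so this is only practical for very small parameters.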
Lemma 11.
The number of $q$-ary strings of length $n$ with an alternating interval of length at least $c$ is at most $(n-c+1)q^{n-c+1}(q-1)$.

Proof: A string of length $n$ contains $n-c+1$ intervals of length $c$. If some interval of length at least $c$ is alternating, at least one of the intervals of length exactly $c$ is alternating. There are $q(q-1)$ choices for the symbols in the alternating interval and $q^{n-c}$ choices for the remaining symbols.

Lemma 12.
The number of $q$-ary strings of length $n$ with $\left(\frac{q-1}{q}-\epsilon\right)(n-1)+1$ or fewer runs is at most $q^n e^{-2(n-1)\epsilon^2}$.

Proof: For $x \in [q]^n$, let $x' \in [q]^{n-1}$ be the string of first differences of $x$. That is, let $x'_i = x_{i+1} - x_i \bmod q$. If $x$ has $r$ runs, then $x'_i$ is nonzero at the $r-1$ boundaries between runs. Thus there are $q\binom{n-1}{r-1}(q-1)^{r-1}$ strings with exactly $r$ runs. Each first-difference string corresponds to $q$ strings, one for each choice of the first symbol, so the number of strings with few runs is $q$ times
$$\sum_{i=0}^{\left(\frac{q-1}{q}-\epsilon\right)(n-1)} \binom{n-1}{i}(q-1)^i = q^{n-1}\sum_{i=0}^{\left(\frac{q-1}{q}-\epsilon\right)(n-1)} \binom{n-1}{i}\left(\frac{q-1}{q}\right)^{i}\left(\frac{1}{q}\right)^{n-1-i} \le q^{n-1}e^{-2(n-1)\epsilon^2}.$$
The upper bound comes from the application of Hoeffding’s inequality to the binomial distribution [6].

Now we have all of the ingredients required to execute the strategy described in Section I-A.
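The counting steps behind Lemmas 11 and 12 can be checked exhaustively for small parameters. The Python sketch below is illustrative only. It assumes that an alternating interval is a substring of the form $abab\ldots$ with $a \ne b$ and length at least 2, a definition given outside this section, and all function names are hypothetical.

```python
from itertools import product
from math import comb, exp


def count_runs(x):
    """Number of maximal blocks of equal consecutive symbols in a nonempty string."""
    return 1 + sum(x[i] != x[i + 1] for i in range(len(x) - 1))


def is_alternating(w):
    """Assumed definition: w = abab... with a != b and len(w) >= 2."""
    return (
        len(w) >= 2
        and w[0] != w[1]
        and all(w[i] == w[i - 2] for i in range(2, len(w)))
    )


def count_with_alternating_interval(q, n, c):
    """Number of strings in [q]^n containing an alternating interval of length exactly c."""
    return sum(
        any(is_alternating(x[i:i + c]) for i in range(n - c + 1))
        for x in product(range(q), repeat=n)
    )


def lemma11_bound(q, n, c):
    """Upper bound of Lemma 11."""
    return (n - c + 1) * q ** (n - c + 1) * (q - 1)


def run_histogram(q, n):
    """hist[r] = number of strings in [q]^n with exactly r runs, by enumeration."""
    hist = [0] * (n + 1)
    for x in product(range(q), repeat=n):
        hist[count_runs(x)] += 1
    return hist


def run_count_formula(q, n, r):
    """Closed form from the proof of Lemma 12: q * C(n-1, r-1) * (q-1)^(r-1)."""
    return q * comb(n - 1, r - 1) * (q - 1) ** (r - 1)


def few_runs_count(q, n, eps):
    """Number of strings with at most ((q-1)/q - eps)(n-1) + 1 runs."""
    threshold = ((q - 1) / q - eps) * (n - 1) + 1
    return sum(run_count_formula(q, n, r) for r in range(1, n + 1) if r <= threshold)


def lemma12_bound(q, n, eps):
    """Upper bound of Lemma 12: q^n * exp(-2 (n-1) eps^2)."""
    return q ** n * exp(-2 * (n - 1) * eps ** 2)
```

Comparing `run_histogram(q, n)[r]` with `run_count_formula(q, n, r)` checks the run-counting step, and comparing `few_runs_count` with `lemma12_bound` checks the Hoeffding step. The threshold comparison uses floating point, so values of `eps` that put the threshold exactly on an integer are best avoided.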
Lemma 13.
Let $q, a, b \in \mathbb{N}$ be fixed and let $s = a + b$. For all $t \in \mathbb{N}$, there is a sequence of subsets $T_n \subseteq [q]^n$ such that $|T_n|$ is $O(q^n/n^t)$ and
$$\min_{x \in [q]^n \setminus T_n} |S_{a,b}(x)| \gtrsim \frac{(q-1)^s}{q^a}\binom{n}{s}\binom{s}{b}.$$
Proof:
We form two classes of bad strings: strings with a long alternating interval, and strings with few runs. Call these classes $T'_n$ and $T''_n$ respectively. Let $T_n = T'_n \cup T''_n$.

A string falls into $T'_n$ if it has an alternating subinterval of length at least $c$. If we let $c = (t+1)\log_q n$, then by Lemma 11 we have
$$|T'_n| < n q^{n-c+1}(q-1) = n^{-t}q^{n+1}(q-1),$$
which is $O(q^n/n^t)$.

Over all strings in $[q]^n$, the average number of runs is $\frac{q-1}{q}(n-1)+1$. A string falls into $T''_n$ if it has at most $\left(\frac{q-1}{q}-\epsilon\right)(n-1)+1$ runs. If we let $\epsilon = \sqrt{\frac{t\log n}{2(n-1)}}$, then by Lemma 12 we have
$$|T''_n| \le q^n e^{-2(n-1)\epsilon^2} = q^n e^{-t\log n} = q^n/n^t.$$
For fixed $t$, this $\epsilon$ is $o(1)$, so
$$\left(\frac{q-1}{q}-\epsilon\right)(n-1)+1 \sim \frac{(q-1)n}{q}.$$

Now we can apply Lemma 9 to lower bound the degree of the strings in $[q]^n \setminus T_n$. The first multiplicative term in the lower bound is asymptotic to
$$\binom{\frac{q-1}{q}n-(a+1)((t+1)\log_q n+1)}{a} \sim \binom{\frac{q-1}{q}n}{a} \sim \left(\frac{q-1}{q}\right)^a\binom{n}{a}.$$
The second term is asymptotic to
$$\binom{n-1-2a-(2a+b+1)(t+1)\log_q n}{b} \sim \binom{n}{b}.$$
Thus
$$\min_{x \in [q]^n \setminus T_n} |S_{a,b}(x)| \gtrsim \left(\frac{q-1}{q}\right)^a\binom{n}{a}\binom{n}{b}(q-1)^b \sim \frac{(q-1)^s}{q^a}\binom{n}{s}\binom{s}{b}.$$

Our main theorem follows easily.
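The final asymptotic equivalence in the proof of Lemma 13 rests on a short calculation. As a sketch, using only $\binom{n}{k} \sim n^k/k!$ for fixed $k$ and $s = a + b$:

```latex
\begin{align*}
\left(\frac{q-1}{q}\right)^{a} \binom{n}{a}\binom{n}{b}(q-1)^{b}
  &\sim \frac{(q-1)^{a+b}}{q^{a}} \cdot \frac{n^{a}}{a!} \cdot \frac{n^{b}}{b!} \\
  &= \frac{(q-1)^{s}}{q^{a}} \cdot \frac{n^{s}}{s!} \cdot \frac{s!}{a!\,b!} \\
  &\sim \frac{(q-1)^{s}}{q^{a}} \binom{n}{s}\binom{s}{b}.
\end{align*}
```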
Theorem 2.
For fixed $q, s \in \mathbb{N}$, the number of codewords in an $n$-symbol $q$-ary $s$-deletion correcting code satisfies
$$C_{q,s,n} \lesssim \min_{0 \le b \le s} \frac{q^{n+b}}{(q-1)^s\binom{n}{s}\binom{s}{b}}.$$
Proof:
Consider an $a$-deletion $b$-insertion channel with $a + b = s$. By Lemma 3, any code for this channel can also correct $s$ deletions. There are $q^{n-a+b}$ possible outputs, so for any $T_n \subseteq [q]^n$,
$$C_{q,s,n} \lesssim \frac{q^{n-a+b}}{\min_{x \in [q]^n \setminus T_n} |S_{a,b}(x)|} + |T_n|.$$
By setting $t = s+1$ in Lemma 13 we obtain an asymptotic upper bound of
$$C_{q,s,n} \lesssim \frac{q^{n-a+b}}{\frac{(q-1)^s}{q^a}\binom{n}{s}\binom{s}{b}} + O\left(\frac{q^n}{n^{s+1}}\right) \sim \frac{q^{n+b}}{(q-1)^s\binom{n}{s}\binom{s}{b}}.$$

This improves (2), Levenshtein’s upper bound, by a factor of $\binom{s}{b}q^{-b}$. By setting $b$ to zero we recover Levenshtein’s bound. Whenever $s > q$, $\binom{s}{1}q^{-1} > \binom{s}{0}q^{0} = 1$, so setting $b$ to one in the generalized bound offers an improvement.

Corollary 2. If $q+1$ divides $s$, the size of a $q$-ary $s$-deletion correcting code satisfies
$$C_{q,s,n} \lesssim \frac{3\sqrt{s}\,q^{n+s+\frac{1}{2}}}{(q+1)^{s+1}(q-1)^s\binom{n}{s}}.$$
Proof:
We optimize over $b$ in Theorem 2. The factor $\binom{s}{b}q^{-b}$ is a constant times a binomial distribution:
$$\left(\frac{q+1}{q}\right)^s\binom{s}{b}\left(\frac{1}{q+1}\right)^b\left(\frac{q}{q+1}\right)^{s-b}.$$
The maximum is achieved by $b = \left\lfloor\frac{s+1}{q+1}\right\rfloor$. When $q+1$ divides $s$, the maximum is at least
$$\left(\frac{q+1}{q}\right)^s\frac{1}{3}\sqrt{\frac{q+1}{qs/(q+1)}} = \frac{(q+1)^{s+1}}{3\sqrt{s}\,q^{s+\frac{1}{2}}}$$
by Stirling’s approximation. See Appendix A for details.

V. CONCLUDING REMARKS
In this paper, we extended Levenshtein’s strategy for obtaining an upper bound on the size of deletion codes. Levenshtein’s bound arises from the deletion channel. We derived the corresponding bounds from channels that perform a mixture of deletions and insertions. This results in an improvement whenever the number of errors, $s$, is larger than the alphabet size, $q$. The best version of our bound uses a channel where the ratio of deletions to insertions is $q$ to one.

Our argument relies on the fact that the channel graphs are approximately regular in the asymptotic regime where the number of errors is fixed. A natural question is whether it can be extended to the regime where the number of errors is a constant fraction of the input length. However, it is not clear whether the graphs are approximately regular in the latter regime. The argument of this paper relies on the typical distance between errors going to infinity. Any interaction between two errors, which occurs via an alternating interval, becomes rare. When the typical distance does not grow with input length, interactions will not be rare and it will not be possible to simply discard the cases where they occur. Instead it will be necessary to understand the details of more types of interactions between errors.

APPENDIX A

One form of Stirling’s approximation is [4]
$$\sqrt{2\pi} \le \frac{n!}{\sqrt{n}\left(\frac{n}{e}\right)^n} \le e.$$
Then for $\alpha, \beta, n \in \mathbb{N}$, consider the binomial distribution produced by $(\alpha+\beta)n$ trials and success probability $\alpha/(\alpha+\beta)$. The most likely outcome is $\alpha n$ successes and the probability of that outcome is:
$$\max_i \binom{(\alpha+\beta)n}{i}\left(\frac{\alpha}{\alpha+\beta}\right)^{i}\left(\frac{\beta}{\alpha+\beta}\right)^{(\alpha+\beta)n-i} = \binom{(\alpha+\beta)n}{\alpha n}\left(\frac{\alpha}{\alpha+\beta}\right)^{\alpha n}\left(\frac{\beta}{\alpha+\beta}\right)^{\beta n}$$
$$\ge \frac{\sqrt{2\pi(\alpha+\beta)n}\left(\frac{(\alpha+\beta)n}{e}\right)^{(\alpha+\beta)n}}{e\sqrt{\alpha n}\left(\frac{\alpha n}{e}\right)^{\alpha n}\, e\sqrt{\beta n}\left(\frac{\beta n}{e}\right)^{\beta n}} \cdot \frac{\alpha^{\alpha n}\beta^{\beta n}}{(\alpha+\beta)^{(\alpha+\beta)n}} = \frac{\sqrt{2\pi}}{e^2}\sqrt{\frac{\alpha+\beta}{\alpha\beta n}} \ge \frac{1}{3}\sqrt{\frac{\alpha+\beta}{\alpha\beta n}}.$$

REFERENCES

[1] L. Calabi and W. E. Hartnett, “Some general results of coding theory with applications to the study of codes for the correction of synchronization errors,” Information and Control, vol. 15, no. 3, pp. 235–249, 1969.
[2] D. Cullina and N. Kiyavash, “An improvement to Levenshtein’s upper bound on the cardinality of deletion correcting codes,” in IEEE International Symposium on Information Theory Proceedings, July 2013.
[3] D. Cullina, A. Kulkarni, and N. Kiyavash, “A coloring approach to constructing deletion correcting codes from constant weight subgraphs,” in IEEE International Symposium on Information Theory Proceedings (ISIT), July 2012, pp. 513–517.
[4] W. Feller, An introduction to probability theory and its applications. John Wiley & Sons, 2008, vol. 2.
[5] D. Hirschberg, “Bounds on the number of string subsequences,” in Combinatorial Pattern Matching, 1999, pp. 115–122.
[6] W. Hoeffding, “Probability inequalities for sums of bounded random variables,” Journal of the American Statistical Association [...]
[7] [...] IEEE Transactions on Information Theory, 2012. [Online]. Available: http://arxiv.org/abs/1211.3128
[8] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions, and reversals,” in Soviet physics doklady, vol. 10, 1966, pp. 707–710.
[9] V. I. Levenshtein, “Elements of coding theory,” Diskretnaya matematika i matematicheskie voprosy kibernetiki, pp. 207–305, 1974.
[10] V. I. Levenshtein, “Bounds for deletion/insertion correcting codes,” in IEEE International Symposium on Information Theory Proceedings, 2002, p. 370.
[11] Y. Liron and M. Langberg, “A characterization of the number of subsequences obtained via the deletion channel,” in IEEE International Symposium on Information Theory Proceedings, 2012, pp. 503–507.
[12] H. Mercier, M. Khabbazian, and V. Bhargava, “On the number of subsequences when deleting symbols from a string,” IEEE Transactions on Information Theory, vol. 54, no. 7, pp. 3279–3285, 2008.
[13] T. G. Swart and H. C. Ferreira, “A note on double insertion/deletion correcting codes,” IEEE Transactions on Information Theory, vol. 49, no. 1, pp. 269–273, 2003.
[14] G. Tenengolts, “Nonbinary codes, correcting single deletion or insertion (corresp.),” IEEE Transactions on Information Theory, vol. 30, no. 5, pp. 766–769, 1984.
[15] R. Varshamov, “On an arithmetic function with an application in the theory of coding,” Doklady Akademii nauk SSSR, vol. 161, pp. 540–543, 1965.
[16] R. Varshamov and G. Tenengolts, “Codes which correct single asymmetric errors,”