Deterministic Sparse Suffix Sorting in the Restore Model
DDeterministic Sparse Suffix Sorting in the Restore Model ∗ Johannes Fischer , Tomohiro I , and Dominik K¨oppl Department of Computer Science, TU Dortmund, Germany Kyushu Institute of Technology, Japan
Abstract
Given a text T of length n , we propose a deterministic online algorithm computing the sparse suffixarray and the sparse longest common prefix array of T in O ( c √ lg n + m lg m lg n lg ∗ n ) time with O ( m )words of space under the premise that the space of T is rewritable, where m ≤ n is the number of suffixesto be sorted (provided online and arbitrarily), and c is the number of characters with m ≤ c ≤ n thatmust be compared for distinguishing the designated suffixes. Sorting suffixes of a long text lexicographically is an important first step for many text processing algo-rithms [36]. The complexity of the problem is quite well understood, as for integer alphabets suffix sortingcan be done in optimal linear time and in-place [29, 18]. In this article, we consider a variant of the problem:instead of computing the order of every suffix, we address the sparse suffix sorting problem . Given a text T [1 ..n ] of length n and a set P ⊆ [1 ..n ] of m arbitrary positions in T , the problem asks for the (lexicographic)order of the suffixes starting at the positions in P . The answer is encoded by a permutation of P , which iscalled the sparse suffix array (SSA) of T (with respect to P ) and denoted by SSA ( T, P ).Applications are found in external memory LCP-array construction algorithms [23], and in the search ofmaximal exact matches [26, 44], i.e., substrings found in two given strings that can be extended neither totheir left nor to their right without getting a mismatch.Like the “full” suffix arrays, we can enhance SSA ( T, P ) with the lengths of the longest common prefixes(LCPs) between adjacent suffixes in SSA ( T, P ). These lengths are stored in the sparse longest commonprefix array (SLCP) , which we denote by SLCP ( T, P ). In combination, SSA ( T, P ) and SLCP ( T, P ) storethe same information as the sparse suffix tree , i.e., they implicitly represent a compacted trie over allsuffixes starting at the positions in P . The sparse suffix tree is an efficient index for pattern matching [28].Based on classic suffix array construction algorithms [25, 33], sparse suffix sorting is easily conducted in O ( n ) time if O ( n ) words of additional working space is available. For m = o ( n ), however, the working spacemay be too large, compared to the final space requirement of SSA ( T, P ). Although some special choicesof P admit space-optimal O ( m )-words construction algorithms (e.g. [24], see also the related work listedin [7]), the problem of sorting arbitrary suffixes in small space seems to be much harder. We are aware of thefollowing results: As a deterministic algorithm, K¨arkk¨ainen et al. [25] gave a trade-off using O ( τ m + n √ τ )time and O ( m + n/ √ τ ) words of working space with a parameter τ ∈ [1 .. √ n ]. If randomization is allowed,there is a technique based on Karp-Rabin fingerprints, first proposed by Bille et al. [7] and later improvedby I et al. [20]. Gawrychowski and Kociumaka [16] presented an algorithm running with O ( m ) words ofadditional space in either O ( n √ lg m ) expected time, or in O ( n ) time as a Monte Carlo algorithm (i.e., theoutput is correct only with high probability). Most recently, Prezza [35] presented a Monte Carlo algorithmin the restore model [8] that runs with O ( m ) words of space in O ( n + m lg n ) expected time. ∗ Parts of this work have already been presented at the 12th Latin American Symposium [12]. a r X i v : . [ c s . D S ] F e b .1 Computational Model Let lg and log x denote the logarithm to the base two and to the base x for a real number x , respectively.Our computational model is the word RAM model with word size Ω(lg n ). Here, characters use d lg σ e bits,where σ is the alphabet size; hence, b log σ n c characters can be packed into one word. Comparing two strings X and Y therefore takes O (lcp( X, Y ) / log σ n ) time, where lcp( X, Y ) denotes the length of the LCP of X and Y .We assume that the text T of length n is loaded into RAM. We work with the restore model [8], wherealgorithms are allowed to overwrite parts of T , as long as they can restore T to its original form at termination.Apart from this space, we are only allowed to use O ( m ) words. The positions in P are assumed to arriveon-line, implying in particular that they need not be sorted. We aim at worst-case efficient deterministic algorithms. Our main algorithmic idea is to insert the suffixes starting at the positions of P into a self-balancing binarysearch tree [21]; since each insertion invokes O (lg m ) suffix-to-suffix comparisons, the time complexity is O ( t S m lg m ), where t S is the cost for a suffix-to-suffix comparison. If all suffix-to-suffix comparisons areconducted naively by comparing the characters ( t S = O ( n/ log σ n )), the resulting worst case time complexityis O ( nm lg m/ log σ n ). In order to speed this up, our algorithm identifies large identical substrings at differentpositions during different suffix-to-suffix comparisons. Instead of performing naive comparisons on identicalparts over and over again, we build a data structure (stored in redundant text space) to accelerate subsequentsuffix-to-suffix comparisons. Informally, when two (possibly overlapping) substrings in the text are detectedto be the same, one of them can be overwritten.To accelerate suffix-to-suffix comparisons, we devise a new data structure called hierarchical sta-ble parsing (HSP) tree that is based on edit sensitive parsing (ESP) [11]. The HSP tree sup-ports longest common extension (LCE) queries. An LCE query lce( i, j ) on an HSP tree asks for thelength lcp( T [ i.. ] , T [ j.. ]) of the LCP of two suffixes starting at the respective positions i and j of the text T on which the tree is built. Besides answering LCE queries, HSP trees are mergeable , allowing us to build adynamically growing LCE index on substrings read in the process of the sparse suffix sorting. Consequently,comparing two already indexed substrings is done by a single LCE query.In their plain form, HSP trees need more space than the text itself; to overcome this space problem, wedevise a truncated version of the HSP tree, yielding a trade-off parameter between space consumption andLCE query time. By choosing this parameter appropriately, the truncated HSP tree fits into the text space.With a text space management specialized on the properties of the HSP, we achieve the result of Thm. 1below.We make the following definition that allows us to analyze the running time more accurately. Define C := S p,p ∈P ,p = p [ p..p +lcp( T [ p.. ] , T [ p .. ])] as the set of positions that must be compared for distinguishing thesuffixes starting at the positions of P . Then sparse suffix sorting is trivially lower bounded by Ω( |C| / log σ n )time. With the definition of C , we now can state the main result of this article as follows: Theorem 1.
Given a text T of length n that is loaded into RAM, the SSA and SLCP of T for a setof m arbitrary positions can be computed deterministically in O ( |C| ( √ lg σ + lg lg n ) + m lg m lg n lg ∗ n ) = O ( |C| √ lg n + m lg m lg n lg ∗ n ) time, using O ( m ) words of additional working space.Excluding the loading cost for the text, the running time can be sublinear (when |C| = o ( n/ √ lg n ) and m lg m = o ( n/ lg n lg ∗ n )). To the best of our knowledge, this is the first algorithm that refines the worst-caseperformance guarantee. All previously mentioned (deterministic and randomized) algorithms take Ω( n ) timeeven if we exclude the loading cost for the text. Also, general string sorters (e.g., forward radix sort [2] ormultikey quicksort [4]), which do not take advantage of the overlapping of suffixes, suffer from the lowerbound of Ω( ‘/ log σ n ) time, where ‘ is the sum of all LCP values in the SLCP, which is always at least |C| ,but can in fact be Θ( nm ). 2 onstruction Data StructureTime Working Space Space Query Time Ref O ( nτ ) O (cid:0) nτ (cid:1) O (cid:0) nτ (cid:1) O (cid:0) τ lg min( τ, nτ (cid:1) [42] O (cid:0) n (cid:15) (cid:1) O (cid:0) nτ (cid:1) O (cid:0) nτ (cid:1) O ( τ ) [6] O (cid:16) n (cid:16) lg ∗ n + lg nτ + lg τ log σ n (cid:17)(cid:17) O (cid:0) max (cid:0) n lg n , τ lg 3 lg ∗ n (cid:1)(cid:1) O (cid:0) nτ (cid:1) O (cid:16) lg ∗ n (cid:16) lg (cid:0) ‘τ (cid:1) + τ lg 3 log σ n (cid:17)(cid:17) Thm. 3 O (cid:16) n (cid:16) lg ∗ n + lg nτ + lg τ log σ n (cid:17)(cid:17) O (cid:0) τ lg 3 lg ∗ n (cid:1) O (cid:0) nτ (cid:1) O (cid:16) lg ∗ n (cid:16) lg (cid:0) nτ (cid:1) + τ lg 3 log σ n (cid:17)(cid:17) Cor. 29
Figure 1: Deterministic LCE data structures with trade-off parameters. The length returned by an LCEquery is denoted by ‘ . (cid:15) and τ with (cid:15) > ≤ τ ≤ n are constants. Space is measured in words . Thecolumn Working Space lists the working space needed to construct a data structure, whereas the column
Space lists the final space needed by a data structure.
The LCE-problem is to preprocess a text T such that subsequent LCE queries can be answered efficiently.Data structures for LCE and sparse suffix sorting are closely related, as shown in the following observation: Observation 2.
Given a data structure that answers LCE queries in O ( τ ) time for τ >
0, we can computesparse suffix sorting for m positions in O ( τ m lg m ) time by inserting suffixes into a balanced binary searchtree [21]. Conversely, given an algorithm computing the SSA and the SLCP of a text T of length n for m positions in O ( f ( n, m )) time with O ( m ) words of space for a function f , we can construct a data structurein O (max( f ( n, m ) , n/m )) time with O ( m ) words of space, answering LCE queries on T in O ( n /m ) time. Proof.
The first claim is trivial. For the second claim, we use the data structure of [5, Theorem 1a] thatanswers LCE queries in O ( τ ) time. The data structure uses the SSA and SLCP values of those suffixes whosestarting positions are in a difference cover sampling modulo τ . This difference cover consists of O ( n/ √ τ )text positions, and can be computed in O ( √ τ ) time [9]. We obtain the claimed bounds on time and spaceby setting τ := n /m .There has been a great interest in devising deterministic LCE data structures with trade-off parame-ters (see Fig. 1), or in compressed space [43, 32, 19]. One of the currently best data structures with atrade-off parameter is due to Tanimura et al. [42], using O ( n/τ ) words of space and answering LCE queriesin O ( τ lg min( τ, n/τ )) time, for a trade-off parameter τ with 1 ≤ τ ≤ n . However, this data structure hasa preprocessing time of O ( nτ ), and is thus not helpful for sparse suffix sorting. We develop a new datastructure for LCE with the following properties. Theorem 3.
There is a deterministic data structure using O ( n/τ ) words of space that answers an LCEquery ‘ := lce( i, j ) for two text positions i and j with 1 ≤ i, j ≤ n on a text of length n in O (lg ∗ n (lg( ‘/τ ) + τ lg 2 / log σ n )) time, where 1 ≤ τ ≤ n . We can build the data structure in O ( n (lg ∗ n +(lg n ) /τ +(lg τ ) / log σ n ))time with additional O (max( n/ lg n, τ lg 3 lg ∗ n )) words during construction.The construction time of our data structure is upper bounded by O ( n lg n ), and hence it can be con-structed faster than the deterministic data structures in [42] when τ = Ω(lg n ). We start with Sect. 2 introducing the edit sensitive parsing, and giving a motivation for our hierarchicalstable parsing whose description follows in Sect. 3. Section 3.3 shows the general techniques for answeringLCE queries with the HSP tree. Subsequently, Sect. 4 introduces our algorithm for the sparse suffix sortingproblem with an abstract data type dynLCE that supports LCE queries and a merging operation. Theremainder of that section shows that the HSP tree from Sect. 3 fulfills all properties of a dynLCE; inparticular, HSP trees support the merging operation. The last part of this article is dedicated to the study3n how the text space can be exploited with the HSP technique to improve the memory footprint. This leadsus to truncated HSP trees with a merging operation that is tailored to working in text space (Sect. 5). Withthe truncated HSP trees we finally solve the sparse suffix sorting problem in the time and space as claimedin Thm. 1.
Let Σ be an ordered alphabet of size σ whose characters are represented by integers. For a string X ∈ Σ ∗ ,let | X | denote the length of X . For a position 1 ≤ i ≤ | X | in X , let X [ i ] denote the i -th character of X . Forpositions i and j with 1 ≤ i, j ≤ | X | , let X [ i..j ] = X [ i ] X [ i + 1] · · · X [ j ]. Given T = XY Z with
X, Y, Z ∈ Σ ∗ , X , Y and Z are called a prefix , substring , suffix of T , respectively. In particular, the suffix beginning atposition i is denoted by T [ i.. ]. A period of a string Y is a positive integer p < | Y | such that Y [ i ] = Y [ i + p ]for all integers i with 1 ≤ i ≤ | Y | − p .For a binary string T ∈ { , } ∗ we are interested in the operation T. rank ( j ) that counts the numberof ‘1’s in T [1 ..j ]. This operation can be performed in constant time by a data structure [22] that takes o ( | T | )extra bits of space, and can be constructed in time linear in | T | .An interval I = [ b..e ] is the set of consecutive integers from b to e , for b ≤ e . For an interval I , we usethe notations b ( I ) and e ( I ) to denote the beginning and the end of I ; i.e., I = [ b ( I ) .. e ( I )]. We write |I| todenote the length of I ; i.e., |I| = e ( I ) − b ( I ) + 1. The crucial technique used in this article is the so-called alphabet reduction. The alphabet reduction isused to partition a string deterministically into blocks. The first work introducing the alphabet reductiontechnique to the string context was done by Mehlhorn et al. [31]. They presented the so-called signatureencoding . The signature encoding is derived from a tree coloring approach [17]. It supports string equalitychecks in the scenario where strings can be dynamically concatenated or split. In the same context, Sahinalpand Vishkin [38] studied the maximal number of characters to the left and to the right of a substring Z of Y such that changing one of these characters to the left or to the right of Z can affect how Z is parsed bythe signature encoding of Y . In a later work, Alstrup et al. [1] enhanced signature encoding with additionalqueries like LCE. Recently, an LCE data structure using signature encoding in compressed space was shownby Nishimoto et al. [32]. A slightly modified version of signature encoding is proposed by Sakamoto et al. [39],showing that alphabet reduction can be used to build a grammar compressor whose approximation ratio tothe size of the smallest grammar is O (lg ∗ n lg n ).A modified parsing was introduced by Cormode and Muthukrishnan [11]. They modified the parsingby restricting the block size from two up to three characters, and named their technique edit sensitiveparsing (ESP). Initially used for approximating the edit distance with moves, the ESP technique has beenfound to be applicable to building self-indexes [41]. We stick to the ESP technique, because the size of thesubtree of a node in the ESP tree is bounded. In this section, we first introduce the ESP technique, and thengive a motivation for a modification of the ESP technique, which we call hierarchical stable parsing (HSP).Before that, we recall the alphabet reduction and the ESP trees. Given a string Y in which no two adjacent characters are the same, i.e., Y [ i − = Y [ i ] for every integer i with 2 ≤ i ≤ | Y | , we can partition Y (except at most the first lg ∗ σ positions) into blocks of size two or threewith a technique called alphabet reduction [11, Section 2.1.1]. It consists of three steps (see also Fig. 2):First, it reduces the alphabet size to at most eight, in which every character has a rank from zero to seven.Subsequently, it substitutes characters with ranks four to seven with characters having a rank between zeroand two. By doing so, it shrink the alphabet size to three. Finally, it identifies certain text positions aslandmarks that determine the block boundaries. 4or reducing the alphabet size, we assume that σ ≥
9, otherwise we skip this step. The task is togenerate a surrogate string Z on the alphabet { , , } such that the entry Z [ i ] depends only on the substring Y [ i..i + lg ∗ σ ], for 1 ≤ i ≤ | Y | − lg ∗ σ . To this end, we regard Y as an array of binary numbers, i.e., Y [ i ][ ‘ ] ∈ { , } for an integer ‘ with 1 ≤ ‘ ≤ d lg σ e . We create an array Z of length | Y | − .. d lg σ e − i with 2 ≤ i ≤ | Y | , we compare Y [ i ] with Y [ i − ‘ := lcp( Y [ i − , Y [ i ]), and write 2 ‘ + Y [ i ][ ‘ + 1] to Z [ i ] (remember that we treat Y [ i ] as a binarystring ). By doing so, no two adjacent integers are the same in Z [11, Lemma 1]. Having computed Z , werecurse on Z until Z stores integers of the domain { , . . . , } . Note that the alphabet cannot be reducedfurther with this technique, since 2 d lg x e ≥ x for every integer x with 2 ≤ x ≤
6. To obtain the final Z , werecurse at most lg ∗ σ times. Let r be the number of recursions. Then we have | Y | = | Z | + r .If we skipped this step because of a small alphabet size ( σ ≤ Z [ i ] to the rank of Y [ i ]induced by the linear order of Σ (e.g., Z [ i ] = 0 if Y [ i ] is the smallest character). Since | Y | = | Z | , we set r tozero.To reduce the domain further, we iterate over the values j = 3 , . . . , Z [ i ] = j with the lowest value of { , , } that does not occur in its neighboring entries ( Z [ i −
1] and Z [ i + 1], if they exist). Finally, Z contains only numbers between zero and two.In the final step we create the landmarks that determine the block boundaries. The landmarks obey theproperty that the distance between two subsequent landmarks is greater than one, but at most three. Theyare determined by local maxima and minima: First, each number Z [ i ] that is a local maximum is made intoa landmark. Second, each local minimum that is not yet neighbored by a landmark is made into a landmark.Finally, we create blocks by associating each position in Z with its closest landmark. Positions associatedwith the same landmark are put into the same block. As a tie breaking rule we favor the right landmarkin case that there are two closest landmarks. The last thing to do is to map each block covering Z [ i..j ] to Y [ i + r..j + r ].The tie breaking rule can cause a problem when Z [1] and Z [3] are landmarks, i.e., the leftmost blockcontains only one character. We circumvent this problem by fusing the blocks of the first and secondlandmark to a single block. If this block covers four characters, we split it evenly.Altogether, the alphabet reduction needs O ( | Y | lg ∗ σ ) time, since we perform r ≤ lg ∗ σ reduction steps,while determining the landmarks and computing the blocks take O ( | Y | ) time. The steps are summarized inthe following lemma: Lemma 4.
Given a string Y in which no two adjacent characters are the same, the alphabet reductionapplied on Y partitions Y into blocks, except at most d lg ∗ σ e positions at the left. It runs in O ( | Y | lg ∗ σ )time.The main motivation of introducing the alphabet reduction is the following lemma that shows thatapplying the alphabet reduction on a text Y and on a pattern X generates the same blocks in X as in alloccurrences of X in Y , except at the left and right borders of a specific length: Lemma 5 ([11, Lemma 4]) . Given a substring X of a string Y in which no two adjacent characters are the same, the alphabetreduction applied to X alone creates the same blocks as the blocksrepresenting the substring X in Y , except for at most ∆ L := d lg ∗ σ e + 5 characters at the left border, and ∆ R := 5 charactersat the right border. Y = X = X ∆ L ∆ R Given a block β , we call the substring Y [ b ( β ) − ∆ L .. e ( β ) + ∆ R ] the local surrounding of β , if it exists(i.e., b ( β ) − ∆ L ≥ e ( β )+ ∆ R ≤ | Y | ). Blocks whose local surroundings exist are also called surrounded .A consequence of Lemma 5 is the following: Given that X is the local surrounding of a surrounded block β ,then the blocking of every occurrence of X in Y is the same, except at most ∆ L and ∆ R characters atthe left and right borders, respectively. We conclude that the blocking of every occurrence of X has ablock X [1 + ∆ L .. ∆ L + | β | ] that is equal to Y [ b ( β ) .. e ( β )] (see Fig. 3). Fix an arbitrary rule whether Y [ i ][1] is the least significant or most significant bit. lph. red.alph. red.4 t s u k u m o g a m i1 1 1 0 1 0 0 0 0 0 00 0 0 1 0 1 1 0 0 1 11 0 1 0 1 1 1 1 0 1 00 1 0 1 0 0 1 1 0 0 00 1 1 1 1 1 1 1 1 1 11 2 3 2 7 3 6 2 5 4 Y = Z = Y Z landmarks
Figure 2: Alphabet reduction applied on the string Y = tsukumogami . We represent the characters with thefive lowest bits of the ASCII encoding. Left:
A single step of the alphabet reduction. The bit representationof each character Y [ i ] is shown vertically in the left figure (the most significant bit is on the top). Thealphabet reduction matches the least significant bits of two adjacent entries, and returns twice the numberof matched bits plus the mismatched bit of the right character (highest shaded bit). The resulting integerarray Z is the last row. Middle:
A second step of the alphabet reduction, where the result of the firstalphabet reduction stored in Z is put into Y . Right:
Computation of the blocks. Two steps of the alphabetreduction (seen in the left and in the middle image) yield a sequence consisting only of integers within thedomain { , . . . , } . Subsequently, all ‘4’s are replaced (in this case by ‘2’ since the neighboring values are ‘0’and ‘1’ in both cases), and the maxima and certain minima are made into landmarks (shaded). Finally, theboxes in the last row are the computed blocks. Y = βX b ( β ) e ( β ) ∆ L ∆ R Y = βX ∆ L ∆ R βX ∆ L ∆ R Figure 3:
Left:
Surrounded block β with local surrounding X contained in a string Y . Right:
Occurrencesof the local surrounding X of a surrounded block β in the string Y , which is partitioned into blocks (grayrectangles) by the edit sensitive parsing. Although the occurrences of X can be differently blocked at theirborders, they all have a block equal to β in common. Whenever a string Y contains a repetition of a character at two adjacent positions, we cannot parse Y withthe alphabet reduction. A solution is to additionally use an auxiliary parsing specialized on repetitions ofthe same character. With this auxiliary parsing, we can partition Y into substrings, where each substring iseither parsed with the alphabet reduction, or with the auxiliary parsing. It is this auxiliary parsing wherethe aforementioned signature encoding and the ESP technique differ. The main difference is that the ESPtechnique restricts the lengths of the blocks: It first identifies so-called meta-blocks in Y , and then furtherrefines these meta-blocks into blocks of length 2 or 3. The meta-blocks are created in the following 3-stageprocess (see also Fig. 4 for an example):(1) Identify maximal regions of repeated characters (i.e., maximal substrings of the form c ‘ for c ∈ Σ and ‘ ≥ type 1 meta-blocks.(2) Identify remaining substrings of length at least 2 (which must lie between two type 1 meta-blocks).Such substrings form the type 2 meta-blocks.(3) Every substring not yet covered by a meta-block consists of a single character and cannot have type 2 J type 2 B A A type 1
K B type M
L J type 2
B I type M ab ab aaa aa aa baa aaa bab ab aaa aab blocks Y =meta-blocks Figure 4: ESP of the string Y = ababaaaaaaabaaaaabababaaaaab . The string is divided into blocksrepresented by the rectangular boxes at the bottom. Each block gets assigned a new character representedby the capital letters in the rounded boxes. The white rectangular boxes on the top level represent the meta-blocks that group the blocks. The blocks are connected with horizontal lines if they belong to a repeatingmeta-block, or by diagonal lines if they belong to a type 2 meta-block.meta-blocks as its neighbors. Such characters are fused with a neighboring meta-block. The meta-blocks emerging from this fusing are called type M (mixed).Meta-blocks of type 1 and type M are collectively called repeating meta-blocks . For (3), we are free tochoose whether a remaining character should be fused with its preceding or succeeding meta-block (bothmeta-blocks are repeating). We stick to the following tie breaking rule :Rule M: Fuse a remaining character Y [ i ] with its succeeding meta-block, or, if i = | Y | , with itspreceding meta-block.Meta-blocks are further partitioned into blocks , each containing two or three characters from Σ. Blocksinherit the type of the meta-block they are contained in. How the blocks are partitioned depends on thetype of the meta-block: Repeating meta-blocks.
A repeating meta-block is partitioned greedily: create blocks of length threeuntil there are at most four, but at least two characters left. If possible, create a single block of lengthtwo or three; otherwise (there are four characters remaining) create two blocks, each containing twocharacters.
Type-2 meta-blocks. A type 2 meta-block µ is partitioned into blocks in O ( | µ | lg ∗ σ ) time by the alphabetreduction (Lemma 4). A block β generated by the alphabet reduction is determined by the characters Y [max( b ( β ) − ∆ L , b ( µ )) .. min( e ( β )+ ∆ R , e ( µ ))] due to Lemma 5. Given the number of reduction steps r in Sect. 2.1, the alphabet reduction does not create blocks for the first r characters of each meta-block.The ESP technique blocks the first r characters in the same way as a repeating meta-block. The bordercase r = 1 (one character remaining) is treated by fusing the remaining character with the first blockcreated by the alphabet reduction, possibly splitting this block in the case that its size is four.A block is called repetitive if it contains the same characters. All blocks of a type 1 meta-block and allblocks except at most the left- or rightmost block (these blocks can contain a fused character) in a type M meta-block are repetitive.Let esp : Σ ∗ → (Σ ∪ Σ ) ∗ denote the function that parses a string by the ESP technique. We regard theoutput of esp as a string of blocks. Applying esp recursively on its output generates a context free grammar (CFG) as follows. Let h Y i := Y bea string on an alphabet Σ := Σ. The output of h Y i h := esp ( h ) ( Y ) = esp ( esp ( h − ( Y )) is a sequence of blocks,which belong to a new alphabet Σ h with h ≥
1. We call the elements of Σ h with h ≥ names , and usethe term symbol for an element that is a name or a character. A block β ∈ Σ h contains a string of symbolswith length two or three ( ∈ Σ h − ∪ Σ h − ). We maintain an injective dictionary D : Σ h → Σ h − ∪ Σ h − The original version [11] prefers the left meta-block. string ( · ) I → aab a bJ → ab abK → baa ba L → bab babN → ba ba ESP DictionaryRule string ( · ) A → aa a B → aaa a C → AA a D → BB a E → BBB a F → DD a G → NN ( ba ) H → NNN ( ba ) M → CG a ( ba ) O → ANN a ( ba ) R → JJJ ( ab ) HSP DictionaryRule string ( · ) a → aa a a → aaa a P → a J a bQ → a I a b Figure 5: Names of the ESP (Sect. 2.2) and HSP (Sect. 3) nodes stored in the global dictionary of ourexamples. The common dictionary contains all names that are used by both ESP and HSP. Each nameoccurs on the left side only once across all dictionaries.
B B B B A A N ND D C GF M τ aaa aaa aaa aaa aa aa ba ba Y = Y = Y = Y =... ∈ Σ ∗ = Σ ∈ Σ ∗ ∈ Σ ∗ ... Figure 6: The ESP tree of the string Y = aaaaaaaaaaaaaaaababa . Like in Fig. 4, nodes belonging to thesame meta-block are connected by horizontal (repeating meta-block) or diagonal ( type 2 meta-block) lines.to map a block to its symbols. The dictionary entries are of the form β → xy or β → xyz , where β ∈ Σ h and x, y, z ∈ Σ h − . We write D ( X ) := D ( X [1]) · · · D ( X [ | X | ]) ∈ Σ ∗ h − for X ∈ Σ ∗ h . Each block on height h is contained in a meta-block µ on height h −
1, which is equal to a substring h Y i h − [ i..j ] ∈ Σ ∗ h − . We call h Y i h − [ i..j ] ∈ Σ ∗ h − the symbols of µ . Since each application of esp reduces the string length by at leastone half, there is an integer k with k ≤ lg | Y | such that h Y i k = esp ( h Y i k − ) is a single block τ ∈ Σ k . Wewrite V := S ≤ h ≤ k Σ h for the set of names in h Y i , h Y i , . . . , h Y i k . The CFG for Y is represented by thenon-terminals (i.e., the names) V , the terminals Σ , the dictionary D , and the start symbol τ . This grammarexactly derives Y .Throughout this article, we comply with the convention to write symbols, i.e., characters ( ∈ Σ ) andnames ( ∈ Σ h , h ≥ ESP tree ET ( Y ) of a string Y is the derivation tree of the CFG defined above. Its root node is thestart symbol τ . The nodes on height h are h Y i h for each height h ≥
1. In particular, the leaves are h Y i .Each leaf refers to a substring in Σ or Σ . The generated substring of a node h Y i h [ i ] is the substringof Y generated by the symbol h Y i h [ i ] (applying the h -th iterate of D to h Y i h [ i ], yields a substring of Y ,i.e., D ( h ) ( h Y i h [ i ]) ∈ Σ ∗ ). We denote the generated substring of h Y i h [ i ] by string ( h Y i h [ i ]). For instance,in Fig. 6, string ( M ) = aaaababa . A node v on height h is said to be built on h Y i h − [ b..e ] iff h Y i h − [ b..e ]contains the children of v . Like with blocks, nodes inherit the type of the meta-block on which they are8 = h Y i h = h Y i h − = Y = Y µ v string (v) Figure 7: h Y i h with a highlighted node v . The subtree rooted at v is depicted by the white, rounded boxes.The generated substring string ( v ) of v is the concatenation of the white rectangular blocks on the lowestlevel in the picture. The meta-block µ , on which v is built, is the rounded rectangle covering the childrenof v and all nodes connected by a horizontal hatching on height h − ψ φ δ · · · v · · · res lic ed · · · X · · · = · · · X = · · · α β γ δ · · · (cid:15) λ · · · u · · · re sl ic ed Figure 8: Excerpts of ET ( X · · · ) ( left ) and ET ( · · · X ) ( right ) with X := resliced . Under the assumptionthat lg ∗ σ = 8, the common substring X can be blocked differently in both trees (depending on the characterspreceding X in the right figure).built. An overview of the definitions is given in Fig. 7.In what follows, we present two shortcomings of the ESP trees. The first is that nodes with differentnames can have the same generated substring, i.e., D ( h ) : Σ h → Σ ∗ is not injective for h ≥ ET ( Y ) and ET ( Z ) are equal when Y is asubstring of Z . Both cause problems when comparing subtrees of two nodes, which we later do for answeringLCE queries.Given two nodes u and v , it holds that string ( u ) = string ( v ) if their names are equal. However, the otherway around is not true in general. With string ( u ) = string ( v ), it is not even assured that u and v are nodeson the same height. Suppose that Σ is a large alphabet with lg ∗ σ = 8, and that X := resliced occurs inthe text that we parse with ESP (see Fig. 8). We parse an occurrence of X either (a) with the alphabetreduction if it is within a type 2 meta-block, or (b) greedily if it is at the beginning of a type 2 meta-block.In the former case (a), we apply the alphabet reduction and end at a reduced alphabet with the characters { , , } . Suppose that this occurrence of X is reduced to the string in superscript of r e s l i c e d . Then ESPcreates the four blocks re | sl | ic | ed , whose boundaries are determined by the alphabet reduction. Furthersuppose that an application of esp creates two nodes of these blocks, which are put into a node u by anadditional parse such that string ( u ) = X . In the latter case (b), ESP creates the three blocks res | lic | ed greedy. Suppose that an additional parse puts these blocks in a node v such that string ( v ) = X . Although string ( v ) = string ( u ), the children of both nodes have different names, and therefore, both nodes cannothave the same name.The second shortcoming is that it is not clear how to transfer the property of the alphabet reductiondescribed in Lemma 5 from blocks to nodes. Given a substring Y of a string Z , the task is to analyze whethera node h Y i h [ i ] is also present in the tree ET ( Z ), i.e., we analyze changes of a node h Y i h [ i ] when prepending orappending (pre-/appending) characters to Y . For the sake of analysis, we distinguish the two terminologies block and node , although a node is represented by a block: When we analyze a block in esp ( X ) ∈ Σ ∗ h for a string X ∈ Σ ∗ h − , we let X to be subject to pre-/appending characters of Σ h − , whereas when weanalyze a node h Y i h [ i ] on a height h of ET ( Y ) of a string Y ∈ Σ ∗ , we let Y to be subject to pre-/appending9 B B B B B A A N N N N NE E C H Gaaa aaa aaa aaa aaa aaa aa aa ba ba ba ba baB B B B B B B A N N N N NE D D O Haaa aaa aaa aaa aaa aaa aaa aa ba ba ba ba ba Y = a Y = prepend a Figure 9: Excerpt of ET ( Y ) and ET ( a Y ) (higher nodes omitted), where Y = a k +4 ( ba ) k − = a ( ba ) for k = 2. For all k ≥
2, there is a unique node in h Y i with the name C . This name does not appear in ET ( a Y ).characters of Σ. In this terminology, a block in esp ( X ) is only determined by X , whereas h Y i h [ i ] is not onlydetermined by esp ( h − ( Y ) ∈ Σ ∗ h − , but also by Y itself. The difference is that a surrounded type 2 blockof esp ( X ) cannot be changed by pre-/appending characters to X due to Lemma 5, whereas we fail to findintegers ∆ L ,h and ∆ R ,h such that a type 2 node on height h built on h Y i h − [ ∆ L ,h .. ∆ R ,h ] cannot be changed bypre-/appending characters to Y . That is because the names inside h Y i h − and h a Y i h − for h ≥ Y := a k +4 ( ba ) k − with the names defined in Fig. 5, we obtain esp ( esp ( Y )) = esp ( B k AAN k − ) = E k CH k − G . Let us focus onthe unique occurrence of the name C , which is depicted in Fig. 9 for k = 2. On the one hand, there is a blockrepresenting the name C on height two. This block is surrounded for a sufficiently large k . Even for k ≥
1, itis easy to see that there is no way to change the name of this block by pre-/appending characters to the string B k AAN k − . On the other hand, there is a unique node in ET ( Y ) with name C on height two. Regardless ofthe value of k , prepending a to Y changes the name of v : esp ( esp ( a Y )) = esp ( B k +1 AN k − ) = E k − DDOH k − .Nevertheless, we introduce the notion of surrounded nodes, since they are helpful to find rules that determinethose nodes that cannot be changed by pre-/appending characters. v height h + 1 ∆ L ∆ R ∆ L ∆ R Surrounded Nodes.
Analogously to blocks we classify nodes assurrounded when they are neighbored by sufficiently many nodes:A leaf is called surrounded iff its generated substring is surrounded.The local surrounding of a leaf is the local surrounding of the blockrepresented by the leaf. Given an internal node v on height h + 1( h ≥
1) whose children are h Y i h [ β ], the local surrounding of v is the union of the nodes h Y i h [ b ( β ) − ∆ L .. e ( β ) + ∆ R ] and the local surrounding of each node in h Y i h [ b ( β ) − ∆ L .. e ( β ) + ∆ R ]. If all nodes inthe local surrounding of v are surrounded, we say that v is surrounded . Otherwise, we say that v is non-surrounded . Lemma 6.
There are at most ∆ L + ∆ R many non-surrounded nodes on each height, summing up to O (lg ∗ n lg n ) non-surrounded nodes in total. Proof.
We show that a node v on height h is surrounded if it has ∆ L preceding and ∆ R succeeding nodes. This is clear on height one by definition. Under the assumptionthat the claim holds for height h − v ’s preceding (resp. succeeding) nodes have atleast 2 ∆ L (resp. 2 ∆ R ) children in total, where at least the ∆ L rightmost nodes (resp. ∆ R leftmost nodes) are surrounded by the assumption. Hence, v is surrounded. v ≥ ∆ L ≥ ∆ R ≥ ∆ L ≥ ∆ R The above example contrasting blocks and nodes reveals that the property for surrounded blocks asshown on the right side of Fig. 3 cannot be transferred to surrounded nodes directly, since a surroundednode depends not only on its local surrounding, but also on the nodes on which it its built. Despite thisdiscovery, we show that surrounded nodes can help us to create rules that are similar to Lemma 5.10
B B B B B A A N N N N NE E C H Gaaa aaa aaa aaa aaa aaa aa aa ba ba ba ba baB B B B B B A A N N N N NE E C H Gaaa aaa aaa aaa aaa aaa aa aa ba ba ba ba baB B B B B B A A N N N N NE E C H Gaaa aaa aaa aaa aaa aaa aa aa ba ba ba ba ba Y = Figure 10: ET ( Y ) of Fig. 9 with fragile, semi-stable and stable nodes highlighted. The fragile nodes arecross-hatched, the semi-stable nodes are dotted, and the stable nodes have stars attached. The leftmostnodes of the tree change their names when prepending a b . When prepending a ’s, we observe that thechildren of the node with name C change. Assuming that Σ = { a , b } (and hence | Σ | = 2), only the rightmostnode of the meta-block containing nodes with name N is fragile. We now analyze which nodes of ET ( Y ) are still present in ET ( XY Z ) for all strings X and Z . A node h Y i h [ j ]in ET ( Y ) at a height h is said to be stable iff, for all strings X and Z , there exists a node h XY Z i h [ k ]in ET ( XY Z ) with the same name as h Y i h [ j ] and | X | + P j − i =1 | string ( h Y i h [ i ]) | = P k − i =1 | string ( h XY Z i h [ i ]) | .We also consider repeating nodes that are present with slight shifts; a non-stable repeating node h Y i h [ j ] in ET ( Y ) is said to be semi-stable iff, for all strings X and Z , there exists a node h XY Z i h [ k ] in ET ( XY Z )with the same name as h Y i h [ j ] and P k − i =1 | string ( h XY Z i h [ i ]) | − | S | < | X | + P j − i =1 | string ( h Y i h [ i ]) | < P k − i =1 | string ( h XY Z i h [ i ]) | + | S | , where S = string ( h Y i h [ j ]) = string ( h XY Z i h [ k ]).Nodes that are neither stable nor semi-stable are called fragile . By definition, the children of the(semi-)stable nodes (resp. fragile nodes) are also (semi-)stable (resp. fragile). Figure 10 shows an example,where all three types of nodes are highlighted. The rest of this section studies how many fragile nodes existin ET ( Y ).As a warm-up, we first restrict the ESP tree construction on strings that are square-free. A string Y is square-free iff there is no substring of Y occurring consecutively twice. Since a name of the ESP treeis determined by its generating substring, ET ( Y ) cannot contain two consecutive occurrences of the samename on any height. We conclude that ET ( Y ) has no repeating nodes, i.e., it consists only of type 2 nodes.When studying the stability of type 2 nodes, the following lemma is especially useful: Lemma 7 ([11, Lemma 8]) . A type 2 node is stable if (a) it is surrounded and (b) its local surroundingdoes not contain a fragile node.With Lemma 7 we immediately obtain: Lemma 8.
Given a square-free string Y , a fragile node of ET ( Y ) is a non-surrounded node. Proof.
According to Lemma 7, we can bound the number of fragile nodes by the number of those nodes thatdo not satisfy the conditions in Lemma 7. Since ET ( Y ) only contains type 2 nodes, we can show that afragile node is non-surrounded inductively for all heights of the ESP tree: Since leaves do not contain anynodes in their subtrees, surrounded leaves are stable due to Lemma 5. Therefore, the claim holds for h = 1.By definition, a node v on height h is surrounded if its local surrounding S on height h − h −
1, a node in S can only be fragile if it is not surrounded. This concludesthat v can be fragile only if it is not surrounded.Combining Lemma 8 with Lemma 6 yields: Corollary 9.
The number of fragile nodes of an ESP tree built on a square-free string of length n is O (lg ∗ n lg n ). On each height, it contains O (lg ∗ n ) fragile nodes.In Appendix A, we show that Cor. 9 cannot be generalized for arbitrary strings. There we show that theESP technique changes Ω(lg n ) nodes when changing a single character of a specific example string. A new upper bound.
With the examples in the appendix, we conclude that the O (lg ∗ n lg n )-bound onthe number of fragile nodes for square-free strings (Lemma 8) does not hold for general strings. To obtain a11 · · · · ·· · · ∆ L ∆ R · · · · · ·· · · surrounded nodes fragile Figure 11: Division of an ESP tree in surrounded and fragile nodes. The surrounded nodes form an innercone. Neighboring fragile blocks can appear in the non-surrounded areas. On each height, the ESP tree canhave a constant number of fragile surrounded nodes that do not have fragile nodes in their subtrees.general upper bound (we stick again to Rule M), we include the repeating meta-blocks in our study of fragilenodes. Fragile nodes can now be surrounded (trees of square-free strings do not have fragile surroundednodes according to Lemma 8). Remembering that a node is fragile if it has a fragile child, a consequenceis that a fragile type 2 node is not necessarily non-surrounded (e.g., one of its children can be a fragilesurrounded repeating node). Figure 11 sketches the possible occurrences of fragile surrounded nodes. A firstresult on a special case is given in the following lemma:
Lemma 10.
A surrounded node v is contained in the local surroundings of O (lg ∗ n lg n ) nodes. Given thatall those nodes are of type 2 , a change of v causes O (lg ∗ n lg n ) name changes. Proof.
We follow [11, Proof of Lemma 9]: We count the number of nodes that contain v in its local surround-ing. Given that v is a node on height i and u is v ’s parent, then vu height i ∆ R ∆ R ∆ L ∆ L there are at most ∆ R / ≤ ∆ R nodes preceding u and ∆ L / ≤ ∆ L nodes succeeding u that have v in its local surrounding. We countone on height i , and ( ∆ L + ∆ R + 1) / i + 1. Since the counted nodes on height i + 1 are consec-utive, there are at most ( ∆ L + ∆ R + 1) / i + 1.Consequently, there are at most ( ∆ L + ∆ R + 1) / ∆ L + ∆ R nodes on height i + 2 that have v in their localsurroundings. Iterating over all heights gives an upper bound of ( ∆ L + ∆ R +1) P lg n − ih =0 / h ≤ ∆ L + ∆ R +1)nodes on each height.Second, we narrow down the fragile blocks in repeating meta-blocks. The first block (cf. Fig. 12) andthe two rightmost blocks (cf. Fig. 13) of a repeating meta-block can be fragile. Due to the greedy parsing,all other blocks of a repeating meta-block are (semi-)stable. A repeating meta-block containing fragile surrounded blocks needs to start very early or end within the last symbol, as can be seen by the followinglemma: Lemma 11.
A repeating meta-block µ of esp ( Y ) with b ( µ ) ≥ e ( µ ) ≤ | Y | − B · · · B A type 1 aaa aaa · · · aaa aaK B · · ·
B B type M baa aaa · · · aaa aaaJ type 2 B · · · B B A type 1 ab aaa · · · aaa aaa aaA type 1 K · · · B B B type M aa baa · · · aaa aaa aaa a k = ba k = aba k = aaba k = Figure 12: Prepending the string aab to the text a k character by character. Each step is given as a row,in which we additionally computed the ESP of the current text. The last row shows an example, where aformer type 1 meta-block changes to type M , although it is right of a type 2 meta-block. Here, k mod 3 = 2. Proof.
Since b ( µ ) ≥
4, there are at least three symbols before µ that are assigned to one or more other meta-blocks. When prepending symbols, those meta-blocks can change, absorbing the new symbols or giving theleftmost symbol away to form a type 2 meta-block. In neither case, they can affect the parsing of µ , since µ is parsed greedily. Similarly, the succeeding meta-blocks of µ keep µ ’s blocks from changing when appendingsymbols. See Fig. 14 for a sketch. Corollary 12.
The edit sensitive parsing introduces at most two fragile surrounded blocks. These blocksare the two rightmost blocks of a repeating meta-block whose leftmost block is not surrounded.
Lemma 13.
Changing the symbol in a substring of h Y i h − on which a repeating node on height h is builtchanges O (1) names on height h . Proof.
Let u be a repeating node on height h . Since it is repeating, it is built on a substring X := h Y i h − [ b ( X ) .. e ( X )] of a repeating meta-block µ = h Y i h − [ b ( µ ) .. e ( µ )] with D ( u ) = X . Now change asymbol in X , say h Y i h − [ i u ] with b ( X ) ≤ i u ≤ e ( X ). This causes the name of u to change. Addition-ally, it causes the meta-block µ to split into a repeating meta-block h Y i h − [ b ( µ ) ..i u −
1] and a type M meta-block h Y i h − [ i u .. e ( µ )], causing the names of the two rightmost nodes built on the new meta-blocks tochange. Altogether, there are O (1) name changes on height h .An easy generalization of Lemma 13 is that changing k consecutive nodes on height h − h changes O ( k ) names on height h . With Lemma 13, the followinglemma translates the result of Cor. 12 for blocks to nodes: Lemma 14.
The ESP tree ET ( Y ) of a string Y of length n has O (lg n lg ∗ n ) fragile nodes, and O ( h lg ∗ n )fragile nodes on height h . Proof.
While computing h Y i h +1 from h Y i h , the ESP technique introduces O (1) fragile surrounded blocksaccording to Cor. 12. Each fragile surrounded block corresponds to a fragile surrounded node.13 · · · B B Baaa · · · aaa aaa aaaB · · ·
B B A Aaaa · · · aaa aaa aa aaB · · ·
B B B Aaaa · · · aaa aaa aaa aaB · · ·
B B B Baaa · · · aaa aaa aaa aaa a k +0 = a k +1 = a k +2 = a k +3 = Figure 13: Greedy blocking of a type 1 meta-block. The greedy blocking is related to the Euclidean divisionby three. The remainder k mod 3 is determined by the number of symbols in the last two blocks (here, k mod 3 = 0). In this example, the ESP technique creates a single, repeating meta-block on each input. R E · · ·
E D D GJJJ BBB · · ·
BBB BB BB NN b ( µ ) ≥ µ height h + 1height hY = e ( µ ) ≤ | Y | − Figure 14: Setting of Lemma 11. According to Lemma 11, a meta-block µ in esp ( Y ) of a string Y cannotcontain a surrounded fragile block if b ( µ ) ≥ e ( µ ) ≤ | Y | − O (lg ∗ n lg n )nodes fragile. Although we considered only type 2 nodes in Lemma 10, we can generalize this result for allfragile nodes with Lemma 13.To sum up, there are O ( h lg ∗ n ) fragile nodes on height h . Because ET ( X ) has a height of at most lg n ,there are O (lg ∗ n P lg nh =1 h ) = O (lg ∗ n lg n ) fragile nodes in total.Showing that the number of fragile nodes is indeed larger than assumed makes ESP trees a more unfa-vorable data structure, since fragile nodes are cumbersome when comparing strings with ESP trees as donein [11]. Fortunately, we can restore the claimed number of O (lg n lg ∗ n ) fragile nodes for a string of length n with a slight modification of the parsing, as shown in the following section. Our modification, which we call hierarchical stable parsing or HSP , augments each name with a sur-name and a surname-length , whose definitions follow: Given a name Z ∈ Σ h , let h with 0 ≤ h ≤ h be the largest integer such that D ( h ) ( Z ) consists of the same symbol, say D ( h ) ( Z ) = Y ‘ ∈ Σ ∗ h − h for asymbol Y ∈ Σ h − h and an integer ‘ ≥
1. Then the surname and surname-length of Z are the symbol Y andthe integer ‘ , respectively. For convenience, we define the surname of a character to be the character itself.Then all symbols in D ( j ) ( Z ) for every j with 1 ≤ j ≤ h share the same surname with Z .Having the surnames of the nodes at hand, we present the hierarchical stable parsing. It differs to ESP inhow a string of names is partitioned into meta-blocks, whose boundaries now depend on the surnames: Whenfactorizing a string of names into meta-blocks, we relax the check whether two names are equal; instead of By definition, the surname of Z is Z itself if ‘ = 1. a a a N N N N Na a N N a N α aaa aaa aaa aa ba ba ba ba baa ( ba ) = Figure 15: Hierarchical stable parsing. The repeating meta-blocks are determined by the surnames. a a a · · · a a a a a a N N · · · a · · · a a N aaa aaa aaa · · · aaa aaa aaa aaa aa aa ba ba · · · a a a · · · a a a a a a N N · · · a · · · a a N aaa aaa aaa · · · aaa aaa aaa aaa aaa aa ba ba · · · Y = a Y = prepend a Figure 16: Excerpt of HT ( Y ) and HT ( aY ) (higher nodes omitted), where Y = a k ( ba ) k with k = 18 + 9 i + 7for an integer i ≥ k ≥ Y creates a repeating meta-block consisting of a k , and a type 2 meta-block consisting of ( ba ) . For k ≥ This means that we allow meta-blocks of type 1 to containdifferent symbols as long as all symbols share the same surname. The other parts of the edit sensitive parsingdefined in Sect. 2.2 are left untouched; in particular, the alphabet reduction uses the symbols as before. Wewrite HT ( Y ) for the resulting parse tree, called HSP tree , when the HSP technique is applied to a string Y .Figure 15 shows HT ( a ( ba ) ). In the rest of this article (and as shown in Fig. 15), we give a repetitive nodewith surname Z and surname-length ‘ the name Z ‘ . We omit the surname-length if it is one (and thus, thelabel of a non-repetitive node is equal to its name). For the other nodes, we use the names of Fig. 5. We cando that because the name of a node can be identified by its surname and surname-length, as can be seen bythe following lemma: Lemma 15.
The name of a node is uniquely determined by its surname and surname-length.
Proof.
A node with surname-length one is not repetitive, and therefore, its name is equal to its surname.Given a repetitive node v with surname Z and surname-length ‘ , there is a height h such that D ( h ) ( v ) = Z ‘ .For every height h with 1 ≤ h ≤ h , D ( h ) ( v ) consists of the same symbol, and hence D ( h ) ( v ) is parsedgreedily by HSP. This means that the iterated greedy parsing of the string Z ‘ determines the name of v . The motivation of introducing the HSP technique becomes apparent with the three following facts:Fact 1: Given that the surnames of the repetitive nodes in a repeating meta-block µ are w , the generatedsubstring of each such repetitive node is a repetition of the form X k with the same X = string ( w ) ∈ The check is relaxed since names with different surnames cannot have the same name. J J J J J J JJ J J J ab ab ab ab ab ab ab ab height h J J J J J J J J JJ J J J ab ab ab ab ab ab ab ab ab prepend ab Y = ab Y = v v Figure 17: Comparison of HT ( Y ) and HT ( ab Y ) = HT ( Y ab ), where Y = ( ab ) . The node v with name J issemi-stable. Its generated substring shifts with a length of | string ( J ) | = 2. a a a a a a J a a N Na a P a N a βρ aaa aaa aaa aaa aaa aaa ab aaa aa ba baa a a a a a I a a N Na a Q a N a γϑ aaa aaa aaa aaa aaa aaa aab aaa aa ba ba Y = a Y = prepend a a a a a a a a K a N Na a a δ N a φω aaa aaa aaa aaa aaa aa aa baa aaa ba baa a a a a a a K a N Na a a δ N a φψ aaa aaa aaa aaa aaa aaa aa baa aaa ba ba Y = a Y = prepend a Figure 18: Impact of the tie breaking rule (Rule M) on emerging type M nodes. A type M node is created byfusing a single symbol with its sibling meta-block. Remember that Rule M prescribes to fuse the symbol withits right meta-block. To see why this rule is advantageous, the HSP trees on the left (resp. right ) use the tiebreaking rule choosing the left (resp. right ) meta-block. While on the right side only the fragile nodes of theleftmost meta-blocks on each height differ after prepending a (e.g., the unique occurrence of a changes to a ), the change is more dramatical on the left side. In the top left tree, which is built on Y = a ba ( ba ) ,the two rightmost nodes a and J of the type M meta-block on the bottom level are children of the leftmostnode P of the right meta-block on the next level. Prepending the character a to Y ( bottom left ) changes thenames of the nodes with names J and P to I and Q , respectively.Σ ∗ (or X = w in case w ∈ Σ), but with possibly different surname-lengths k (e.g., string ( N ) = ( ba ) and string ( N ) = ( ba ) in Fig. 15). Due to the greedy parsing of the repeating meta-blocks, thesurname-lengths of the last two nodes in µ cannot be larger than the surname-lengths of the generatedsubstrings of the other nodes (with the same surname) contained in µ . See Fig. 16 for an examplewhen prepending a character to the input.Fact 2: The shift of a semi-stable node is always a multiple of the length of its surname (recall that semi-stable nodes are defined like stable nodes, but with slight shifts, cf. Sect. 2.4): Let J be the surnameof a semi-stable node v ∈ h Y i h on height h . Given J ∈ Σ h for a height h with h ≥ D ( h − h ) ( v ) isa repetition of the symbol J on height h . A shift of v can only be caused by adding one or more J sto h Y i h . In other words, the shift is always a multiple of D ( h ) ( J ). Figure 17 shows an example ofa semi-stable node v .Fact 3: A non-repetitive type M block can be fragile only if it is non-surrounded. By definition, a repeatingmeta-block µ contains a non-repetitive block β iff µ is type M . The block β can only be located atthe beginning or ending of µ . Remembering Rule M, β ’s none-repetitiveness is caused by • fusing a symbol with its succeeding meta-block, or • fusing the last symbol with its preceding meta-block.In both cases, it is impossible that β is a surrounded block if b ( µ ) ≤ ∆ L . If β is surrounded, itis (semi-)stable due to Lemma 11. Note that with sticking to the choice made in Rule M, we alsoexperience a more stable behavior like in Fig. 18.16 B B B A A N ND D C GF M τ aaa aaa aaa aaa aa aa ba ba B B B B B A N NE D O δ aaa aaa aaa aaa aaa aa ba baa a a a a a N Na a N (cid:15) aaa aaa aaa aaa aa aa ba ba a a a a a a N Na a N λ aaa aaa aaa aaa aaa aa ba ba prepend a prepend a Y = a Y = Y = a Y = Figure 19:
Left: ET ( Y ) ( top ) and HT ( Y ) ( bottom ) of the string Y defined in Fig. 6. Right: ET ( a Y ) ( top )and HT ( a Y ). Unlike the two ESP trees on the top, the two HSP trees below share the same tree topology. ψ u · · · a v a N · · · LJ a a a · · · a a a a a a a NN · · ·≥ ∆ L b ( µ ) ≤ µ height h + 1height h Figure 20: Setting of Cor. 16. According to Lemma 11, a meta-block µ can contain a surrounded fragile blockif b ( µ ) ≤ v is fragile, since prepending L changes its name. Accordingto Cor. 16, there is a non-surrounded node u whose generated substring has the generated substring of v asa prefix.These facts make the HSP technique more stable than the ESP technique, as can be seen in Fig. 19, forinstance. In the following, we study the number of fragile surrounded nodes (like in Sect. 2.4 for the ESPtrees), and show the invariant (Claim 3 in Lemma 18) that the generated substring of a fragile surroundednode is always the prefix of the generated substring of a name that is already stored in D . On block level,this is an easy conclusion of Lemma 11 and Facts 1 and 3. Corollary 16.
Given n > µ having a fragile surrounded block β , µ has atleast one block preceding β that contains three symbols with the same surname. In particular, the leftmostof these preceding blocks is non-surrounded. Proof.
Since β is surrounded, the condition | µ | ≥ ∆ L − ∆ L in Lemma 5, ∆ L − ≥ n >
4. Assuming that the repetitive blocks in µ have the surname Z , this means that thereis at least one repetitive block γ with surname Z preceding β that contains three symbols of µ . But thefragile surrounded block β is also a repetitive block according to Fact 3. This means that the surname-lengthof β is at most as long as the surname-length of γ due to Fact 1, i.e., the generated substring of the nodecorresponding to β is a prefix of the generated substring of the node corresponding to γ . Let γ be theleftmost such block. Remembering that µ can start with a non-repetitive node in case that µ is of type M , itis not obvious that γ is non-surrounded. However, according to Lemma 11, b ( µ ) ≤ b ( γ ) ≤ ≤ ∆ L , so γ is non-surrounded. See Fig. 20 for a sketch (with Z = a ).In general, the aforementioned invariant does not hold for ESP trees, but is essential for the sparsesuffix sorting in text space. There, our idea is to create an HSP or ESP tree on a newly found re-occurringsubstring. We would like to store the ESP tree in the space of one of those substrings, which we can doby truncating the tree at a certain height (removing the lower nodes), and changing the pointer of each17 B B B A A N ND D C GF M τ aaa aaa aaa aaa aa aa ba ba · · · a a a a a a b b a b a a · · · string ( G ) string ( C ) string ( D )search for string ( O ) = aababa Y = T = B B B B B A N NE D O δ aaa aaa aaa aaa aaa aa ba ba prepend aa Y = Figure 21: Problem with dynamic updates of ESP trees stored in text space. Truncated nodes are grayedout. Each leaf of the truncated trees is assigned a pointer to its generated substring, which are substringsof the text T ( left ). Suppose that we have built ET ( Y ) ( top right ) on a substring Y of T ( Y defined as inFig. 19), and that the names D , C and G are already present in the dictionary (hence, they have differentgenerated substrings). Further suppose that the space of Y in T has been overwritten. When prependingan a to ET ( Y ) to form ET ( a Y ) ( bottom right ), the node G changes to O , for which we need to search itsgenerated substring (assuming that O is not yet stored in the dictionary). The example can be elaboratedsuch that G and O become surrounded nodes (prepend a k and append b k for a sufficiently large k ≥ Lemma 17.
There is no surrounded HSP node v whose name changes when appending characters. Proof.
Assume that v ’s name changes on appending characters. Moreover, assume that v ’s local surroundingdoes not contain a fragile node (otherwise swap v with this node). First, since there is no fragile node in v ’slocal surrounding, it has to be a repeating node according to Lemma 7. Second, according to Cor. 12, it hasto be one of the last two nodes built on a repeating meta-block µ . But there is no way to change the namesof the last two blocks of µ by appending characters unless these blocks are non-surrounded. So a surroundednode cannot have a node in its surrounding whose name changes when appending characters. Lemma 18.
Let v be a fragile surrounded node of an HSP tree. ThenClaim 1: v is a repetitive node,Claim 2: pre-/appending characters cannot change v ’s surname, andClaim 3: the generated substring of v is always a prefix of the generated substring of an already existingnode belonging to the same meta-block as v . Proof.
To show the lemma, let n > ∆ L + ∆ R , otherwise there are no surrounded nodes. There are two(non-exclusive) possibilities for a node to be fragile and surrounded: • it belongs to the last two nodes built on a repeating meta-block (due to Cor. 12), or18 ( Z ) · · · u ( Z ) v ( Z ) µ · ( Z ) u ( Z ) ∆ L w ( Z ) · ( Y ) · ( Y ) · · · · ( Y ) ν · ( Y ) height h + 1height h + 1 Figure 22: Sketch of the HSP tree used to show Lemma 18. In the sketch, we give the repetitive nodesof the meta-block ν the surname Y . Repetitive nodes are labeled with their surnames, which are put intoparentheses. • its subtree contains a fragile surrounded node, since by definition, – a node is fragile if it contains a fragile node in its subtree, and – all nodes in the subtree of a surrounded node are surrounded.We iteratively show the claim for all heights, starting at the bottom: Let v be one of the lowest fragilesurrounded nodes in HT ( Y ) ( lowest meaning that there is no fragile node in v ’s subtree). Suppose that v is a node on height h + 1 with h ≥
0. Since there is no fragile surrounded node in v ’s subtree, v is one ofthe last two nodes built on a repeating meta-block h Y i h [ µ ] (i.e., Y [ µ ] for h = 0). Due to Fact 3, Claim 1holds for v ; let Z be its surname. Since v is fragile, b ( µ ) ≤ v is surrounded, there is a repetitive node u with surname Z preceding v that is builton three symbols ( D ( u ) ∈ Σ h ) of µ due to Cor. 16. In particular, the leftmost repetitive node s of µ is notsurrounded.We only consider prepending a character (appending is already considered in Lemma 17). Assume that v ’s name changes when prepending a specific character. By Fact 1, the HSP technique assigns a new nameto v , but it does not change its surname (so Claim 2 holds for v ). Additionally, string ( v ) is a substring of string ( u ), where u is one of v ’s preceding nodes having the surname Z , and therefore Claim 3 holds for v .For example, let v be the node with name a in HT ( Y ) of Fig. 16, then string ( v ) = a , which is a prefix of string ( a ) = a . After prepending the character a , v ’s name becomes a with string ( v ) = a . Still, string ( v )is a prefix of string ( a ).Due to this behavior, the node v is always assigned to µ , regardless of what character is prepended. Thismeans that it is only possible to extend or shorten µ on its left side, or equivalently, µ ’s right end is fixed ;the parsing of a meta-block succeeding µ cannot change. This means that the parsing assures that everysurrounded node located to the right of h Y i h [ µ ] is (semi-)stable. We conclude that the claim holds for theheights 1 , . . . , h + 1. 19ext, we show that the claim holds for all height h + 2 , . . . , h , where h + 1 is the height of the lowestcommon ancestor w of s and v . Figure 22 gives a visual representation of the following observations: Whenfollowing the nodes from v up to w , there is a path of ancestor nodes with surname Z . Except for w , eachsuch ancestor node u has a neighbor with surname Z . On changing the name of v , all nodes on the heightof u are unaffected, except u . That is because the ancestor of s on the same height as u is put with u in thesame repeating meta-block, which comprises all neighboring nodes with surname Z . By the analysis above,changing the name of u cannot change the parsing of the other nodes on the same height. We conclude thatthe claim holds for the heights h + 2 , . . . , h .Let us focus on the nodes on height h + 1: The node w is not surrounded, because it contains thenon-surrounded node s in its subtree. Having neighbors with different surnames, w is either blocked in a type 2 or type M meta-block. • In the former case ( type 2 ), the analysis of Lemma 10 shows that w only affects the parsing of thenon-surrounded nodes. There can be a non-surrounded meta-block on a height h > h + 1 having afragile surrounded node v . But then v cannot contain a fragile node (the descendants of w are the lastfragile surrounded nodes, and w is non-surrounded). This means that we can apply the same analysisto v as for v . • In the latter case ( type M ), w is fused with a repeating meta-block to form a type M meta-block ν ,changing the names of the leftmost and two rightmost nodes of ν , where the leftmost node is w .Assume that the two rightmost nodes of ν are fragile and surrounded (otherwise we conclude with theprevious case that there are no fragile surrounded nodes on height h + 1). Under this assumption,the rightmost nodes of ν are repeating nodes due to Fact 3. Hence, we can apply the same analysis asfor v , and conclude the claim for all heights above h .A direct consequence is that there are O (1) fragile surrounded nodes on each height. With Lemma 14we get the following theorem: Theorem 19.
The HSP tree HT ( Y ) of a string Y of length n contains at most O (lg ∗ n ) fragile nodes oneach height.Having a bound on the number of fragile nodes, we start to study the algorithmic operations of anHSP tree. The first operation is how to actually build an HSP tree. For that, we have to think about itsrepresentation: Unlike Cormode and Muthukrishnan, who use hash tables to represent the dictionary D , we follow a deter-ministic approach. In our approach, we represent D by storing the HSP tree as a CFG. A name (i.e., anon-terminal of the CFG) is represented by a pointer to a data field (an allocated memory area), which iscomposed differently for leaves and internal nodes: Leaves.
A leaf stores a position i and a length ‘ ∈ { , } such that Y [ i..i + ‘ −
1] is the generated substring.
Internal nodes.
An internal node stores the length of its generated substring, and the names of its children.If it has only two children, we use a special, invalid name ⊥ for the non-existing third child such thatall data fields are of the same length.This information helps us to navigate from a node to its children or its generated substring in constant time,and to navigate top-down in the HSP tree by traversing the tree from the root in time linear in the heightof the tree.To accelerate substring comparisons, we want to give nodes with the same children (with respect to theirorder and names) the same name, such that the dictionary D is injective. To keep the dictionary injective,we do the following: Before creating a new name for the rule b → xyz (we set z = ⊥ if the rule is b → xy ),we check whether there already exists a name for xyz . To perform this lookup efficiently, we need also the20 T ( X ) HT ( Y ) i Y i X uv T X T Y Figure 23: Conception of the proof of Lemma 21. To compute the longest common prefix of X [ i X .. ] and Y [ i Y .. ] (arrow in the center), we walk down the trees HT ( X ) and HT ( Y ) (depicted by the upper and thelower triangle, respectively) on the paths towards the leaves containing X [ i X ] and Y [ i Y ], respectively, bysimultaneously visiting two nodes on the same height of both trees. The nodes u and v in the figure areon these paths. Suppose that they are on the same height and have the same surname. On visiting bothnodes, we know that the longest common prefix is at least min( | string ( u ) | , | string ( v ) | ) long. We update thedestination of our traversal accordingly, such that we follow the paths from u and v to the leaves coveringthe not-yet checked parts of the longest common prefix that we want to compute. reverse dictionary of D , with the right hand side of the rules as search keys. We want the reverse dictionaryto be of size O ( | Y | ), supporting lookup and insert in O ( t look ) (deterministic) time for a t look = t look ( n )depending on n . For instance, a balanced binary search tree has t look = O (lg n ).With this tree representation, we can build HSP trees within the following time and space bounds: Lemma 20.
The HSP tree HT ( Y ) of a string Y of length n can be built in O ( n (lg ∗ n + t look )) time. It takes O ( n ) words of space. Proof.
A name is inserted or looked-up in t look time. Due to the alphabet reduction technique (see Lemma 4),applying esp on a substring of length ‘ takes O ( ‘ lg ∗ n ) time, returning a sequence of blocks of length atmost ‘/ Like the trees [1, 32] based on signature encoding, we show that HSP trees are good at answering LCE queries.The idea is to compare the names of two nodes to test whether the generated substrings of both nodes are thesame. Remembering that two nodes with the same generated substring can have different names (cf. the endof Sect. 2.3), we want to have a rule at hand saying when two nodes with different names must have differentgenerated substrings. It is easy to provide such a rule when the input string is square-free: In this case, allfragile nodes are non-surrounded according to Lemma 8, and thus we know that the surrounded nodes arestable. Since each height consists of exactly one type 2 meta-block, the equality of two substrings X and Y can be checked by comparing the names of two surrounded nodes whose generated substrings are X and Y , respectively. For general strings, we need additional information about the generated substring of eachrepeating node. That is because the names of two repeating nodes at the same height already differ when thegenerated substring of one node is a proper prefix of the generated substring of the other node. Fortunately,this additional information is given by the surnames and surname-lengths (see Fact 2 in Sect. 3.1):Having a common dictionary D for all HSP trees that stores the length of the string D ( h ) ( Z ) for eachname Z ∈ Σ h , we explain how HSP trees can answer LCE queries efficiently. Lemma 21.
Given HT ( X ) and HT ( Y ) built on two strings X and Y with | X | ≤ | Y | ≤ n and two textpositions 1 ≤ i X ≤ | X | , ≤ i Y ≤ | Y | , we can compute lcp( X [ i X .. ] , Y [ i Y .. ]) in O (lg n lg ∗ n ) time.21 roof. We use the following property: If two nodes have the same surname Z , then the generated substringsof both nodes are Z i and Z j , respectively, with the respective surname-lengths i and j , where Z = string ( Z ).This means that the generated substring of one node is a prefix of the generated substring of the other. Inthe particular case i = j , both nodes share the same subtree and consequently have the same name accordingto Lemma 15. In summary, this property allows us to omit the comparison of the subtrees of two nodeswith the same surname, and thus speeds up the LCE computation, which is done in the following way (cf.Fig. 23):(1) We start with traversing the two paths from the roots of HT ( X ) and HT ( Y ) to the leaves λ X and λ Y whose generated substrings contain h X i [ i X ] and h Y i [ i Y ], respectively:(2) We traverse the two paths leading to the leaves λ X and λ Y , respectively, in a simultaneous mannersuch that we always visit a pair ( u, v ) of nodes on the same height belonging to HT ( X ) and HT ( Y ),respectively.(3) Given that u and v share the same surname Z ∈ Σ h , we know the lengths of their generated substrings( (cid:12)(cid:12) D ( h ) ( Z ) (cid:12)(cid:12) ‘ u and (cid:12)(cid:12) D ( h ) ( Z ) (cid:12)(cid:12) ‘ v ) by having their surname-lengths ‘ u and ‘ v at hand. As a consequence,we know that X [ i X .. ] and Y [ i Y .. ] have a common prefix of at least min( (cid:12)(cid:12) D ( h ) ( Z ) (cid:12)(cid:12) ‘ u , (cid:12)(cid:12) D ( h ) ( Z ) (cid:12)(cid:12) ‘ v ).We update the variables λ X and λ Y to be the leaves whose generated substrings contain h X i [ i X + (cid:12)(cid:12) D ( h ) ( Z ) (cid:12)(cid:12) ‘ u ] and h Y i [ i Y + (cid:12)(cid:12) D ( h ) ( Z ) (cid:12)(cid:12) ‘ v ], respectively. Subsequently, we continue our tree traversalsfrom u and v to the updated destinations ‘ X and ‘ Y , respectively. Since λ X and λ Y are not in therespective subtrees of u and v , we climb up the tree to the lowest common ancestor of u (resp. v )and λ X (resp. λ Y ), and recurse on (2).(4) If we end up at a pair of leaves (i.e., u = λ X and v = λ Y ), we compare their generated substringsnaively. If we find a mismatching character in both generated substrings, we can determine the valueof ‘ and terminate. We also terminate if there is no mismatch, but λ X or λ Y is the rightmost leaf of HT ( X ) or HT ( Y ), respectively. In all other cases, we set λ X and λ Y to their respectively succeedingleaves, climb up to the parents of u and v , and recurse on (2).During the traversals of both trees, we spend constant time for each navigation operation, i.e., (a) selecting achild, and (b) climbing up to the parent of a node: On the one hand, we select a child of a node v in constanttime by following the pointer of the name of v (defined in Sect. 3.2). On the other hand, we maintain, foreach tree, a stack storing all ancestors of the currently visited node during the traversal of the respectivetree: Each stack uses O (lg n ) words, and can return the parent of the currently visited node in constanttime.To upper bound the running time of the traversals, we examine the nodes visited during the traversals.Starting at both root nodes, we follow the path from the root of HT ( X ) (resp. HT ( Y )) down to the roots ofthe minimal subtree T X of HT ( X ) (resp. T Y of HT ( Y )) covering X [ i X ..i X + ‘ ] (resp. Y [ i Y ..i Y + ‘ ]). Afterentering the subtrees T X and T Y , we will never visit nodes outside of T X and T Y . The question is howmany nodes of T X and T Y differ. This can be answered by studying the tree HT ( Z ) built with the samedictionary D , where Z := X [ i X ..i X + ‘ −
1] = Y [ i Y ..i Y + ‘ − HT ( Z ) has O (lg ∗ n ) fragilenodes on each height according to Thm. 19. On the other hand, each (semi-)stable node in HT ( Z ) is foundin both T X and T Y with the same name and surname. This means that when traversing HT ( X ) and HT ( Y )within their respective subtrees T X and T Y , we only visit O (lg ∗ n ) pairs of nodes per height (remember thatwe follow the two paths to the leaves λ X and λ Y , respectively, up to the point where the surnames of thevisited pair of nodes match).To sum up, we (a) compute paths from the roots to h X i [ i X ] and h Y i [ i Y ], respectively, in O (lg | Y | )time, and (b) compare the children of at most O (lg ∗ n ) nodes per height. Since both trees have a height of O (lg | Y | ), we obtain our claimed running time. We assume that i X + ‘ ≤ | X | and i Y + ‘ ≤ | Y | . Otherwise, let T X and T Y cover X [ i X ..i X + ‘ −
1] and Y [ i Y ..i Y + ‘ − τ = 1: Corollary 22.
Given HT ( X ) and HT ( Y ) built on two strings X and Y with | X | ≤ | Y | ≤ n and two textpositions 1 ≤ i X ≤ | X | , ≤ i Y ≤ | Y | , we can compute ‘ := lcp( X [ i X .. ] , Y [ i Y .. ]) in O (lg ‘ lg ∗ n ) time. Proof.
Our idea is to enhance an HSP tree with a data structure such that climbing up from a child to itsparent can be performed in constant time. This can be achieved when we represent the tree topology ofan HSP tree with a pointer based tree, in which each node stores its name and the pointer to its parent.The leaves are stored sequentially in a list. A bit vector with the same length as the input string is used tomark the borders of the generated substrings of the leaves. Given a text position i , we can access the leafwhose generated substring contains i in constant time with a rank-support on the bit vector. The bit vectorwith rank-support takes n + o ( n ) bits. The pointer based tree can be built with the HSP tree without anadditional time overhead, and takes O ( n ) words of space.In the next section, we describe a preliminary version of our sparse suffix sorting algorithm that does notexploit the text space yet. The sparse suffix sorting problem asks for the order of suffixes starting at certain positions in a text T . Inour case, these positions need only be given online, i.e., sequentially and in an arbitrary order. We collectthem conceptually in a dynamic set P with m := |P| . The online sparse suffix sorting problem is to keepthe suffixes starting at the positions stored in the incrementally growing set P in sorted order. Due to theonline setting, we represent the order of Suf ( P ) by a dynamic, self-balancing binary search tree (e.g., anAVL tree). Each node of the tree is associated with a distinct suffix in Suf ( P ); the lexicographic order isused as the sorting criterion.The technique of Irving and Love [21] augments an AVL tree on a set of strings S with LCP values sothat ‘ Y := max { lcp( X, Y ) | X ∈ S} can be computed in O ( ‘ Y / log σ n + lg |S| ) time for a string Y . Insertinga new string Y into the tree is supported in the same time complexity ( ‘ Y is defined as before). Irving andLove called this data structure the suffix AVL tree on S ; we denote it by SAVL ( S ).Remembering Sect. 1.2, our goal is to build SAVL ( Suf ( P )) efficiently. However, inserting m suffixesnaively takes Ω( |C| m/ log σ n + m lg m ) time. How to speed up the comparisons by exploiting a data structurefor LCE queries is the topic of this section. Starting with an empty set of positions P = ∅ , our algorithm updates SAVL ( Suf ( P )) on the input ofevery new text position, involving LCE computations between the new suffix and suffixes already stored in SAVL ( Suf ( P )). A crucial part of the algorithm is performed by these LCE computations, for which an LCEdata structure is advantageous to have. In particular, we are interested in a mergeable LCE data structurethat is mergeable in such a way that the merged instance answers queries faster than performing a queryon both former instances separately. We call this a dynamic LCE data type (dynLCE) ; it supports thefollowing operations: • dynLCE ( I ) constructs a dynLCE data structure M on the substring T [ I ]. Let M. ival denote theinterval I . • LCE ( M , M , p , p ) computes lce( p , p ), where p i ∈ M i . ival for i = 1 , • merge ( M , M ) merges two dynLCEs M and M such that the output is a dynLCE built on the stringconcatenation of T [ M . ival ] and T [ M . ival ]. 23e use the expression t C ( |I| ) to denote the construction time of such a data structure on the substring T [ I ].We assume that the construction of dynLCE ( I ) takes at least as long as scanning all characters on Y , i.e.,Property 1: t C ( |I| ) = Ω( |I| / log σ n ).We use the expressions t L ( | X | + | Y | ) and t M ( | X | + | Y | ) to denote the time for querying and the time formerging two such data structures built on two given strings X and Y , respectively. Querying two dynLCEsfor a length ‘ is faster than the word-packed character comparison iff ‘ = Ω( t L ( ‘ ) lg n/ lg σ ). Hence, weobtain the following property:Property 2: A dynLCE on a text smaller than g := Θ( t L ( g ) lg n/ lg σ ) is always slower than theword-packed character comparison.In the following, we build dynLCEs on substrings of the text. Each interval of the text that is covered bya dynLCE is called an LCE interval . The LCE intervals are maintained in a self-balancing binary searchtree L of size O ( m ). The tree L stores the starting and the ending positions of each LCE interval, and usesthe starting positions as keys to answer the queries • whether a position is covered by a dynLCE, and • where the next text position starts that is covered by a dynLCE,in O (lg m ) time. Additionally, each LCE interval is assigned to one dynLCE data structure (a dynLCE can beassigned to multiple LCE intervals) such that L can not only retrieve the next position covered by a dynLCE,but actually return a dynLCE that covers that position. The dynLCE is retrieved by augmenting an LCEinterval I with a pointer to its dynLCE data structure M , and with an integer i such that T [ M. ival ∩ [ i..i + |I| − T [ I ] (since M could be built on a text interval M. ival = I that contains an occurrence of T [ I ]).Given a new position ˆ p
6∈ P with 1 ≤ ˆ p ≤ | T | , updating SAVL ( Suf ( P )) to SAVL ( Suf ( P ∪ { ˆ p } )) involvestwo parts: first locating the insertion node for ˆ p in SAVL ( Suf ( P )), and then updating the set of LCE intervals. Locating.
The insertion operation performs an LCE computation for each node encountered in
SAVL ( Suf ( P ))while locating the insertion point of ˆ p . Suppose that the task is to compare the suffixes T [ i.. ] and T [ j.. ] fortwo text positions i and j with 1 ≤ i, j ≤ | T | . We perform the following steps to compute lce( i, j ):(1) Check whether the positions i and j are contained in an LCE interval, in O (lg m ) time with the searchtree L . • If both positions are covered by LCE intervals, then query the respective dynLCEs for the length ‘ of the LCE starting at i and j . Increment i and j by ‘ . Return the number of compared characterson finding a mismatch while computing the LCE. • Otherwise (if i or j are not contained in an LCE interval), find thesmallest length ‘ such that i + ‘ and j + ‘ are covered by LCEintervals. Increment i and j by ‘ , and naively compare ‘ characters.Return the number of compared characters on a mismatch. ij ‘ LCE (2) Return the total number of matched positions if a mismatch is found in (1). Otherwise, repeat theabove check again (with the incremented values of i and j ).After locating the insertion point of ˆ p in SAVL ( Suf ( P )), we obtain ¯ p := mlcparg(ˆ p ) and ‘ := mlcp(ˆ p ) as abyproduct, where mlcparg( p ) := argmax p ∈P ,p = p lcp( T [ p.. ] , T [ p .. ]) and mlcp( p ) := lcp( T [ p.. ] , T [mlcparg( p ) .. ])for each text position p with 1 ≤ p ≤ | T | . We insert ˆ p into SAVL ( Suf ( P )), and use the position ¯ p and thelength ‘ to update the LCE intervals. Updating.
The LCE intervals are updated dynamically, subject to the following properties (see Fig. 24):24 mlcp p LCE interval LCE interval ≤ g ≤ g ≥ g ≥ g ≥ g Figure 24: Sketch of two LCE intervals with Properties 3 to 5. ˆ p ˆ p + i I ˆ p + j ¯ p ¯ p + i J ¯ p + jT = K ‘ ‘ Figure 25: Application of Rules 1 to 4 for preserving the properties. The interval I := [ˆ p + i.. ˆ p + j ] isnot yet covered by an LCE interval, but is contained in [ˆ p.. ˆ p + ‘ −
1] — a conflict with Property 4. Theconflict is resolved based on the LCE intervals covering the positions of J := [¯ p + i.. ¯ p + j ]. The intervalswith the horizontal lines are the LCE intervals, and the intervals with the diagonal lines are the intervals of[ˆ p.. ˆ p + ‘ − \ U . Here, J intersects with an LCE interval K . This case is treated in Rule 2.Property 3: The length of each LCE interval is at least g (defined in Property 2).Property 4: For every p ∈ P , the interval [ p..p + mlcp( p ) −
1] is covered by an LCE interval, exceptat most g positions at its left and right ends.Property 5: There is a gap of at least g positions between every pair of LCE intervals.After adding ˆ p to P , we perform the following instructions to satisfy the properties. If ‘ ≤ g , we donothing, because all properties are still valid (in particular, Property 4 still holds). Otherwise, we need torestore Property 4. There are at most two positions in P that possibly invalidate Property 4 after adding ˆ p ,and these are ˆ p and ¯ p (otherwise, by transitivity, we would have created a longer LCE interval previously).We introduce an algorithm that does not restore Property 4 directly, but first ensures thatProperty 4’: the intervals [ˆ p.. ˆ p + ‘ −
1] and [¯ p.. ¯ p + ‘ −
1] are covered by one or multiple LCEintervals.In a later step, we restore Property 4 by merging LCE intervals that are in conflict with Property 5, andthus restore all properties: Let U ⊂ [1 ..n ] be the set of all positions that belong to an LCE interval. Theset [ˆ p.. ˆ p + ‘ − \ U can be represented as a set of disjoint intervals of maximal length. For each interval I := [ˆ p + i.. ˆ p + j ] ⊂ [ˆ p.. ˆ p + ‘ −
1] of that set, apply the following rules with J := [¯ p + i.. ¯ p + j ] (for integers i, j with 0 ≤ i ≤ j ≤ ‘ −
1, see Fig. 25) sequentially:Rule 1: If J is a sub-interval of an LCE interval K , then declare I as an LCE interval and let itrefer to the dynLCE of K .Rule 2: If J intersects with an LCE interval K , enlarge the dynLCE on T [ K ] to cover T [ K ∪ J ](create a dynLCE on T [ J \ K ] and merge it with the dynLCE on T [ K ]). Apply Rule 1.Rule 3: Otherwise (there is no LCE interval K with J ∩ K 6 = ∅ ), create dynLCE ( J ), and make I and J to LCE intervals referring to dynLCE ( J ).25e satisfy Property 4’ on [¯ p.. ¯ p + ‘ −
1] by updating U , computing the set of disjoint intervals [¯ p.. ¯ p + ‘ − \ U ,and applying the same rules on it. However, Rule 1 or Rule 3 can create LCE intervals shorter than g ,violating Property 3. By construction, such a short LCE interval is adjacent to another LCE interval (therules compute a cover of [ˆ p.. ˆ p + ‘ −
1] and [¯ p.. ¯ p + ‘ −
1] with LCE intervals). This means that we can restoreProperty 3 by restoring Property 5. We do that by applying the following rule subsequently to Rule 3:Rule 4: Merge a newly created or extended LCE interval violating Property 5 with its nearest LCEinterval (ties can be broken arbitrarily). Merge those LCE intervals and their dynLCEs.Rule 4 also restores Property 4 (since Property 4’ and Property 5 hold). After applying all rules, we haveintroduced at most two new LCE intervals that cover the intervals [ˆ p + g.. ˆ p + ‘ − − g ] and [¯ p + g.. ¯ p + ‘ − − g ],respectively, to satisfy Properties 3 to 5. The running time of this algorithm is analyzed in the followinglemma: Lemma 23.
Given a text T of length n and a set of m arbitrary positions P in T , the suffix AVLtree SAVL ( Suf ( P )) with the suffixes of T starting at the positions P can be computed deterministicallyin O ( t C ( |C| ) + t L ( |C| ) m lg m + t M ( |C| ) m ) time. Proof.
The analysis is split into managing the dynLCEs, and the LCE queries: • We build dynLCEs on substrings covering at most |C| characters of the text, taking at most t C ( |C| ) timefor constructing all dynLCEs. During the construction of the dynLCEs we spend O ( |C| / log σ n ) = O ( t C ( |C| )) time on naive searches due to Property 1. • The number of merge operations on the LCE intervals is upper bounded by 2 m in total, since we createat most two new LCE intervals for every position in P . In total, we spend at most 2 t M ( |C| ) m time forthe merging in total. • The algorithm performs O ( m lg m ) LCE queries. LCE queries involve either (a) naive character com-parisons or (b) querying a dynLCE. Given that we have δ < m LCE intervals, we switch betweenboth techniques at most 4 δ + 1 times for an LCE query.(a) On the one hand, the overall time for the naive character comparisons is bounded by O ( t C ( |C| ) + t L ( |C| ) m lg m ): – By Property 3, all substrings T [ p..p + mlcp( p ) −
1] are covered by an LCE interval, except atmost at 2 g positions. This means that all substrings that are not covered by an LCE interval,but have been subject to a naive character comparison, are shorter than 2 g . For a naivecharacter comparison with one of those substrings, we spend at most O ( gm lg m/ log σ n ) = O ( t L ( g ) m lg m ) = O ( t L ( |C| ) m lg m ) time. In the case that g > |C| , we do not create anyLCE interval, and spend O ( |C| / log σ n + m lg m ) = O ( t C ( |C| ) + m lg m ) overall time due toProperty 1. – If we compare more than g characters for an LCE query, we create at most two LCE intervals,possibly involving the construction of dynLCEs on the compared substrings. The constructionof a dynLCE on an interval I takes t C ( |I| ) = Ω( |I| / log σ n ) time due to Property 1.(b) On the other hand, querying the dynLCEs take at most O ( t L ( |C| ) m lg m ) overall time. Supposethat we look up d < δ LCE intervals for an LCE query. Since we look up an LCE interval in O (lg m ) time with L , we spend O ( d lg m ) time on the lookups during this LCE query. However,we subsequently merge all d looked-up LCE intervals, reducing the number of LCE intervals δ by d −
1. Consequently, we perform a look-up of an LCE interval at most 2 m times in total. The number of new LCE intervals could be indeed two: Although ¯ p ∈ P , we would not have created an LCE intervalcovering [¯ p + g.. ¯ p + ¯ ‘ − − g ] if mlcp(¯ p ) was smaller than g at the time when we inserted ¯ p in P with ¯ ‘ := mlcp(¯ p ). SSA := SSA ( T, P ) and SLCP := SLCP ( T, P ) from SAVL ( Suf ( P )) by travers-ing SAVL ( Suf ( P )) and performing LCE queries on the already computed dynLCEs: The SAVL ( Suf ( P ))is a binary search tree storing all elements of Suf ( P ) in lexicographically sorted order. This means thatwe can compute SSA with an in-order traversal of
SAVL ( Suf ( P )). Afterwards, we compute SLCP [ i ] =lce( SSA [ i ] , SSA [ i − SSA [ i ] .. SSA [ i ] + SLCP [ i ] − SSA [ i − .. SSA [ i −
1] +
SLCP [ i ] − SLCP [ i ] = O ( g ) due to Property 3, and we spendat most O ( g/ log σ n ) time on computing SLCP [ i ] by naive character comparisons. Otherwise, we spend O ( g/ log σ n + t L ( SLCP [ i ])) = O ( t L ( SLCP [ i ])) time by querying a single dynLCE due to Property 4. Queryingwhether both text intervals are covered by an dynLCE costs O (lg m ) time with L . In total, we can compute SLCP [ i ] for each integer i with 2 ≤ i ≤ m in O ( t L ( |C| ) + m lg m ) time, since O ( g/ log σ n ) = O ( t L ( g )) due toProperty 2. The following corollary of Lemma 23 summarizes the achievements of this section: Corollary 24.
Given a text T of length n that is loaded into RAM, the SSA and SLCP of T for a set of m arbitrary positions can be computed deterministically in O ( t C ( |C| ) + t L ( |C| ) m lg m + t M ( |C| ) m ) time. Weneed O ( m ) words of space, and space to store dynLCE on |C| positions. HT ( X ) HT ( Y ) · · · ∆ R · · · ∆ R · · · ∆ L · · · ∆ L......... ...
We show that the HSP tree is a dynLCE data structure.Remembering that the algorithm from Sect. 4.1 depends onthe merging operation of dynLCE, we now introduce themerging of HSP trees. A naive way to merge two HSP trees HT ( X ) and HT ( Y ) is to build HT ( XY ) completely fromscratch. Since only the fragile nodes of HT ( X ) and HT ( Y )can change when merging both trees, a more sophisticatedapproach would reparse only the fragile nodes of both trees.Remembering the properties studied in Sect. 2.4, we showsuch an approach in the following lemma: Lemma 25.
Merging HT ( X ) and HT ( Y ) of two strings X, Y ∈ Σ ∗ into HT ( XY ) takes O ( t look ( ∆ R lg | X | + ∆ L lg | Y | )) time. Proof.
First assume that HT ( X ) and HT ( Y ) only contain type 2 nodes. In this case, we examine therightmost nodes of HT ( X ) and the leftmost nodes of HT ( Y ) from the bottom up to the root: At each height h , we merge the nodes h X i h and h Y i h to h XY i h by reparsing the ∆ R rightmost nodes of h X i h , and the ∆ L leftmost nodes of h Y i h . By doing so, we reparse all nodes of HT ( X ) (resp. HT ( Y )) whose local surroundingon the right (resp. left) side does not exist. Nodes of HT ( X ) (resp. HT ( Y )) that have a local surrounding onthe right (resp. left) side are not changed by the parsing. In total, we spend O ( t look ( ∆ R lg | X | + ∆ L lg | Y | ))time on merging two trees consisting of type 2 nodes.Next, we allow repeating nodes. Lemma 17 shows that there are no fragile surrounded nodes in HT ( X )that need to be fixed. The remaining problem is to find and recompute the ∆ L s ( Z ) · ( Z ) ··· · ( Z ) · ( Z ) · ( Z ) µu ( Z ) surrounded nodes in HT ( Y ) whose names change on merging both trees. Thelowest of these nodes belong to a repeating meta-block due to Lemma 7 andCor. 12. To find this meta-block, we adapt the strategy of the first paragraphconsidering only type 2 meta-blocks. On each height h , we reparse the ∆ L leftmost nodes of h Y i h . If the rightmost of these ∆ L nodes are contained ina repeating meta-block µ that does not end within those ∆ L leftmost nodes,chances are that the names of some nodes in µ change. Due to Cor. 12, it is sufficient to reparse the tworightmost nodes of µ . This is done as follows:1. Take the leftmost repetitive node s of µ (which exists due to Cor. 16, and is one of the ∆ L + 1 leftmostnodes on height h ). 27. Given that s has the surname Z , climb up the tree to find the highest ancestor u with surname Z . Theancestor u is the lowest common ancestor of s and the rightmost repetitive node of µ .3. Walk down from u to the rightmost nodes of µ .4. Reparse µ ’s two rightmost nodes.5. Reparse all ancestors of these two nodes that are surrounded.6. Check whether the reparsed ancestors invalidate the parsing of their meta-blocks; fix the parsing forthose meta-blocks recursively.Climbing up to find u and walking down to the rightmost nodes of µ takes O ( t look lg | µ | ) = O ( t look lg( n/ h ))time, reparsing the surrounded ancestor nodes of the two rightmost nodes of µ takes O ( t look lg( n/ h )) time.Given that the highest nodes of this reparsing are on a height h > h , Lemma 18 states that up to theheight h + 1, there is no need to reparse a fragile surrounded node (we follow the paths of fragile nodes asdepicted in Fig. 22). Given that there are µ , . . . , µ k such meta-blocks (for which we apply Steps 1 to 6), wehave O ( t look P ki =1 lg | µ i | ) = O ( t look lg n ) due to P ki =1 lg µ i ≤ lg n . Hence, we spend O (( ∆ L + ∆ R ) t look lg | Y | )time overall.The following theorem combines the results of Cor. 24 and Lemma 25. Theorem 26.
Given a text T of length n and a set of m text positions P , SSA ( T, P ) and SLCP ( T, P ) canbe computed in O ( |C| (lg ∗ n + t look ) + m lg m lg n lg ∗ n ) time. We need O ( n + m ) words of space. Proof.
We have • t C ( |C| ) = O ( |C| (lg ∗ n + t look )) due to Lemma 20, • t L ( |C| ) = O (lg ∗ n lg n ) due to Lemma 21, and • t M ( |C| ) = O ( t look lg n lg ∗ n ) due to Lemma 25.Actually, the time cost for merging is already upper bounded by the cost for the tree creation. To seethis, let δ ≤ m be the number of LCE intervals. Since each HSP tree covers at least g characters, δg is at most |C| , and we obtain δ t M ( |C| ) = O ( |C| t M ( |C| ) /g ) = O ( |C| t look ) overall time for merging, where g = Θ( t L ( |C| ) lg n/ lg σ ) = Θ(lg ∗ n lg n/ lg σ ). Plugging the times t C ( |C| ), t L ( |C| ), and the refined analysis ofthe merging time cost in Cor. 24 yields the claimed time bounds. Remembering the outline in the introduction, the key idea to solve the limited space problem is storingdynLCEs in text space. Taking two LCE intervals of the text containing the same substring, we free up thespace of one part while marking the other part as a reference. The freed space could be used to store anHSP tree whose leaves refer to substrings of the other LCE interval. By doing so, we would use the textspace for storing the HSP trees, while using only O ( m ) additional words for storing SAVL ( Suf ( P )) and thesearch tree L of the LCE intervals. However, an HSP tree built on a string of length n takes O ( n lg n ) bits,while the string itself provides only n lg σ bits. Our solution is to truncate the HSP tree at a fixed height η ,discarding the nodes in the lower part. The truncated version tHT η ( Y ) stores just the upper part, whileits new leaves refer to (possibly long) substrings of Y . The resulting tree is called the η -truncated HSPtree (tHT η ) , whose definition follows: 28 a a a a a N Na a N (cid:15) aaa aaa aaa aaa aa aa ba ba . . . a a a a a a a a a b a b a string ( N ) string ( a ) string ( a ) η -nodeslower nodesupper nodes Y = Y = Figure 26: The η -truncated HSP tree tHT η ( Y ) of the substring Y defined in Fig. 6 with η = 2. Like inFig. 21, the lower nodes are grayed out. An η -node is a leaf in tHT η ( Y ), and has a generated substring witha length between four and nine. J J J J a a J J a α ab ab ab ab aaa aaa a a a a a a a a a a a a aaa aaa aaa aaa aaa aaa aaa aaaJ J J J a a a a a a a a a a J J a a a a J a a γ ab ab ab ab aaa aaa aaa aaa aaa aaa aaa aaa aaa aaa merge( ab ) a = a =( ab ) a = µ Figure 27: Merging HT (( ab ) a ) with HT ( a ) (both at the top) to HT (( ab ) a ) (bottom tree). Reparsingthe repeating meta-block µ on height one of the right tree is done by rearranging µ ’s two rightmost nodes. We define a height η and delete all nodes at height less than η , which we call lower nodes . A node higherthan η is called an upper node . The nodes at height η form the new leaves and are called η -nodes . Similarto the former leaves, their names are pointers to their generated substrings appearing in Y . Rememberingthat each internal node has two or three children, an η -node generates a string of length at least 2 η andat most 3 η . The maximum number of nodes in an η -truncated HSP tree of a string of length n is n/ η .Figure 26 shows an example with η = 2.Similar to leaves in untruncated HSP trees, we use the generated substring X of an η -node v for storingand looking up v : While the leaves of the HSP tree have a generated substring of constant size (two or threecharacters), the generated substring of an η -node can be as long as 3 η . Storing such long strings in a binarysearch tree representing the reverse dictionary of D is inefficient; it would need O ( ‘ lg σ ) time for a lookup orinsertion of a key of length ‘ . Instead, we want a dictionary data structure storing O ( | Y | ) elements in O ( | Y | )words of space , supporting lookup and insert in O ( t look + ‘/ log σ n ) time for a key of length ‘ . For instance,Franceschini and Grossi’s data structure [13] with word-packing supports the desired time and space boundswith t look = O (lg n ). Lemma 27.
We can build an η -truncated HSP tree tHT η ( Y ) of a string Y of length n in O ( n (lg ∗ n + η/ log σ n + t look / η )) time, using O (3 η lg ∗ n ) words of working space. The tree takes O ( n/ η ) words of space. The data structure is not necessarily stored in consecutive space like an array. roof. Instead of building the HSP tree level by level, we compute the η -nodes one after another, from leftto right. We can split the parsing of the whole string into several parts. Each part computes one η -node.First assume that tHT η ( Y ) only contains type 2 nodes. Then the name of an η -node v is determined by v ’s local surrounding (as far as it exists) due to Lemma 7. This means that it is sufficient to keep v ’s localsurrounding at height η −
1, which we denote by X v , in memory. X v is a string of lower nodes. To parse astring of lower nodes by HSP, we have to give each lower node a name. Unfortunately, storing the namesof all lower nodes in a dictionary would take too much space. Instead, we create the name of a lower nodetemporarily by setting the name of a lower node to its generated substring. This means that we cannotretrieve their names later. Luckily, we only need the names of the lower nodes for constructing X v . Weconstruct X v as follows: Given that we parsed the local surrounding of v at height h (0 ≤ h ≤ η −
3) withHSP, we store the borders of the blocks on height h + 1 in an integer array such that we can access the name(i.e., the generated substring) of the i -th block on height h + 1. With this integer array, we can parse theblocks on height h + 1 to obtain the blocks on height h + 2, whose borders are again stored in an integerarray. Having the borders of the blocks on height h + 2, we can remove the integer array on height h + 1.The blocks on height η − X v .In the general case (when tHT η ( Y ) contains repeating nodes), it can happen that the name of a greedilyparsed node (i.e., a repeating node or one of the ∆ L leftmost nodes of a type 2 meta-block) depends notnecessarily on its local surrounding, but on the length of its repeating meta-block, its surname and itschildren (in case of a type M node). This means that when computing X v of an η -node v , we additionallyhave to consider the case when nodes in the local surrounding of v are contained in a meta-block µ onheight h < η that extends over the nodes in v ’s surrounding at height h . It is sufficient to use a countingvariable that tracks the position of the last block of µ belonging to the subtree of the preceding η -node of v (remember that the greedy parsing determines the blocks by an arithmetic progression). Another necessityis to maintain the surnames of the lower nodes. In our approach, each array storing the borders of the blockson the heights below η is accompanied with two arrays. The first array stores the length of the prefix of thegenerated substring of each block β that is equal to β ’s surname; the second array stores the surname-lengthof each block. Working Space.
We compute v after computing X v . To compute X v , we apply the HSP technique ( η − X v . Since the nodes of X v cover at most 3 η ( ∆ L + ∆ R )characters, we need O (3 η ( ∆ L + ∆ R )) words of working space to maintain the integer arrays storing theborders of the blocks at two consecutive heights. To cope with the meta-blocks extending over the borderof the subtrees of two η -nodes, we store the last position of each such meta-block belonging to the localsurrounding of the previous η -node. These positions take O ( η ) words, since such a meta-block can exist onevery height below η . Time.
The time bound O ( n lg ∗ n ) for the repeated application of the alphabet reduction is the same asin Lemma 20. The new part is the construction of an η -node by constructing X v : To construct the lowernodes X v , we apply the HSP technique ( η − string ( v ). The HSP technique compares lowernodes by their generated substrings (instead of comparing by a name stored in D ). It always comparestwo adjacent lower nodes during the construction of X v . To bound the number of comparisons of the lowernodes, we focus on all lower nodes on a fixed height h with 1 ≤ h ≤ η −
1: Since the sum of the lengths ofthe generated substrings of the lower nodes on height h is always n , the comparisons of the lower nodes onheight h take O ( n/ log σ n ) time, independent of the number of nodes on height h . Summing over all heights,these comparisons take O ( nη/ log σ n ) time in total. By the same argument, maintaining the names of all η -nodes takes O ( n/ log σ n + t look n/ η ) time.A name is looked-up in O ( t look ) time for an upper node. Since the number of upper nodes is at most n/ η , maintaining the names of the upper nodes takes O ( t look n/ η ) time. This time is subsumed by thelookup time for the η -nodes. Surnames.
Augmenting the (remaining) nodes of the η -truncated HSP tree with surnames cannot bedone as trivially as in the standard HSP tree construction, since a repetitive node can have a surname equalto the name of a lower node (remember that lower nodes are generated only temporarily, and hence are notmaintained in the reverse dictionary). To maintain the surnames pointing to lower nodes, we need to save30he names of certain lower nodes in a supplementary reverse dictionary D of D . This is only necessarywhen one of the remaining nodes (i.e., the upper nodes and the η -nodes) in the η -truncated HSP tree has asurname that is the name of a lower node. If such a remaining node v is an upper node having a surnameequal to the name of a lower node, the η -nodes in the subtree rooted at v have also the same surname.Hence, the number of entries in D is upper bounded by the number of η -nodes. The dictionary D is filledwith the surnames of the children of all η -nodes, whose number is at most 3 n/ η . Filling or querying D takes the same time as maintaining the η -nodes.Similar to the standard HSP trees, we can conduct LCE queries on two η -truncated HSP trees in thefollowing way: Lemma 28.
Let X and Y be two strings with | X | , | Y | ≤ n . Given that tHT η ( X ) and tHT η ( Y ) are builtwith the same dictionary, and given two text positions 1 ≤ i X ≤ | X | , ≤ i Y ≤ | Y | , we can computelcp( X [ i X .. ] , Y [ i Y .. ]) in O (lg ∗ n (lg( n/ η ) + 3 η / log σ n )) time using O (lg( n/ η )) words of working space. Proof.
Lemma 21 gives the time bounds for computing the longest common prefix with two HSP trees. Thelemma describes an LCE algorithm that uses the surnames to compare the generated substring of two nodes.By doing so, it accelerates the search for the first pair of mismatching characters in X [ i X .. ] and Y [ i Y .. ]. Tofind this mismatching pair, it examines the subtrees of the two nodes if both nodes mismatch. Since wecannot access a child of an η -node in our η -truncated HSP trees without rebuilding its subtree (as we donot store the lower nodes in D ), we treat the η -nodes as the leaves of the tree. This means that we comparetwo η -nodes (given their surnames are different) with a naive comparison of their generated substrings in O (3 η / log σ n ) time, remembering that the length of the generated substring of an η -node is at most 3 η . Forthe upper nodes, the algorithm works identically to the original version such that it takes O (lg ∗ n (lg( ‘/ η ))time for traversing those.Applying the idea of Cor. 22 to Lemma 28 gives the following corollary: Corollary 29.
Let X and Y be two strings with | X | , | Y | ≤ n . Given that tHT η ( X ) and tHT η ( Y ) are builtwith the same dictionary, we can augment both trees with a data structures such that given two text positions1 ≤ i X ≤ | X | , ≤ i Y ≤ | Y | , we can compute ‘ := lcp( X [ i X .. ] , Y [ i Y .. ]) in O (lg ∗ n (lg( ‘/ η ) + 3 η / log σ n ))time using O (lg( n/ η )) words of working space. The additional data structures can be constructed in O ( n )time with O ( n/ lg n ) words of space. Their space bounds are within the space bounds of the HSP trees. Proof.
To support accessing the parent of a node in constant time, we construct a pointer based tree structureof the truncated tree during its construction. Since tHT η ( Y ) contains at most n/ η nodes, the pointer basedtree structure takes O ( n/ η ) words.Given that η ≤ lg lg n , we augment the tree structure with a bit vector to jump from a text position to an η -node like in Cor. 22: We create a bit vector of length n marking the borders of the generated substrings ofthe η -nodes such that a rank-support data structure on this bit vector allows us to jump from a position Y [ i ]to the η -node h Y i η [ j ] with 1 + P j − k =1 string ( h Y i η [ k ]) ≤ i ≤ P jk =1 string ( h Y i η [ k ]) in constant time. The bitvector with rank-support takes O ( n/ lg n ) words, which is too much to obtain the space bounds of O ( n/ η )words when η = Ω(lg lg n ).Instead, we compute a sorted list of pairs if η ≥ lg (lg n ). During the construction of a truncated tree,we collect pairs of constructed η -nodes and their starting positions in a list. This list is automatically sortedby the starting positions as we construct the tree from left to right. The list takes O ( n/ η ) words, and wecan find the η -node whose generated substring covers a given position in O (lg( n/ η )) = O (lg n ) time bybinary searching the starting positions. This time is bounded by the time O (lg ∗ n η / lg σ n ) for scanning thegenerated substrings of all η -nodes during an LCE query, which is O (lg ∗ n lg n lg σ ) time when η ≥ lg (lg n ).It is left to consider the case that lg lg n < η < lg lg n . Let k be the number of η -nodes such that n/ η ≤ k ≤ n/ η . We build the above bit vector in the representation of Pagh [34]. In this representation,the rank-support answers rank queries in constant time. The bit vector together with its rank-support takes O ( k lg( n/k ) + k /n + k (lg lg k ) / lg k ) = O ( kη ) bits (which are O ( n/ η ) words) when k = n/ lg c n for a31 = u tHT η ( X ) tHT η ( Y ) v w string ( v ) string ( w ) string ( u ) string ( v ) η -nodes LCE intervalLCE intervalLCE interval Figure 28: Problem with generated substrings when merging tHT η ( X ) and tHT η ( Y ). Assume that we wantto merge tHT η ( X ) and tHT η ( Y ), and thus compute the bridging η -nodes (like u ) between both trees. Onthe one hand, the generated substrings of the non-surrounded η -nodes (like v ) and of the bridging nodesare marked protected, because we cannot find a surrogate substring in general. Although there is a secondoccurrence of string ( v ) to the right, string ( v ) can be extended or shortened when prepending characters(e.g., suppose that string ( v ) = a k , and that there is an a to the left of the left occurrence of string ( v ), butnot to the left of the right occurrence). On the other hand, the space of the recyclable interval can be usedfor storing the η -truncated HSP trees, because here we find suitable surrogate substrings for the generatedsubstrings of the η -nodes (like for w ).constant c > c exists, because n/ lg n < n/ η ≤ k ≤ n/ η < n/ lg n .However, the construction needs O ( n/ lg n ) words of space.With τ := 2 η we obtain the claim of Thm. 3. Remark 30.
In the following, we stick to the result obtained in Lemma 28 instead of Cor. 29. AlthoughLemma 28 has a slower running time for longest common prefixes that are short, the additional rank-supportdata structures of Cor. 29 makes it difficult to achieve our aimed running time for merging two trees (andtherefore would restrain us from achieving our final goal stated in Thm. 1). To merge two trees, whereeach tree is augmented with the bit vector and its rank-support data structure, the task would be to builda rank-support data structure on the concatenation of the bit vectors (preferably in logarithmic time).Unfortunately, we are not aware of a rank-support data structure that is efficiently mergeable (a naive waywould be to build the rank-support data structure of the large bit vector from scratch in linear time).
To use the η -truncated HSP trees as dynLCEs stored in text space , we have to think about how to mergethem. Like with HSP trees, merging two η -truncated HSP trees involves a reparsing of the nodes at thefacing borders (cf. Fig. 27). However, the reparsing of the η -nodes on that borders is especially problematic,as can be seen in Fig. 28: Suppose that we rename an η -node v from N to N with | string ( N ) | < | string ( N ) | .If the name N is not yet maintained in the dictionary, we have to create N , i.e., a pointer to a substring X of the text with X = string ( N ). The critical part is to find X in the not-yet overwritten parts of the text:Although we can create a suitably long string containing X by concatenating the generated substrings of v ’s preceding and succeeding siblings, these η -nodes may point to text intervals that are not consecutive.Since the name of an η -node is the representation of a single substring, we would have to search X in the entire remaining text. In the case that v is surrounded, Lemma 18 shows that X is a prefix of the generatedsubstring of a sibling η -node (unlike in Fig. 21, where the generated substring of the ESP node with name O cannot be easily determined). With this insight, we finally show an approach that proves Thm. 1. For that,32t remains to implement Rule 3 and Rule 4 from Sect. 4.1 in the context that we maintain η -truncated HSPtrees in text space : We explainGoal 1: how the parameter η has to be chosen such that tHT η ( Y ) fits into | Y | lg σ bits (needed for Rule 3),andGoal 2: how to merge two η -truncated HSP trees without the need for extra working space (needed forRule 4).Our first goal is to store tHT η ( T [ I ]) in a text interval I . Since tHT η ( T [ I ]) can contain nodes with |I| / η distinct names, it requires O ( |I| / η ) words, i.e., O ( |I| lg n/ η ) bits of space that might not fit in the |I| lg σ bits of T [ I ]. Declaring a constant α (independent of n and σ , but dependent on the size of a singlenode), we can solve this space issue by setting η := log ( α lg n/ lg σ ): Lemma 31.
The number of nodes of an η -truncated HSP tree on a substring of length ‘ is bounded by O ( ‘ (lg σ ) . / (lg n ) . ) with η = log ( α lg n/ lg σ ). Proof.
To obtain the upper bound on the number of nodes, we first compute a lower bound on the numberof bits taken by the generated substring of an η -node, which is already lower bounded by 2 η lg σ bits. Webegin with changing the base of the logarithm from 3 to 2 /
3, and reformulate η = log ( α lg n/ lg σ ) =(log −
1) log / ( α lg n/ lg σ ) = log / ( α lg n/ lg σ ) log − . This gives2 η lg σ = 3 η (2 / η lg σ = α ( α lg n/ lg σ ) log − lg n = ( α log )(lg n ) (lg σ ) − log . With the estimate 0 . < log < . α log )(lg n ) (lg σ ) − log > α . (lg n ) . (lg σ ) . . Hence, the generated substring of an η -node takes at least 2 η lg σ ≥ α . (lg n ) . (lg σ ) . bits.Finally, the number of nodes is bounded by ‘/ η ≤ ‘ lg σ/ ( α . (lg n ) . (lg σ ) . ) = ‘ (lg σ ) . / ( α . (lg n ) . ) . A consequence is that an η -node with η = log ( α lg n/ lg σ ) generates a substring containing at most3 η = α (lg n ) / (lg σ ) characters.Plugging this value of η in Lemma 27 and Lemma 28 yields two corollaries for the η -truncated HSP trees: Corollary 32.
We can compute an η -truncated HSP tree on a substring of length ‘ in O ( ‘ lg ∗ n + t look ‘/ η + ‘ lg lg n ) time. The tree takes O ( ‘/ η ) words of space. We need a working space of O (lg n lg ∗ n/ lg σ )characters. Proof.
The tree has at most ‘/ η nodes, and thus takes O ( ‘/ η ) words of space. According to Lemma 27,constructing an η -node uses O (3 η lg ∗ n ) = O (lg n lg ∗ n/ lg σ ) characters as working space. Corollary 33.
An LCE query on two η -truncated HSP trees can be answered in O (lg ∗ n lg n ) time. Proof.
LCE queries are answered as in Lemma 28, where the time bound depends on η . Since an η -nodegenerates a substring of at most 3 η = α lg n/ lg σ characters, we can compare the generated substrings of two η -nodes in O ( α lg n ) time. Overall, we compare O (lg ∗ n ) many times two η -nodes, such that these additionalcosts are bounded by O (lg ∗ n lg n ) time overall, and do not slow down the running time O (lg ∗ n lg( n/ η ) +lg ∗ n lg n ) = O (lg ∗ n lg n ).Our second and final goal is to adapt the merging used in the sparse suffix sorting algorithm (Sect. 4.1).Suppose that our algorithm finds two intervals [ i..i + ‘ −
1] and [ j..j + ‘ −
1] with T [ i..i + ‘ −
1] = T [ j..j + ‘ − tHT η ( T [ i..i + ‘ − j..j + ‘ − T [ i..i + ‘ −
1] untouchedso that parts of this substring can be referenced by the η -nodes. Unfortunately, Rules 1 to 4 cannot be applieddirectly due to our working space limitation. Since we additionally use the text space as working space, wehave to be careful about what to overwrite. In particular, we focus on how to33 CE interval LCE interval T = f f ≥ g f f Figure 29: Division of LCE intervals in protected (horizontal lines) and recyclable (vertical lines) parts.(a) partition the LCE intervals such that the generated substrings of the fragile non-surrounded η -nodesare protected from becoming overwritten,(b) keep enough working space in text space available for merging two trees,(c) construct tHT η ( T [ i..i + ‘ − j..j + ‘ −
1] when the intervals [ i..i + ‘ −
1] and[ j..j + ‘ −
1] overlap, and how to(d) bridge the gap T [ e ( I ) + 1 .. b ( J ) −
1] when merging tHT η ( T [ I ]) and tHT η ( T [ J ]) to tHT η ( T [ b ( I ) .. e ( J )])for two intervals I and J with b ( I ) < b ( J ) and | [ e ( I ) + 1 .. b ( J ) − | < g , as performed in Rule 4. (a) Partitioning of LCE intervals. To merge two η -truncated HSP trees, we have to take special careof those η -nodes that are fragile, because their names can change due to a merge. If the parsing changes thename of an η -node v , we first check whether v ’s new name is present in the dictionary. If it is not, we haveto create v ’s new name consisting of a text position i and a length ‘ such that T [ i..i + ‘ −
1] = string ( v ). Thenew name of a fragile surrounded η -node v can be created easily: According to Lemma 18, the generatedsubstring of v is always a prefix of the generated substring of an already existing η -node w , which is foundin the reverse dictionary of the η -nodes. Hence, we can create a new name of v with string ( w ).Unfortunately, the same approach does not work with the non-surrounded η -nodes, because those nodeshave generated substrings that are found at the borders of T [ j..j + ‘ −
1] (remember Fig. 28). If the charactersaround the borders are left untouched (meaning that we prohibit overwriting these characters), they can beused for creating the names of the fragile non-surrounded η -nodes during a reparsing. To prevent overwritingthese characters, we mark both borders of the interval [ j..j + ‘ −
1] as protected. Conceptually, we partitionan LCE interval into (1) recyclable and (2) protected intervals (see Fig. 29); we free the text of a recyclableinterval for overwriting, while prohibiting write access on a protected interval. The recyclable intervals aremanaged in a dynamic, global list. We keep the property thatProperty 6: f := (cid:6) α lg n ∆ L / lg σ (cid:7) = Θ( g ) text positions of the left and right ends of each LCEinterval are protected .This property solves the problem for the non-surrounded nodes, because a non-surrounded η -node has agenerated substring that is found in T [ j..j + f −
1] or T [ j + ‘ − − f..j + ‘ − (b) Reserving Text Space. We can store the upper part of the η -truncated HSP tree in a recyclableinterval, because it needs ‘/ η lg n ≤ ‘α . (lg σ ) . / (lg n ) . = o ( ‘ lg σ ) bits. Since f depends on α and g ,we can choose g (the minimum length of a substring on which an η -truncated HSP tree is built) and α (relative to the number of words taken by a single η -truncated HSP tree node) appropriately to always leave f lg σ/ lg n = O (lg ∗ n lg n ) words on a recyclable interval untouched, sufficiently large for the working spaceneeded by Cor. 32. Therefore, we precompute α and g based on the input text T , and set both as global constants dependent on T . Since the same amount of free space is needed during a subsequent merging whenreparsing an η -node, we add the following property:Property 7: Each LCE interval has f lg σ/ lg n words of free space left on a recyclable interval.In our algorithm for sparse suffix sorting, a special problem emerges when two computed LCE intervalsoverlap. For instance, this can happen when the LCE of a position i ∈ P with a position j ∈ P overlaps, i.e.,34 p p pi I T = jf J p f ≤ f fT = [ b ( I ) . . e ( J )] η -nodes p pp + f ≤ fT = i b e + 1 i + p + f j + ‘ − fη -node Figure 30:
Left : Overlapping LCE intervals I = [ i..i + ‘ −
1] and J = [ j..j + ‘ − Right : Finding thegenerated substring T [ b..e ] of an η -node in a protected interval. Given that p is the smallest period of T [ I ∪ J ], it is sufficient to make f + p characters on the left protected to find the generated substring of all η -nodes of tHT η ( T [ i..j + ‘ − T [ i..i + p + f − i..i + lce( i, j ) − ∩ [ j..j + lce( i, j ) − = ∅ . The algorithm would proceed with merging both overlappingLCE intervals to satisfy Property 5. However, the merged LCE interval cannot respect Property 6 and 7 ingeneral (consider that each interval has a length of 3 g , and both intervals overlap with 2 g characters). Inthe case of overlapping, we exploit the periodicity caused by the overlap to make an η -truncated HSP treefit into both intervals (while still assuring that Property 4 and Property 5 hold, and that we can restore thetext). (c) Interval Overlapping. In a more general setting, suppose that the intervals I := [ i..i + ‘ −
1] and J := [ j..j + ‘ −
1] with T [ I ] = T [ J ] overlap, without loss of generality i < j . Given ‘ > g , our task isto create tHT η ( T [ i..j + ‘ − T [ I ] = T [ J ], the substring T [ i..j + ‘ −
1] has a period p with 1 ≤ p ≤ j − i , i.e., T [ i..j + ‘ −
1] = X k Y , where | X | = p and Y is a(proper) prefix of X , for an integer k with k ≥ k > j ≤ i + ‘ −
1, otherwise i > j or I ∩ J = ∅ ).First, we compute the smallest period p ≤ j − i of T [ i..j + ‘ −
1] in O ( ‘ ) time [27]. By definition, eachsubstring of T [ i + p..j + ‘ −
1] appears also p characters earlier. We treat the substring T [ i..i + p + f − T [ i..i + p + f − η -node by an arithmetic progression. This can be seen by twofacts: First, the length of the generated substring of an η -node is at most 3 η = α lg n/ lg σ ≤ f /
2. Second,given an η -node with the generated substring T [ b..e ] with i + p + f ≤ e ≤ j + ‘ −
1, we find an integer k with k ≥ T [ b..e ] = T [ b − p k ..e − p k ] and [ b − p k ..e − p k ] ⊆ [ i..i + p + f −
1] (since e − b ≤ f / i + p + f + 1 ..j + ‘ − − f ] recyclable , which is at least as large as f , since |I ∪ J | ≥ j − i + 2 g ≥ p + 2 g is at least p + 3 f for a sufficiently large g . The partitioning into protected andrecyclable intervals is illustrated in Fig. 30.For the actual merging operation, we elaborate an approach that respects Properties 6 and 7: (d) Merging with a Gap. We introduce a merge operation that supports the merging of two η -truncatedHSP trees whose LCE intervals have a gap of less than g characters. The difference to Lemma 25 is thatwe additionally build new η -nodes on the gap between both trees. The η -nodes whose generated substringsintersect with the gap are called bridging nodes.Let tHT η ( T [ I ]) and tHT η ( T [ J ]) be built on two LCE intervals I and J with 1 ≤ b ( J ) − e ( I ) ≤ g . Ourtask is to compute the merged tree tHT η ( T [ b ( I ) .. e ( J )]). We do that by (a) reprocessing O ( ∆ L + ∆ R ) nodesat every height of both trees (according to Lemma 25), and (b) building the bridging nodes connecting bothtrees. Like with the non-surrounded nodes, the generated substring of a bridging node can be a uniquesubstring of the text. This means that overwriting T [ e ( I ) − f.. b ( J ) + f ] would invalidate the generatedsubstrings of the bridging nodes and of some (formerly) non-surrounded nodes. Therefore, we mark the35 = η -nodes η -nodes T = bridging nodes f ≥ g f < g f ≥ g f I J [ e ( I ) − f . . b ( J ) + f ]merging Figure 31: Merging tHT η ( T [ I ]) and tHT η ( T [ J ]) with b ( J ) − g ≤ e ( I ) ≤ b ( J ) −
1. The substring T [ e ( I ) − f.. b ( J ) + f ] is marked protected for the sake of the bridging nodes.interval [ e ( I ) − f.. b ( J ) + f ] as protected. By doing so, we can use the characters of T [ e ( I ) − f.. b ( J ) + f ] to • create the bridging η -nodes, and to • reparse the non-surrounded nodes of both trees (Fig. 31).The bridging nodes and their ancestors take o (lg n lg ∗ n ) words of additional space since building tHT η ( T [ e ( I ) + 1 .. b ( J ) − | b ( J ) − e ( I ) | = O ( g ) takes ( g/ η ) lg n = o ( g lg σ ) = o (lg ∗ n lg n ) bits(or o (lg ∗ n lg n ) words) of space. By choosing g and α sufficiently large, we can store the bridging nodes ina recyclable interval while maintaining Property 7 for the merged LCE interval. Finally, the time bound forthis merging strategy is given in the following corollary: Corollary 34.
Given two LCE intervals I and J with b ( I ) ≤ b ( J ) ≤ e ( I )+ g , we can build tHT η ( T [ b ( I ) .. e ( J )])in O ( g lg ∗ n + t look g/ η + gη/ log σ n + t look lg ∗ n lg n ) time. Proof.
We adapt the merging of two HSP trees (Lemma 25) for the η -truncated HSP trees. The differenceto Lemma 25 is that we reparse an η -node by rebuilding its local surrounding consisting of O (( ∆ L + ∆ R )3 η )nodes that take α ( ∆ L + ∆ R ) lg n/ lg σ ≤ f words for a sufficiently large α . According to Property 7, there areat least f words of space left in a recyclable interval to recompute an η -node, and to create the bridging nodesin the fashion of Cor. 32. Both creating and recomputing takes overall O ( g lg ∗ n + t look g/ η + gη/ log σ n )time.There is one problem left before we can prove the main result of the paper: The sparse suffix sortingalgorithm of Sect. 4.1 creates LCE intervals on substrings smaller than g between two LCE intervals tem-porarily when applying Rule 3. We cannot afford to build such tiny η -truncated HSP trees, since they cannotrespect Property 6 and Property 7. Due to Rule 4, we eventually merge a temporarily created dynLCE witha dynLCE on a long LCE interval. Instead of temporarily creating an η -truncated HSP tree covering lessthan g characters, we apply the new merge operation of Cor. 34 directly, merging two trees that have a gapof less than g characters. With this and the other properties stated above, we come to the final proof: Proof of Thm. 1.
The analysis is split into suffix comparison, tree generation and tree merging: • Suffix comparisons are done as in Cor. 24. LCE queries on η -truncated HSP trees and HSP trees areconducted in the same time bounds (compare Lemma 21 with Cor. 33).36 All positions considered for creating the η -truncated HSP trees belong to C . Constructing the η -truncated HSP trees costs O ( |C| lg ∗ n + t look |C| / η + |C| lg lg n ) overall time, due to Cor. 32. • Merging in the fashion of Cor. 34 does not affect the overall time: Since a merge of two trees introducesless than g new text positions to an LCE interval, we conclude with the same analysis as in Thm. 26that the time for merging is upper bounded by the construction time.Plugging the times for suffix comparisons, tree construction and merging in Cor. 24 yields the overall time O ( |C| lg ∗ n + t look |C| / η + |C| lg lg n ) = O ( |C| ( t look (lg σ ) . / (lg n ) . + lg lg n )) = O ( |C| ( p lg σ + lg lg n )) , since t look = O (lg n ). The time for searching and sorting the suffixes is O ( m lg m lg ∗ n lg n ). The auxiliarydata structures used are SAVL ( Suf ( P )), the search tree L for the LCE intervals, and the list of recyclableintervals, each taking O ( m ) words of space. In the first part, we introduced the HSP trees based on the ESP technique as a new data structure that(a) answers LCE queries, and (b) can merge with another HSP tree to form a larger HSP tree. With theseproperties, HSP trees are an eligible choice for the mergeable LCE data structure needed for the sparse suffixsorting algorithm presented here.In the second part, we developed a truncated version with a trade-off parameter determining the heightat which to cut off the lower nodes. Setting the trade-off parameter adequately, the truncated HSP tree fitsinto text-space. As a result of independent interest, we obtained an LCE data structure with a trade-offparameter, like other already known solutions. Although not shown here, an ESP tree can similarly (a)answer LCE queries, (b) be merged, and (c) be truncated. However, answering LCE queries or merging twoESP trees is by a factor of O (lg n ) slower than when the operations are performed with HSP trees.In the appendix, we noted that the maximum number of fragile nodes in an ESP tree of a string of length n can be at least Ω(lg n ), which invalidates the upper bound of O (lg n lg ∗ n ) on the maximal number of fragilenodes postulated in [11]. This result also invalidates theoretical results that depend on the ESP technique(e.g., for approximating the edit distance with moves [11] or the LZ77 factorization [10], or for buildingindexes [41, 15, 30, 40]). We could quickly provide a new upper bound of O (lg n lg ∗ n ), but it remains anopen problem to refine our bounds. Luckily, our proposed HSP technique can be used as a substitution forthe ESP technique, since HSP trees and ESP trees share the same bounds for construction time and spaceusage. By switching to the HSP technique, we regain the promised O (lg n lg ∗ n ) number of fragile nodes.It is easy to see that this result also recovers the postulated O (lg n lg ∗ n ) approximation bound on the editdistance matching problem [11, 41]: Given ET ( T ) of a string T of length n , it is assumed by Cormode andMuthukrishnan [11, Theorem 7] that changing/deleting a character of T , or inserting a character in T changes O (lg ∗ n lg n ) nodes in ET ( T ). Although we only provided proofs that pre-/appending characters to T changes O (lg ∗ n lg n ) nodes of HT ( T ), it is easy to generalize this result by applying a merge operation: Given thatwe insert a character c ∈ Σ between T [ i ] and T [ i + 1], the trees HT ( T ) and HT ( T [1 ..i ] c T [ i + 1 .. ]) differ inat most O (lg ∗ n lg n ) nodes, since appending c to HT ( T [1 ..i ]) and merging HT ( T [1 ..i ] c ) with HT ( T [ i + 1 .. ])changes O (lg ∗ n lg n ) nodes. The same can be observed when deleting or changing the i -th character.In the light of the theoretical improvements of the HSP over the ESP, it is interesting to evaluate howHSP behaves practically. Especially, we are interested in how well HSP behaves in the context of grammarcompression [3] like the ESP-index [30, 40] on highly repetitive texts, where a more stable behavior of therepetitive nodes could lead to an improved compression ratio.From the theoretical point of view, it would be interesting to compute the sparse suffix sorting with atrade-off parameter adjusting space and query time such that this parameter can be chosen from a continuousdomain like the result we presented for the LCE query data structure.In the case that we can impose a restriction on the set of suffixes to sort, K¨arkk¨ainen and Ukkonen [24]presented a sparse suffix sorting algorithm running in optimal O ( n ) time while using O ( m ) words of space,37iven that P is a set of equally spaced text positions. We wonder whether it is also possible to gain a benefitwhen only every i -th entry of SA is needed, i.e., the order of each i -th lexicographically smallest suffix for anarithmetic progression i = c, c, c, . . . with a constant integer c ≥
2. Related to this problem is the suffixselection problem, i.e., to find the i -th lexicographically smallest suffix for a given integer i . Interestingly,Franceschini and Muthukrishnan [14] showed that the suffix selection problem can be solved in O ( n ) timein the comparison model, whereas suffix sorting is solved in Θ( n lg n ) time within the same model. References [1] S. Alstrup, G. S. Brodal, and T. Rauhe. Pattern matching in dynamic texts. In
Proc. SODA , pages819–828. ACM/SIAM, 2000.[2] A. Andersson and S. Nilsson. A new efficient radix sort. In
Proc. FOCS , pages 714–721. IEEE ComputerSociety, 1994.[3] H. Bannai. Grammar compression. In
Encyclopedia of Algorithms , pages 861–866. Springer, 2016.[4] J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In
Proc. SODA ,pages 360–369. ACM/SIAM, 1997.[5] P. Bille, I. L. Gørtz, B. Sach, and H. W. Vildhøj. Time-space trade-offs for longest common extensions.
J. Discrete Algorithms , 25:42–50, 2014.[6] P. Bille, I. L. Gørtz, M. B. T. Knudsen, M. Lewenstein, and H. W. Vildhøj. Longest common extensionsin sublinear space. In
Proc. CPM , volume 9133 of
LNCS , pages 65–76. Springer, 2015.[7] P. Bille, J. Fischer, I. L. Gørtz, T. Kopelowitz, B. Sach, and H. W. Vildhøj. Sparse text indexing insmall space.
ACM Trans. Algorithms , 12(3):39:1–39:19, 2016.[8] T. M. Chan, J. I. Munro, and V. Raman. Selection and sorting in the ”restore” model. In
Proc. SODA ,pages 995–1004. SIAM, 2014.[9] C. J. Colbourn and A. C. H. Ling. Quorums from difference covers.
Inf. Process. Lett. , 75(1-2):9–12,2000.[10] G. Cormode and S. Muthukrishnan. Substring compression problems. In
Proc. SODA , pages 321–330.SIAM, 2005.[11] G. Cormode and S. Muthukrishnan. The string edit distance matching problem with moves.
ACMTransactions on Algorithms , 3(1):2:1–2:19, 2007.[12] J. Fischer, T. I, and D. K¨oppl. Deterministic sparse suffix sorting on rewritable texts. In
LATIN , volume9644 of
LNCS , pages 483–496. Springer, 2016.[13] G. Franceschini and R. Grossi. No sorting? Better searching!
ACM Trans. Algorithms , 4(1):2:1–2:13,2008.[14] G. Franceschini and S. Muthukrishnan. Optimal suffix selection. In
Proc. STOC , pages 328–337, 2007.[15] S. Fukunaga, Y. Takabatake, T. I, and H. Sakamoto. Online grammar compression for frequent patterndiscovery. In
Proc. ICGI , volume 57 of
Workshop and Conference Proceedings , pages 93–104. JMLR,2016.[16] P. Gawrychowski and T. Kociumaka. Sparse suffix tree construction in optimal time and space. In
Proc. SODA , pages 425–439. SIAM, 2017. 3817] A. Goldberg, S. Plotkin, and G. Shannon. Parallel symmetry-breaking in sparse graphs. In
Proc. STOC ,pages 315–324. ACM, 1987.[18] K. Goto. Optimal time and space construction of suffix arrays and LCP arrays for integer alphabets.
ArXiv CoRR , abs/1703.01009, 2017.[19] T. I. Longest common extensions with recompression. In
Proc. CPM , volume 78 of
LIPIcs , pages18:1–18:15. Schloss Dagstuhl, 2017.[20] T. I, J. K¨arkk¨ainen, and D. Kempa. Faster sparse suffix sorting. In
Proc. STACS , volume 25 of
LIPIcs ,pages 386–396. Schloss Dagstuhl, 2014.[21] R. W. Irving and L. Love. The suffix binary search tree and suffix AVL tree.
J. Discrete Algorithms , 1(5-6):387–408, 2003.[22] G. J. Jacobson. Space-efficient static trees and graphs. In
Proc. FOCS , pages 549–554. IEEE ComputerSociety, 1989.[23] J. K¨arkk¨ainen and D. Kempa. LCP array construction using O ( sort ( n )) (or less) I/Os. In Proc. SPIRE ,volume 9954 of
LNCS , pages 204–217. Springer, 2016.[24] J. K¨arkk¨ainen and E. Ukkonen. Sparse suffix trees. In
Proc. COCOON , volume 1090 of
LNCS , pages219–230. Springer, 1996.[25] J. K¨arkk¨ainen, P. Sanders, and S. Burkhardt. Linear work suffix array construction.
J. ACM , 53(6):918–936, 2006.[26] Z. Khan, J. S. Bloom, L. Kruglyak, and M. Singh. A practical algorithm for finding maximal exactmatches in large sequence datasets using sparse suffix arrays.
Bioinformatics , 25(13):1609–1616, 2009.[27] R. Kolpakov and G. Kucherov. Finding maximal repetitions in a word in linear time. In
Proc. FOCS ,pages 596–604. IEEE Computer Society, 1999.[28] R. Kolpakov, G. Kucherov, and T. A. Starikovskaya. Pattern matching on sparse suffix trees. In
Proc.CCP , pages 92–97, 2011.[29] Z. Li, J. Li, and H. Huo. Optimal in-place suffix sorting.
ArXiv CoRR , abs/1610.08305, 2016.[30] S. Maruyama, M. Nakahara, N. Kishiue, and H. Sakamoto. ESP-index: A compressed index based onedit-sensitive parsing.
J. Discrete Algorithms , 18:100–112, 2013.[31] K. Mehlhorn, R. Sundar, and C. Uhrig. Maintaining dynamic sequences under equality-tests in poly-logarithmic time. In
Proc. SODA , pages 213–222. SIAM, 1994.[32] T. Nishimoto, T. I, S. Inenaga, H. Bannai, and M. Takeda. Fully dynamic data structure for LCEqueries in compressed space. In
Proc. MFCS , volume 58 of
LIPIcs , pages 72:1–72:15. Schloss Dagstuhl,2016.[33] G. Nong, S. Zhang, and W. H. Chan. Two efficient algorithms for linear time suffix array construction.
IEEE Trans. Computers , 60(10):1471–1484, 2011.[34] R. Pagh. Low redundancy in static dictionaries with constant query time.
SIAM J. Comput. , 31(2):353–363, 2001.[35] N. Prezza. In-place sparse suffix sorting. In
Proc. SODA , pages 1496–1508. SIAM, 2018.[36] S. J. Puglisi, W. F. Smyth, and A. Turpin. A taxonomy of suffix array construction algorithms.
ACMComput. Surv. , 39(2):1–31, 2007. 3937] N. Rahman and R. Raman. Rank and select operations on binary strings. In
Encyclopedia of Algorithms ,pages 748–751. Springer, 2008.[38] S. C. Sahinalp and U. Vishkin. Symmetry breaking for suffix tree construction. In
Proc. STOC , pages300–309. ACM, 1994.[39] H. Sakamoto, S. Maruyama, T. Kida, and S. Shimozono. A space-saving approximation algorithm forgrammar-based compression.
IEICE Transactions , 92-D(2):158–165, 2009.[40] Y. Takabatake, Y. Tabei, and H. Sakamoto. Improved ESP-index: A practical self-index for highlyrepetitive texts. In
Proc. SEA , volume 8504 of
LNCS , pages 338–350. Springer, 2014.[41] Y. Takabatake, K. Nakashima, T. Kuboyama, Y. Tabei, and H. Sakamoto. siEDM: An efficient stringindex and search algorithm for edit distance with moves.
Algorithms , 9(2):26:1–26:18, 2016.[42] Y. Tanimura, T. I, H. Bannai, S. Inenaga, S. J. Puglisi, and M. Takeda. Deterministic sub-linear spaceLCE data structures with efficient construction. In
Proc. CPM , volume 54 of
LIPIcs , pages 1:1–1:10.Schloss Dagstuhl, 2016.[43] Y. Tanimura, T. Nishimoto, H. Bannai, S. Inenaga, and M. Takeda. Small-space LCE data structurewith constant-time queries. In
Proc. MFCS , volume 83 of
LIPIcs , pages 10:1–10:15. Schloss Dagstuhl,2017.[44] M. Vyverman, B. De Baets, V. Fack, and P. Dawyndt. essaMEM: finding maximal exact matches usingenhanced sparse suffix arrays.
Bioinformatics , 29(6):802–804, 2013.40
A Lower Bound on the Number of Fragile ESP Tree Nodes
Here, we present two examples reveiling that the ESP technique changes Ω(lg n ) nodes when changing asingle character. Each example contains a large number of type M meta-blocks in a specific constellation.Remembering how the ESP technique parses its input, a remaining single symbol neighbored by two repeatingmeta-blocks is fused with one of them to form a type M meta-block. Regardless of whether we favor to fuse aremaining symbol with either its preceding or succeeding (cf. Rule M) repeating meta-block to form a type M meta-block, for each of the two tie breaking rules, we give an example string of length at most n whose ESPtree has Ω(lg n ) fragile nodes. These examples contradict Lemma 9 in [11], where it is claimed that thereare O (lg ∗ n lg n ) fragile nodes in the ESP tree of a text of length n . A.1 Fusing with the Preceding Repeating Meta-Block
Consider a type 1 meta-block µ whose right-most node is fragile. If the leftmost node of arepeating meta-block ν is built on µ ’s rightmostnode, then the rightmost node of ν can also befragile. µν Having this idea in mind, we build an example consisting of a chain of repeating meta-blocks, where theleftmost node of a repeating meta-block is built on the fragile rightmost node of a meta-block of one depthbelow (shaded in the right picture). The main idea is the following: Each meta-block of this chain can be ofarbitrary (but sufficiently long) length. Keeping in mind that changing the name of a node means that thenames of its ancestors also have to change, we can create an example string whose ESP tree contains fragilenodes appearing on each height at arbitrary positions:
Example 35.
Let a , b and c ∈ Σ be three different characters. The text Y := ( X ) k ( X ) k − ( X ) k − · · · ( X k − ) with k := b log ( n/ log n ) c has a length at most n , and its ESP tree has Ω(lg n ) fragile nodes, where X := a , and X i := ( X i − b i − if i is odd ,X i − c i − if i is even , for i = 1 , . . . , k. X = a , X = aab , X = aabaabc , X = X b ,... To show that the claim in the example is correct, we insert a lemma showing the associativity of esp ona special class of strings:
Lemma 36.
Contrary to Rule M, assume that we favor fusing a remaining character with its precedingmeta-block to form a type M meta-block. Given a height h and two strings X, Y that are either empty orhave a length of at least 2 · h − , esp ( h ) ( X b i Y ) = esp ( h ) ( X ) esp ( h ) ( b i ) esp ( h ) ( Y ) if i ≥ h , b is neither asuffix of X nor a prefix of Y , and there is no prefix of esp ( j ) ( Y ) of the form cd k for some characters c , d ∈ Σ j with c = d , and integers k, j with k ≥ ≤ j ≤ h − Proof.
The additional requirement for Y is to ensure that the leftmost block of esp ( j ) ( Y ) is not a non-repetitive type M block that has been fused to its succeeding meta-block, only because it has no precedingmeta-block. Regardless of which characters are prepended to esp ( j − ( Y ), the first character of such a blockwould form with its preceding characters a new block.For h = 1, esp divides the string X b i Y into meta-blocks such that there is one type 1 meta-block µ thatexactly contains the substring b i . That is because of the following: If X (resp. Y ) is not the empty string,then it contains at least two characters. Since we favor fusing with the preceding meta-block, there is nochance that characters of X can enter µ . Assume that Y is not the empty string. Since the first block of esp ( Y ) is neither a non-repetitive type M block nor a block starting with b , it is not possible that charactersof this block can enter µ .Under the assumption that the claim holds for a given h − ≥
0, we have esp ( h ) ( X b i +1 Y ) = esp ( esp ( h − ( X b i +1 Y )) = esp ( esp ( h − ( X ) esp ( h − ( b i +1 ) esp ( h − ( Y )) . esp ( h ) ( X ) and esp ( h ) ( Y ) are either empty or contain at least two characters. Since i ≥ h , esp ( h − ( b i ) is the repetition of the same character. This repetition has a length of at least three such thatwe can apply the shown associativity for h = 1 to show the claim. Proof of Ex. 35.
We start with determining the length of Y . Since | X | = 3 , under the assumption that | X i | = 3 i , we obtain that | X i +1 | = 2 | X i | + 3 i = 3 i +1 . Therefore, (cid:12)(cid:12)(cid:12) X k − i i (cid:12)(cid:12)(cid:12) = 3 k for all i = 0 , . . . , k −
1. Weconclude that the length of Y is at most n , since | Y | = k k ≤ n log ( n/ log n ) / log n ≤ n .We now show that each substring X i of Y is the generated substring of a node x i of ET ( Y ) on height i whose subtree is equal to the perfect ternary subtree T i := ET ( X i ), for i = 1 , . . . , k −
1. This is true for i = 1 , ,
3, as can be seen in Fig. 32. For the general case, we adapt the associativity shown for esp inLemma 36 to the string X i : Sub-Claim.
For every i with 0 ≤ i ≤ k − (cid:12)(cid:12) esp ( i +1) ( X i +1 ) (cid:12)(cid:12) = 1,(II) esp ( h ) ( X i +1 ) = esp ( h ) ( X i X i d i i ) = esp ( h ) ( X i X i ) esp ( h ) ( d i i ) = esp ( h ) ( X i ) esp ( h ) ( X i ) esp ( h ) ( d i i ), and(III) esp ( h ) ( X i +1 ) starts with a repetition of a character,for every h with 0 ≤ h ≤ i , where d i is a character with d i = b if i is even, otherwise d i = c . Sub-Proof.
For i = 0 we have(I) (cid:12)(cid:12) esp (1) ( X ) (cid:12)(cid:12) = | esp ( aab ) | = 1 ( aab is put in a type M meta-block having exactly one block),(II) esp (0) ( X ) = X , and(III) X = aab starts with a repetition of the character a .Under the assumption that the claim holds for an integer i , we conclude that it holds for i + 1 due to esp ( h ) ( X i +2 ) = esp ( h ) ( X i +1 X i +1 d i +1 i +1 )= esp ( h ) ( X i X i d i i X i X i d i i d i +1 i +1 ) (Lemma 36 , d i = d i +1 ) = esp ( h ) ( X i X i d i i X i X i d i i ) esp ( h ) ( d i +1 i +1 ) (Lemma 36,(I) or (III)) = esp ( h ) ( X i X i ) esp ( h ) ( d i i ) esp ( h ) ( X i X i ) esp ( h ) ( d i i ) esp ( h ) ( d i +1 i +1 ) (Lemma 36,(I) or (III)) = esp ( h ) ( X i X i d i i ) esp ( h ) ( X i X i d i i ) esp ( h ) ( d i +1 i +1 )= esp ( h ) ( X i +1 ) esp ( h ) ( X i +1 ) esp ( h ) ( d i +1 i +1 )for 1 ≤ h ≤ i . The conditions of Lemma 36 hold because d i is neither a prefix nor a suffix of X i , d i = d i +1 , | X i X i | = 2 · i , and esp ( h ) ( X i X i ) starts with a repetition of a character due to ( (III) for h < i , or due to esp ( i ) ( X i X i ) = ( II ) esp ( i ) ( X i ) esp ( i ) ( X i ) and (I) for h = i. For h = i + 1 we use that (I) holds for X i , (cid:12)(cid:12)(cid:12) esp ( i ) ( d i i ) (cid:12)(cid:12)(cid:12) = 1, and esp ( i ) ( d i +1 i +1 ) is a repetition of length 3 ofthe same character, to obtain esp ( i +1) ( X i +2 ) = esp ( esp ( i ) ( X i +2 ))= esp ( esp ( i ) ( X i X i ) esp ( i ) ( d i i ) esp ( i ) ( X i X i ) esp ( i ) ( d i i ) esp ( i ) ( d i +1 i +1 )) (Lemma 36) = esp ( esp ( i ) ( X i X i ) esp ( i ) ( d i i ) esp ( i ) ( X i X i ) esp ( i ) ( d i i )) esp ( esp ( i ) ( d i +1 i +1 )) (evaluate and reformulate) = esp ( esp ( i ) ( X i X i ) esp ( i ) ( d i i )) esp ( esp ( i ) ( X i X i ) esp ( i ) ( d i i )) esp ( esp ( i ) ( d i +1 i +1 )) , esp puts esp ( i ) ( X i X i ) esp ( i ) ( d i i ) into a single type M meta-blockof length three, and that d i is neither a prefix nor a suffix of X i . This concludes (II). A consequenceis (III): For h ≤ i we have esp ( h ) ( X i +2 ) = esp ( h ) ( X i +1 ) esp ( h ) ( X i +1 ) esp ( h ) ( d i +1 i +1 ), and esp ( h ) ( X i +1 ) startswith a repetition of a character according to our assumption. For h = i + 1 we have esp ( i +1) ( X i +2 ) = esp ( esp ( i ) ( X i ) esp ( i ) ( X i ) esp ( i ) ( d i i ) esp ( i ) ( X i ) esp ( i ) ( X i ) esp ( i ) ( d i i ) esp ( i ) ( d i +1 i +1 )). Due to (I), (cid:12)(cid:12) esp ( i ) ( X i ) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12) esp ( i ) ( d i i ) (cid:12)(cid:12)(cid:12) = 1; hence the last application of esp creates three blocks, where each of the first two representsthe string esp ( i ) ( X i ) esp ( i ) ( X i ) esp ( i ) ( d i i ) of length three. Another application of esp yields (I). (cid:4) Let b i and c i denote the names of the roots of ET ( b i ) and of ET ( c i ), respectively. Set d i := b i if i is even, otherwise d i := c i . Then h X i +1 i i +1 = x i +1 due to Sub-Claim (I), and h X i +1 i i = x i x i d i due toSub-Claim (II). Consequently, esp (( h X i +1 i i ) k − i − ) = esp (( x i x i d i ) k − i − ) = ( esp ( x i x i d i )) k − i − = x k − i − i +1 . (1)This means that h X i i k − i h = h X k − i i i h is a repetitionof length 3 k − h consisting of the same name, for everyheight h = i, . . . , k . We conclude that T i := ET (( X i ) k − i ) isa perfect ternary tree. X k X k − X k − height k ET ( Y ) T T T k − ······ Y = Finally, we show that esp ( h ) ( Y ) = esp ( h ) ( X k ) · · · esp ( h ) ( X k − ) for each height h with 1 ≤ h ≤ k . On theone hand, we have esp ( h ) ( X k − i i X k − i − i +1 ) = esp ( h ) ( X k − i − i X i − X i − d i − i − X k − i − i +1 ) (III) with 0 ≤ i ≤ h − = esp ( h ) ( X k − i − i X i − X i − ) esp ( h ) ( d i − i − ) esp ( h ) ( X k − i − i +1 )= esp ( h ) ( X k − i − i X i − X i − d i − i − ) esp ( h ) ( X k − i − i +1 )= esp ( h ) ( X k − i i ) esp ( h ) ( X k − i − i +1 ) (2)for 1 ≤ h ≤ i − esp ( h ) ( X k − i i X k − i − i +1 ) = esp ( h − i +1) ( esp ( i − ( X k − i i X k − i − i +1 )) Eq. (2) = esp ( h − i +1) ( esp ( i − ( X k − i i ) esp ( i − ( X k − i − i +1 )) Eq. (1) = esp ( h − i ) ( esp (( x i − x i − d i − ) k − i ( x i − x i − d i − x i − x i − d i − h d i i i i − ) k − i − )) (apply esp ) = esp ( h − i ) ( x k − i i ( x i x i d i ) k − i − )= esp ( h − i − ( esp ( x k − i i x i x i d i ) esp (( x i x i d i ) k − i − )) (evaluate and reformulate) = esp ( h − i − ( esp ( x k − i i ) esp (( x i x i d i ) k − i − )) Eq. (1) = esp ( h − i − ( esp ( x k − i i ) esp ( x k − i − i +1 )) (Lemma 36) = esp ( h − i ) ( x k − i i ) esp ( h − i ) ( x k − i − i +1 ) (3)for i ≤ h ≤ k . It is easy to extend the pairwise associativity X k − i i X k − i − i +1 for each i with 0 ≤ i ≤ k − X k · · · X k − . This concludes that the root of T i has the same name as the i -th leftmost node of ET ( Y ) onheight k . Figure 33(left) shows an excerpt of T i and T i +1 . The crucial step in Eq. (3) is the re-formulationof the parsing esp ( h − i − ( esp ( x k − i i x i x i d i ) | {z } belongs to T i esp (( x i x i d i ) k − i − ) | {z } belongs to T i +1 ) = esp ( h − i − ( esp ( x k − i i ) esp (( x i x i d i ) k − i − ) | {z } =: µ i +1 ) (4)43 x d x x d b b b x x d x aab aab ccc aab aab ccc bbb bbb bbbtype M type M type 1 type M type M type 1 type 1 Figure 32: ET ( X ) as defined in Ex. 35. The subtree of each node with name x i is equal to ET ( X i ).showing that there is a type 1 meta-block µ i +1 covering all nodes of T i +1 and the rightmost node of T i , onheight i + 1. This meta-block is a repetition of the character esp ( x i x i d i ) = x i +1 ∈ Σ h +1 .Given that µ is the first type 1 meta-block of esp ( Y ) (covering the prefix X h +20 ), we now examine whathappens with µ i for each i with 0 ≤ i ≤ h − a from Y . Let us call the shortenedstring Y , i.e., Y = a Y . On removing the first a from Y , we claim that the meta-block µ i contains onecharacter x i less, for every i with 0 ≤ i ≤ h − h Y i i and h Y i i on height i with 0 ≤ i ≤ k − µ , this is trivial. For an i ≥
0, focus on the substring X k − i i X k − i − i +1 of Y : We have esp ( h X k − i i X k − i − i +1 i i ) = esp ( x k − i i ( x i x i d i ) k − i − ) = esp ( x k − i i | {z } suffix of µ i ) esp (( x i x i d i ) k − i − ) | {z } = µ i +1 = esp ( x k − i i ) x k − i − i +1 due to Eq. (3). Under the assumption that removing the first character a from Y causes µ i to shrink by onecharacter x i ∈ Σ i , we get esp ( x k − i − i ( x i x i d i ) k − i − ) = esp ( x k − i i x i d i ) esp (( x i x i d i ) k − i − − )= esp ( x k − i i x i d i ) x k − i − − i +1 = esp ( x k − i − i ) x k − i − i +1 . We observe that the length of µ i is decremented by one, causing the name of its rightmost block tochange, which is the leftmost node of T i +1 on height i + 1, and the first character of µ i +1 . Due to the tiebreaking rule, this block gets fused with its preceding meta-block at height i + 1, decrementing the lengthof its succeeding meta-block µ i +1 by one (and hence, this process repeats for all i = 0 , . . . , k − i of T i changes, for 1 ≤ i ≤ k −
1. Each of these nodes receivesa new name such that it is fused with its preceding type 1 meta-block to form a type M meta-block. Sincechanging a node on height i changes all its ancestors, at least k − i nodes are changed in T i . In total, atleast k + ( k −
1) + ( k −
2) + · · · + 2 = ( k + k ) / − k ) = Ω(log ( n/ log n )) = Ω(lg n ) fragile nodes.Note that the later introduced HSP technique (see Sect. 3) with the same tie breaking rule also producesΩ(lg n ) fragile nodes in this example. A.2 Fusing with the Succeeding Repeating Meta-Block
The idea is similar to the previous example. In particular, we introduce a corollary of Lemma 36:
Corollary 37.
Given a height h and a string Y that is either empty or has a length of at least 2 · h − , esp ( h ) ( XY ) = esp ( h ) ( X ) esp ( h ) ( Y ) if a is not a prefix of Y , where X = b i a j with i + j ≥ h , and a , b ∈ Σwith a = b . 44 i +1 · · · y i +1 x i +1 x i +1 x i +1 · · · x i x i x i · · · x i x i x i x i x i d i x i x i d i x i x i d i ... height i + 1height i type 1 type 1type M type M type M belongs to T i belongs to T i +1 y i +1 · · · y i +1 x i +1 x i +1 x i +1 · · · x i x i x i · · · x i x i x i x i d i x i x i d i x i x i d i ... height i + 1height i type M type 1type M type M type M Figure 33: Differences between ET ( Y ) ( left ) and ET ( Y ) ( right ) on the heights i and i + 1, where Y = a k ( a b ) k − (( a b ) c ) k − · · · and Y = a Y (defined in Ex. 35). The names y i +1 and x i +1 are only used inthis figure.In the following example, we build a text whose ESP tree has a specific type M meta-block on each heightthat we want to change. Given a type M meta-block µ that emerged from prepending a character to a type 1 meta-block, we can create a new meta-block by prepending another character such that it precedes µ and absorbs µ ’s first character ( µ then returns to be a type 1 meta-block). We can arrange the type M meta-blocks such that prepending a character to the text changes a type M meta-block on each height: Example 38.
Let k = b log ( n/ log n ) c be a natural number, and a , b ∈ Σ. Define Y := X X · · · X k with X i := b i a k − i , for 0 ≤ i ≤ k −
1. Then | Y | ≤ n , and ET ( Y ) has Ω(lg n ) fragile nodes. Proof.
Given an integer i with 0 ≤ i ≤ k −
1, we have | X i | = 3 k and | Y | = k k ≤ n . Corollary 37 yields esp ( h ) ( X i ) = esp ( h ) ( b i ) esp ( h ) ( a k − i ) for all heights h with 0 ≤ h ≤ i , since 3 k − i ≥ k − k − = 2 · k − .Let a i := h a k i i [1] and b i := h b k i i [1] be the nodes on height i with 0 ≤ i ≤ k and string ( a i ) = a i orrespectively string ( b i ) = b i ( a := a , b := b ), esp ( esp ( h − ( X i )) partitions its input esp ( h − ( X i ) into twometa-blocks: a type 1 meta-block containing all b i ’s, and a subsequent type 1 meta-block containing all a i ’s.All blocks of these two meta-blocks contain three characters, since each meta-block has a length that is equalto a power of three. For the upper heights we get esp ( h + i ) ( X i ) = esp ( h ) ( |·| =3 k − i z }| { esp ( i ) ( b i ) | {z } = b i esp ( i ) ( a k − i ) | {z } = a k − i − i ) for 0 ≤ h + i ≤ k − . (5)Hence, esp ( h + i ) ( X i ) consists of exactly one type M meta-block, which has the length 3 k − h − i , and each blockcontains three characters. We conclude that the tree T i := ET ( X i ) is a perfect ternary tree, for 0 ≤ i ≤ k − (cid:12)(cid:12) esp ( h ) ( X i ) (cid:12)(cid:12) = 3 k − h for all i, h with 0 ≤ i ≤ k − ≤ h ≤ k , with Cor. 37 it is easy to see that esp ( h ) ( Y ) = esp ( h ) ( X · · · X k − ) = esp ( h ) ( X ) · · · esp ( h ) ( X k − ) for all 0 ≤ h ≤ k . A conclusion is that X i isthe generated substring of the i -th leftmost node ET ( Y ) on height k whose name is equal to the name of theroot of T i , for 0 ≤ i ≤ k − a to Y and call the new string Y , i.e., Y = a Y . Our analysis of thedifference between ET ( Y ) and ET ( Y ) focuses on the unique meta-block at height i of T i : From Eq. (5) with h = 0, we observe that there is a single meta-block µ i at height i of T i , and this meta-block is a type M meta-block (cf. Fig. 34(right)). Our claim is that prepending a to Y changes the blocks of the borders of every µ i (0 ≤ i ≤ k − a forms a type 2 meta-block with the first character of X by “stealing”the first character from µ , and this character is a b = b . Assume that µ i (0 ≤ i ≤ k −
1) looses its firstcharacter (i.e., b i ). By relinquishing this character, µ i becomes a type 1 meta-block, consisting only of a i ’s.The last two a i ’s contained in µ are grouped into a block a i +1 of length two , where a i +1 := h a i a i i [1] is thename of the root node of ET ( a i a i ). Every newly appearing node a i +1 gets combined with its right-adjacentnode b i +1 to form a new type 2 meta-block. The used b i +1 is stolen from µ i +1 , and hence we observe aniterative process of stealing the first character b i +1 from µ i +1 for each height i = 0 , . . . , k −
2. Figure 35visualizes this observation on the lowest two heights.This can be inductively proven for each even integer i with 0 ≤ i ≤ h −
2. By Eq. (5), we know that45 a b a a root of T root of T baa aaa aaa bbb aaa aaatype M type 1 type 1 µ b i a h − i µ i i Figure 34: ET ( Y ) of the example string Y defined in Ex. 38 with k = 2 ( left ) and as a schematic illustration( right ) with the meta-block µ i on height i (due to space issues the number of nodes/children is incorrect). h X i i i = b i a k − i − i and h X i +1 i i = b i a k − i − i . Then esp ( esp ( h X i i i h X i +1 i i )) = esp ( esp ( b i a k − i − i b i a k − i − i )) (Cor. 37) = esp ( esp ( b i a i a i ) esp ( a k − i − i ) b i +1 a k − i − − i +1 )= esp ( esp ( b i a i a i ) a k − i − − i +1 b i +1 a k − i − − i +1 ) (Cor. 37) = esp ( esp ( b i a i a i ) a k − i − − i +1 ) esp ( b i +1 a k − i − − i +1 )= esp ( esp ( b i a i a i ) a k − i − − i +1 ) esp ( b i +1 a i +1 a i +1 ) esp ( a k − i − − i +1 ) , and esp ( a k − i − − i +1 ) = a k − i − − i +2 . Adding a i (set a := a ) to the string h X i i i h X i +1 i i yields esp ( esp ( a i h X i i i h X i +1 i i )) = esp ( esp ( a i b i a k − i − i b i a k − i − i )) (Cor. 37) = esp ( esp ( a i b i ) esp ( a k − i − i ) b i +1 a k − i − − i +1 )= esp ( esp ( a i b i ) esp ( a k − i − i a i a i ) b i +1 a k − i − − i +1 )= esp ( esp ( a i b i ) a k − i − − i +1 a i +1 b i +1 a k − i − − i +1 ) (Cor. 37) = esp ( esp ( a i b i ) a k − i − − i +1 ) esp ( a i +1 b i +1 ) esp ( a k − i − − i +1 a i +1 a i +1 )= esp ( esp ( a i b i ) a k − i − − i +1 ) esp ( a i +1 b i +1 ) a k − i − − i +2 a i +2 , and a i +2 carries on to the nodes h X i +2 i i +2 h X i +3 i i +2 on height i + 2 due to Cor. 37.Overall, the leftmost and rightmost node on height i + 1 of T i changes, for i = 0 , . . . , k −
1. In total,Ω(lg k ) nodes are changed. 46 a a a b a a a a a a · · · a a a a a a · · · a baa aaa aaa aaa aaa bbb aaa aaa aaa aaa aaa aaa · · · aaa aaa aaa aaatype M type 1 type 1 belongs to T belongs to T a a a a b a a a a · · · a a a a a a a a · · · a a ab aaa aaa aaa aa bbb aaa aaa aaa aaa · · · aaa aaa aaa aaa aaa aaatype 2 type 1 type 1 type 1 Figure 35: Excerpt of the ESP trees ET ( Y ) ( top ) and ET ( Y ) ( bottom ), where Y = ba k − b a k − b a k − · · · and Y = a Y (defined in Ex. 38). Due to space issues we contracted T to ET ( ba ). Note that right of therightmost a (bottom figure, top right node) is the node b (not shown in the figure due to space issues),and both nodes are combined into a type 2type 2