On Stabbing Queries for Generalized Longest Repeat
Bojian Xu
Department of Computer Science, Eastern Washington University, WA 99004, USA. [email protected]
Abstract: A longest repeat query on a string, motivated by its applications in many subfields including computational biology, asks for the longest repetitive substring(s) covering a particular string position (point query). In this paper, we extend the longest repeat query from point query to interval query, allowing the search for longest repeat(s) covering any position interval, which significantly improves the usability of the solution. Our method for interval query takes a different approach, using the insight from a recent work on shortest unique substrings [1], because the prior work's approach for point query becomes infeasible in the setting of interval query. Using the critical insight from [1], we propose an indexing structure, which can be constructed in optimal $O(n)$ time and space for a string of size $n$, such that any future interval query can be answered in $O(1)$ time. Further, our solution can find all longest repeats covering any given interval in optimal $O(occ)$ time, where $occ$ is the number of longest repeats covering that interval, whereas the prior $O(n)$-time and space work can find only one candidate for each point query. Experiments with real-world biological data show that our proposal is competitive with prior works in both time and space, while providing the new functionality of interval queries as opposed to the point queries provided by prior works.

Keywords: string, repeats, longest repeats, stabbing query

I. INTRODUCTION
Finding repetitive structures and regularities in genomes and proteins is important because these structures play important roles in the biological functions of genomes and proteins [2]. One of the well-known features of DNA is its repetitive structure, especially in the genomes of eukaryotes. For example, about one-third of the whole human genome consists of repeated substrings [3], and about 10–25% of all known proteins have some form of repetitive structure [4]. In addition, a number of significant problems in molecular string analysis can be reduced to repeat finding [5]. Therefore, it is of great interest for biologists to find such repeats in order to understand their biological functions and solve other problems.

There has been an extensive body of work on repeat finding in the communities of bioinformatics and stringology. The notions of maximal repeat and super maximal repeat [2], [6], [7], [8] capture all the repeats of the whole string in a space-efficient manner. Maximal repeat finding over multiple strings and its duality with minimum unique substrings are also understood [9], [10], [11]. We refer readers to [2] (Section 7.11) for discussion and further pointers to other types of repetitive structures, such as palindromes and tandem repeats. However, none of these notions of repeats tracks the locality of each repeat, and thus it is difficult for them to support position-specific queries (stabbing queries) in an efficient manner.

For this reason, the longest repeat query was recently proposed; it asks for the longest repetitive substring(s) covering a particular string position [12], [13], [14]. Because any substring of a repetitive substring is also repetitive, the longest repeat query effectively provides a "stabbing" tool for finding most of the repeats that cover any particular string position. The algorithm by Schnattinger et al. [13] for computing bidirectional matching statistics can be used to compute the rightmost longest repeat covering every string position, whereas the study by İleri et al. [12] can find the leftmost longest repeat for every string position. Both solutions use optimal $O(n)$ time and space for finding the longest repeat for all the $n$ string positions. By storing the pre-computed longest repeats of every position, they can answer any future longest repeat query in $O(1)$ time, and thus achieve amortized $O(1)$ time cost in finding the longest repeat of any arbitrary string position. Since it is not clear how to parallelize the optimal algorithms in [12], [13], the recent study in [14] proposed a time sub-optimal but parallelizable algorithm, so as to take advantage of modern multi-processor computing platforms such as general-purpose graphics processing units.

A preliminary version of this work appeared as a regular paper in the Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 9–12, 2015, Washington D.C., USA.

II. PROBLEM STATEMENT
We consider a string $S[1..n]$, where each character $S[i]$ is drawn from an alphabet $\Sigma = \{1, 2, \ldots, \sigma\}$. A substring $S[i..j]$ of $S$ represents $S[i]S[i+1] \ldots S[j]$ if $1 \le i \le j \le n$, and is an empty string if $i > j$. String $S[i'..j']$ is a proper substring of another string $S[i..j]$ if $i \le i' \le j' \le j$ and $j' - i' < j - i$. The length of a non-empty substring $S[i..j]$, denoted as $|S[i..j]|$, is $j - i + 1$. We define the length of an empty string as zero. A prefix of $S$ is a substring $S[1..i]$ for some $i$, $1 \le i \le n$. A proper prefix $S[1..i]$ is a prefix of $S$ where $i < n$. A suffix of $S$ is a substring $S[i..n]$ for some $i$, $1 \le i \le n$. A proper suffix $S[i..n]$ is a suffix of $S$ where $i > 1$.

We say character $S[i]$ occupies the string position $i$. We say substring $S[i..j]$ covers the position interval $[x..y]$ of $S$ if $i \le x \le y \le j$. In the case $x = y$, we say substring $S[i..j]$ covers the position $x$ (or $y$) of string $S$. For two strings $A$ and $B$, we write $A = B$ (and say $A$ is equal to $B$) if $|A| = |B|$ and $A[i] = B[i]$ for $i = 1, 2, \ldots, |A|$. We say $A$ is lexicographically smaller than $B$, denoted as $A < B$, if (1) $A$ is a proper prefix of $B$, or (2) $A[1] < B[1]$, or (3) there exists an integer $k > 1$ such that $A[i] = B[i]$ for all $1 \le i \le k-1$ but $A[k] < B[k]$.

A substring $S[i..j]$ of $S$ is unique if there does not exist another substring $S[i'..j']$ of $S$ such that $S[i..j] = S[i'..j']$ but $i \ne i'$. A character $S[i]$ is a singleton if it is unique. A substring is a repeat if it is not unique.

Definition 1. A longest repeat (LR) covering string position interval $[x..y]$, denoted as $LR_x^y$, is a repeat substring $S[i..j]$ such that: (1) $i \le x \le y \le j$, and (2) there does not exist another repeat substring $S[i'..j']$ such that $i' \le x \le y \le j'$ and $j' - i' > j - i$.
Obviously, for any string position interval $[x..y]$, if $S[x..y]$ is not unique, $LR_x^y$ must exist, because at least $S[x..y]$ itself is a repeat. Further, there might be multiple choices for $LR_x^y$. For example, if $S = abcabcddbca$, then $LR_2^3$ can be either $S[1..3] = abc$ or $S[2..4] = bca$.

Problem (generalized stabbing LR query). Given a string position interval $[x..y]$, $1 \le x \le y \le n$, find all choices of $LR_x^y$ or the fact that it does not exist.

We call the generalized stabbing LR query the interval query, which includes the point query as a special case where $x = y$. All prior works [13], [12], [14] studied only the point query. Our goal is to find an efficient mechanism for finding the longest repeats of every possible string position interval.
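To make the problem statement concrete, the following Python sketch answers an interval query by brute force, directly from Definition 1. It is only an illustration and is far slower than anything proposed in this paper; the function name `brute_force_lr` is a hypothetical name chosen for this example. Positions are 1-based, as in the text.

```python
def brute_force_lr(s, x, y):
    """Return all longest repeats covering the 1-based interval [x..y].

    Each answer is a (start, end) pair. An empty list means that no repeat
    covers [x..y]. Illustrative brute force taken straight from Definition 1.
    """
    n = len(s)

    def is_repeat(i, j):
        # S[i..j] is a repeat if it also occurs at another start position.
        sub = s[i - 1:j]
        first = s.find(sub)
        return s.find(sub, first + 1) != -1

    best_len, answers = 0, []
    for i in range(1, x + 1):          # candidate start positions i <= x
        for j in range(y, n + 1):      # candidate end positions j >= y
            if is_repeat(i, j):
                length = j - i + 1
                if length > best_len:
                    best_len, answers = length, [(i, j)]
                elif length == best_len:
                    answers.append((i, j))
    return answers


print(brute_force_lr("abcabcddbca", 2, 3))   # [(1, 3), (2, 4)]
```

On the example string, the two answers are exactly the substrings abc and bca mentioned above.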
III. PRIOR WORK AND OUR CONTRIBUTION
In addition to the related work discussed in Section I, there has recently been a sequence of work on finding shortest unique substrings (SUS) [15], [16], [17], [18], [1], of which Hu et al. [1] studied the generalized version of SUS finding: Given a string position interval $[x..y]$, $1 \le x \le y \le n$, find $SUS_x^y$, the shortest unique substring that covers the string position interval $[x..y]$, or the fact that such $SUS_x^y$ does not exist.

To the best of our knowledge, no efficient reduction from LR finding to SUS finding is known as of now. That is, given a set of SUSes covering a set of position intervals respectively, it is not clear how to find the set of LRs that cover the same set of position intervals respectively, by only using the string $S$, the given set of SUSes, and time linear in the set size for the reduction. Such an efficient reduction is hard to obtain because simply chopping off one ending character of an SUS does not necessarily produce an LR.

For example, suppose $S = a \ldots aba \ldots a$ has $2n+1$ characters, where every character is $a$ except the middle one, which is $b$. Clearly, $SUS_{n-1}^{n} = S[n-1..n+1] = aab$, whereas $LR_{n-1}^{n} = S[1..n]$. Given $SUS_{n-1}^{n}$ and $S$ itself, it is not clear how to find $LR_{n-1}^{n} = S[1..n]$ in $O(1)$ time without involving other auxiliary data structures. (Otherwise, the reduction, which is still unknown, could become so complex that it is no better than a self-contained solution for finding LR, which is what this paper presents.)

Due to the overall importance of repeat finding in bioinformatics and the lack of an efficient reduction from LR finding to SUS finding, we believe that providing and implementing a complete solution for generalized LR finding will be beneficial to the community. In summary, we make the following contributions.

1. We generalize the longest repeat query from point query to interval query, allowing the search for the longest repeat(s) covering any interval of string positions, which significantly improves the usability of the solution.

2. Because there are at most $n$ point queries for a string of size $n$, all prior works pre-compute and save the results of every possible point query, such that any future point query can be answered in $O(1)$ time. However, in the setting of interval queries, there are $\binom{n}{2} + n = \Theta(n^2)$ distinct intervals. It becomes impossible, under the $O(n)$ time and space budget, to achieve amortized $O(1)$ query response time by pre-computing and storing the longest repeats covering each of the $\Theta(n^2)$ intervals. Therefore, a different approach is needed. Our approach uses the insight from the work by Hu et al. [1], which leads us to an indexing structure that can be constructed using optimal $O(n)$ time and space, such that any future interval query can still be answered in $O(1)$ time. The $O(n)$ time and space costs are optimal because reading and saving the input string already needs $O(n)$ time and space.

3. Our work can find all longest repeats covering any given interval using optimal $O(occ)$ time, where $occ$ is the number of the longest repeats covering that interval. In contrast, the works in [12] and [13] can only find the leftmost and the rightmost candidate, respectively, and only support point queries. The algorithm in [14] can find all longest repeats covering a string position, but its parallelizable sequential algorithm is sub-optimal in time cost ($O(n^2)$, indeed) and only supports point queries as well.

4. We provide a generic implementation of our solution without assuming the alphabet size, making the software useful for the analysis of different types of strings. An experimental study with real-world biological data shows that our proposal is competitive with prior works in both time and space, while also supporting interval queries.

IV. PREPARATION
The suffix array $SA[1..n]$ of the string $S$ is a permutation of $\{1, 2, \ldots, n\}$, such that for any $i$ and $j$, $1 \le i < j \le n$, we have $S[SA[i]..n] < S[SA[j]..n]$. That is, $SA[i]$ is the start position of the $i$th suffix in the sorted order of all the suffixes of $S$. The rank array $Rank[1..n]$ is the inverse of the suffix array. That is, $Rank[i] = j$ iff $SA[j] = i$. The longest common prefix (lcp) array $LCP[1..n+1]$ is an array of $n+1$ integers, such that for $i = 2, 3, \ldots, n$, $LCP[i]$ is the length of the lcp of the two suffixes $S[SA[i-1]..n]$ and $S[SA[i]..n]$. We set $LCP[1] = LCP[n+1] = 0$. (In the literature, the lcp array is often defined as an array of $n$ integers. We include an extra zero at $LCP[n+1]$ as a sentinel to simplify the description of our upcoming algorithms.) The following table shows the suffix array and the lcp array of an example string $S = mississippi$.

  i    LCP[i]   SA[i]   suffix
  1    0        11      i
  2    1        8       ippi
  3    1        5       issippi
  4    4        2       ississippi
  5    0        1       mississippi
  6    0        10      pi
  7    1        9       ppi
  8    0        7       sippi
  9    2        4       sissippi
  10   1        6       ssippi
  11   3        3       ssissippi
  12   0        –       –

Definition 2. The left-bounded longest repeat (LLR) starting at position $k$, denoted as $LLR_k$, is a repeat $S[k..j]$, such that either $j = n$ or $S[k..j+1]$ is unique.

Clearly, for any string position $k$, if $S[k]$ is not a singleton, $LLR_k$ must exist, because at least $S[k]$ itself is a repeat. Further, if $LLR_k$ does exist, it must have only one choice, because $k$ is a fixed string position and the length of $LLR_k$ must be as long as possible.

Lemma 1 shows that, by using the rank array and the lcp array of the string $S$, it is easy to calculate any $LLR_i$ if it exists or to detect the fact that it does not exist.

Lemma 1 ([12]). For $i = 1, 2, \ldots, n$:

\[ LLR_i = \begin{cases} S[i..i+L_i-1], & \text{if } L_i > 0 \\ \text{does not exist}, & \text{if } L_i = 0 \end{cases} \]

where $L_i = \max\{LCP[Rank[i]], LCP[Rank[i]+1]\}$.

Observe that an LLR can be a substring (indeed, a proper suffix) of another LLR. For example, suppose $S = ababab$; then $LLR_4 = S[4..6] = bab$, which is a substring of $LLR_3 = S[3..6] = abab$. Neighboring LLRs are related as stated in Lemma 2.
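Lemma 1 can be tried out with a short Python sketch. The suffix array below is built by naive sorting, which is not the linear-time construction cited later in the paper, and the helper names `rank_and_lcp` and `llr_lengths` are hypothetical names chosen for this illustration. Arrays are padded at index 0 so that indices are 1-based, as in the text.

```python
def rank_and_lcp(s):
    """Return 1-based Rank[1..n] and LCP[1..n+1] (index 0 is unused padding).

    The suffix array comes from naive sorting, for illustration only.
    """
    n = len(s)
    sa = [0] + sorted(range(1, n + 1), key=lambda i: s[i - 1:])
    rank = [0] * (n + 1)
    for r in range(1, n + 1):
        rank[sa[r]] = r
    lcp = [0] * (n + 2)            # LCP[1] and LCP[n+1] stay 0 (sentinels)
    for r in range(2, n + 1):
        a, b = s[sa[r - 1] - 1:], s[sa[r] - 1:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[r] = k
    return rank, lcp


def llr_lengths(s):
    """Return [0, L_1, ..., L_n], where L_i = |LLR_i| (0 if LLR_i does not exist)."""
    rank, lcp = rank_and_lcp(s)
    return [0] + [max(lcp[rank[i]], lcp[rank[i] + 1]) for i in range(1, len(s) + 1)]


lengths = llr_lengths("ababab")
for i in (3, 4):
    # Prints LLR_3 = abab and LLR_4 = bab, matching the example in the text.
    print(i, "ababab"[i - 1:i - 1 + lengths[i]])
```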
Lemma 2 ([14]). $|LLR_i| \le |LLR_{i+1}| + 1$.

Definition 3. We say an LLR is useless if it is a substring of another LLR; otherwise, it is useful.

Lemma 3. Any existing longest repeat $LR_x^y$, $1 \le x \le y \le n$, must be a useful LLR.

Proof: (1) We first prove that $LR_x^y$ must be an LLR. Assume that $LR_x^y = S[i..j]$ is not an LLR. Note that $S[i..j]$ is a repeat starting from position $i$. If $S[i..j]$ is not an LLR, then $S[i..j]$ can be extended to some position $j' > j$, so that $S[i..j']$ is still a repeat and also covers the position interval $[x..y]$. That is, $|S[i..j']| > |S[i..j]|$, which contradicts the fact that $S[i..j]$ is already the longest repeat covering the position interval $[x..y]$. (2) Further, $LR_x^y$ must be a useful LLR, because if it were a useless LLR, there would exist another LLR that covers the position interval $[x..y]$ but is longer than $LR_x^y$, which contradicts the fact that $LR_x^y$ is the longest repeat that covers the interval $[x..y]$.

V. LR FINDING FOR ONE INTERVAL
In this section, we propose an algorithm that takes as input a string position interval and returns the LR(s) covering that interval. The algorithm spends $O(n)$ time and space per query but does not need any indexing data structure. We present this algorithm for practitioners who have only a small number of interval queries of interest, for whom this lightweight algorithm will suffice. We start with finding the leftmost LR covering the given interval and give a trivial extension at the end for finding all LRs covering the given interval.

Algorithm 1: Find the leftmost LR_x^y covering a given string position interval [x..y].
  Input: (1) Two integers x and y, 1 ≤ x ≤ y ≤ n, representing a string position interval [x..y].
         (2) The rank array and the lcp array of the string S.
  Output: The leftmost LR_x^y or the fact that LR_x^y does not exist.

  start ← -1; end ← -1;                           // start and end positions of LR_x^y
  for i = x down to 1 do
      L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};    // |LLR_i|
      if L = 0 or i + L - 1 < y then
          break;                                  // Early stop
      else if L ≥ end - start + 1 then            // Pick the leftmost one
          start ← i; end ← i + L - 1;
  return LR_x^y ← (start, end);

Lemma 4. For any $i$, $j$, $x$, and $y$, $1 \le i < j \le x \le y \le n$: if $LLR_j$ does not exist or exists but does not cover the interval $[x..y]$, then $LLR_i$ does not exist or does not cover $[x..y]$.

Proof:
We prove the lemma by contradiction. (1) Assume that $LLR_i$ covers $[x..y]$ although $LLR_j$ exists but does not cover $[x..y]$. Say $LLR_i = S[i..k]$ for some $k \ge y$. It follows that $S[j..k]$ is also a repeat and covers $[x..y]$, which is a contradiction, because $LLR_j$, the longest repeat starting from string position $j$, does not cover $[x..y]$. (2) Assume that $LLR_i$ covers $[x..y]$ although $LLR_j$ does not exist. Say $LLR_i = S[i..k]$ for some $k \ge y$. It follows that $S[j..k]$ is also a repeat and covers $[x..y]$, which is a contradiction, because $LLR_j$ does not exist at all, i.e., $S[j]$ is a singleton.

By Lemma 3, we know any LR must be an LLR, so we can find $LR_x^y$ covering a given interval $[x..y]$ by simply checking each $LLR_i$, $i \le x$, and picking the longest one that covers the interval $[x..y]$. Ties are resolved by picking the leftmost choice. Because of Lemma 4, an early stop is possible, which makes the procedure faster in practice: we check every $LLR_i$ in decreasing order of $i = x, x-1, \ldots, 1$, and the search stops whenever we see an $LLR_i$ that does not cover the interval $[x..y]$ or does not exist at all. Algorithm 1 shows the pseudocode, which returns $(start, end)$, representing the start and ending positions of $LR_x^y$, respectively. If $LR_x^y$ does not exist, $(-1, -1)$ is returned.

Lemma 5.
Given the rank array and the lcp array of the string $S$, for any string position interval $[x..y]$, Algorithm 1 can find $LR_x^y$ or the fact that it does not exist, using $O(x)$ time and $O(n)$ space. If there are multiple choices for $LR_x^y$, the leftmost one is returned.

Proof: The algorithm clearly has no more than $x$ iterations and each iteration takes $O(1)$ time, so it costs $O(x)$ time. The space cost is primarily from the rank array and the lcp array, which altogether is $O(n)$, assuming each integer in these arrays costs a constant number of memory words. If multiple LRs cover position interval $[x..y]$, the leftmost LR will be returned, as is guaranteed by the $\ge$ comparison in the tie-breaking step of Algorithm 1.

Theorem 1. For any position interval $[x..y]$ in the string $S$, we can find $LR_x^y$ or the fact that it does not exist using $O(n)$ time and space. If there are multiple choices for $LR_x^y$, the leftmost one is returned.

Proof: The suffix array of $S$ can be constructed by existing algorithms using $O(n)$ time and space (e.g., [19]). After the suffix array is constructed, the rank array can be trivially created using another $O(n)$ time and space. We can then use the suffix array and the rank array to construct the lcp array using another $O(n)$ time and space [20]. Given the rank array and the lcp array, the time cost of Algorithm 1 is $O(x)$ (Lemma 5). So altogether, we can find $LR_x^y$ or the fact that it does not exist using $O(n)$ time and space. If there are multiple choices for $LR_x^y$, the leftmost choice will be returned, as is claimed in Lemma 5.

Algorithm 2: Find all LRs that cover a given string position interval [x..y].
  Input: (1) Two integers x and y, 1 ≤ x ≤ y ≤ n, representing a string position interval [x..y].
         (2) The rank array and the lcp array of the string S.
  Output: All LRs that cover the position interval [x..y] or the fact that no such LR exists.

  /* Find the length of LR_x^y. */
  length ← 0;
  for i = x down to 1 do
      L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};    // |LLR_i|
      if L = 0 or i + L - 1 < y then
          break;    /* LLR_i does not exist or does not cover [x..y], so we can early stop. */
      else if L > length then
          length ← L;

  /* Find all LRs that cover position interval [x..y]. */
  if length > 0 then                                  // LR_x^y does exist.
      for i = x down to 1 do
          L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};    // |LLR_i|
          if L = 0 or i + L - 1 < y then
              break;                                  // Early stop
          else if L = length then
              Print LR_x^y ← (i, i + length - 1);
  else
      Print LR_x^y ← (-1, -1);                        // LR_x^y does not exist.

Extension: find all LRs covering a given position interval. It is trivial to extend Algorithm 1 to find all the LRs covering any given position interval $[x..y]$ as follows. We first use a procedure similar to Algorithm 1 to calculate $|LR_x^y|$. If $LR_x^y$ does exist, we then start the procedure over again to re-check every $LLR_i$, $i \le x$, and return every LLR whose length is equal to $|LR_x^y|$. Due to Lemma 4, the same early stop as in Algorithm 1 can be used for practical speedup. Algorithm 2 shows the pseudocode of this procedure, which clearly spends an extra $O(x)$ time. Using Theorem 1, we have:

Theorem 2.
For any position interval $[x..y]$ in the string $S$, we can find all choices of $LR_x^y$ or the fact that $LR_x^y$ does not exist, using $O(n)$ time and space.

VI. A GEOMETRIC PERSPECTIVE OF THE USEFUL LLRS AND THE LR QUERIES
In this section, we present a geometric perspective of the useful LLRs and the generalized LR queries. This perspective is inspired by the idea presented in [1] and serves as the intuition behind the algorithms in Sections VII and VIII, which share a similar spirit with those for SUS finding in [1]. We start with the following lemma, which says that the useful LLRs are easy to compute.
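The one-pass filtering of useful LLRs that the next lemma describes can be sketched in Python as follows. To stay self-contained, the sketch computes each $|LLR_i|$ by brute-force substring search instead of the rank and lcp arrays of Lemma 1, so it does not achieve the $O(n)$ cost claimed in the paper. The function names are hypothetical and chosen only for this illustration.

```python
def llr_length(s, i):
    """|LLR_i| by brute force: the longest prefix of S[i..n] that occurs elsewhere."""
    length = 0
    while i - 1 + length < len(s):
        sub = s[i - 1:i + length]           # candidate of length + 1 characters
        first = s.find(sub)
        if s.find(sub, first + 1) == -1:    # the candidate is unique, so stop
            break
        length += 1
    return length


def useful_llrs(s):
    """Return the useful LLRs as 1-based (start, end) pairs in ascending order.

    By Lemma 2, LLR_i is useless exactly when it is one character shorter
    than LLR_{i-1}, so comparing with the previous length is enough.
    """
    llrc, prev = [], 0
    for i in range(1, len(s) + 1):
        length = llr_length(s, i)
        if length > 0 and length >= prev:
            llrc.append((i, i + length - 1))
        prev = length                        # updated on every iteration
    return llrc


print(useful_llrs("aaababaabaaabaaab"))
```

Note that `prev` is refreshed on every iteration, whether or not the current LLR was kept; the comparison is always against the immediately preceding position.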
Lemma 6. Given the lcp and rank arrays of the string $S$, we can compute its useful LLRs in $O(n)$ time and space.

Proof: By Lemma 2, we know that if $LLR_{i-1}$ exists, the right boundary of $LLR_i$ is on or after the right boundary of $LLR_{i-1}$, for any $i \ge 2$. So we can construct the array of useful LLRs in one pass as follows: we calculate each $LLR_i$ using Lemma 1, for $i = 1, 2, \ldots, n$, and eliminate (useless) $LLR_i$ if $|LLR_i| = 0$ or $|LLR_i| = |LLR_{i-1}| - 1$.

Fig. 1. The 2d geometric perspective on the useful LLRs of string $S = aaababaabaaabaaab$ and several of its generalized LR queries. (A) The LLRc array saves all the useful LLRs in strictly increasing order of their string positions: $\{(1,5), (5,8), (7,13), (10,14), (11,17)\}$, where each useful LLR is a $(start, end)$ tuple, representing the start and ending position of the LLR. By viewing the start and end positions as the x and y coordinates, all the useful LLRs of the example string can be visualized as the dark dots in the figure. (B) Four generalized LR queries are visualized by the red, blue, green, and black polylines, labeled A–D, respectively. (C) All dark dots and polylines are on or above the 45° diagonal.

Algorithm 3: The calculation of LLRc, the array of useful LLRs, saved in ascending order of their positions.
  Input: The rank and lcp arrays of the string S.

  j ← 1; prev ← 0;
  for i = 1 ... n do
      L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};    // |LLR_i|
      if L > 0 and L ≥ prev then                  // LLR_i is useful.
          LLRc[j] ← (i, i + L - 1); j ← j + 1;
      prev ← L;
  return LLRc;

Definition 4.
LLRc is an array of useful LLRs, which are saved in ascending order of their start positions. We use $LLRc.size$ to denote the number of elements in LLRc.

Algorithm 3 shows the procedure for the LLRc array construction in $O(n)$ time and space, provided with the rank array and lcp array of $S$. Each LLRc array element is a $(start, end)$ tuple, representing the start and ending positions of the useful LLR. Because no useful LLR is a substring of another useful LLR, we have the following fact.

Fact 1. All elements in the LLRc array have both their start and ending positions in strictly increasing order. That is, for any $i$ and $j$, $1 \le i < j \le LLRc.size$: $LLRc[i].start < LLRc[j].start$ and $LLRc[i].end < LLRc[j].end$.

If we view each useful LLR's start position as the x coordinate and ending position as the y coordinate, each useful LLR can be viewed as a dot in the 2d space. All the 2d dots, representing all the useful LLRs that are saved in the LLRc array, are distributed in the 2d space from the lower-left corner toward the upper-right corner. Because of Fact 1, no two dots share the same x or y coordinate. Further, since every dot's y coordinate is no less than its x coordinate, those dots are on or above the 45° diagonal. Figure 1 shows this geometric perspective of several useful LLRs.

Algorithm 4: Find LR using 2d DMQ.
  Input: The lcp and rank arrays of the string S.

  Compute the LLRc array;                                  // Algorithm 3
  Build the 2d DMQ index for the LLRc array elements;      // Existing technique, e.g., [21]

  /* Find one choice of LR_x^y. */
  QueryOne2d(x, y):
      return 2dDMQ(x, y);                   // returns (-1, -1) if S_{x,y} = ∅.

  /* Find all choices of LR_x^y. */
  QueryAll2d(x, y):
      (x', y') ← 2dDMQ(x, y);
      if (x', y') ≠ (-1, -1) then
          FindAll2d(x, y, y' - x' + 1);     // Recursive searches start.

  FindAll2d(x, y, weight):                  // Helper function
      (x', y') ← 2dDMQ(x, y);
      if (x', y') = (-1, -1) or (y' - x' + 1 < weight) then
          return;                           // Recursion exits.
      Print (x', y');                       // One choice of LR_x^y is found.
      if x' - 1 ≥ 1 then
          FindAll2d(x' - 1, y, weight);     // New recursive search.
      if y' + 1 ≤ n then
          FindAll2d(x, y' + 1, weight);     // New recursive search.

Definition 5.
The weight of a dot $(x, y)$, representing a useful $LLR_x = S[x..y]$, is $|LLR_x| = y - x + 1$, the length of $LLR_x$.

Definition 6. $S_{x,y} = \{(a, b) \in LLRc \mid a \le x, b \ge y\}$.

If we draw in the 2d space a ┘-shaped orthogonal polyline whose corner is located at position $(x, y)$, $S_{x,y}$ is the set of 2d dots representing those useful LLRs that are located on the upper-left side (inclusive) of the polyline.

Because any LR must be a useful LLR (Lemma 3), from this geometric perspective, the answer to the $LR_x^y$ query becomes the heaviest dot(s) whose horizontal coordinate is $\le x$ and whose vertical coordinate is $\ge y$. That is, $LR_x^y$ are the heaviest dots in $S_{x,y}$. If $S_{x,y}$ is empty, $LR_x^y$ does not exist. Figure 1 shows this geometric perspective of several generalized LR queries.

VII. AN INDEX OF $O(occ \cdot \log n)$ QUERY TIME

As explained in Section VI, $LR_x^y$ is the heaviest dot(s) from the set $S_{x,y}$, if $S_{x,y}$ is not empty; otherwise, $LR_x^y$ does not exist. Finding one heaviest dot from $S_{x,y}$ is nothing but the well-known 2d dominance max query.
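The geometric reading of an interval query can be checked with a direct scan over the dots. The Python sketch below hard-codes the useful LLRs of the Figure 1 string (worked out by hand for this illustration) and evaluates Definitions 5 and 6 literally. The function names and the sample queries are hypothetical choices for this example, and the linear scan is not the indexed query developed in this section.

```python
# Useful LLRs of S = aaababaabaaabaaab as (start, end) dots, i.e., the LLRc array.
LLRC = [(1, 5), (5, 8), (7, 13), (10, 14), (11, 17)]


def weight(dot):
    """Weight of a dot (Definition 5): the length of the LLR it represents."""
    start, end = dot
    return end - start + 1


def dominated_set(dots, x, y):
    """S_{x,y} (Definition 6): dots (a, b) with a <= x and b >= y."""
    return [(a, b) for (a, b) in dots if a <= x and b >= y]


def heaviest_dots(dots, x, y):
    """All choices of LR_x^y as dots; an empty list means LR_x^y does not exist."""
    region = dominated_set(dots, x, y)
    if not region:
        return []
    top = max(weight(d) for d in region)
    return [d for d in region if weight(d) == top]


print(heaviest_dots(LLRC, 11, 13))   # two dots of weight 7
print(heaviest_dots(LLRC, 4, 6))     # no dot dominates (4, 6)
```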
2d dominance max query (DMQ). Given a set of $n$ dots and any position $(x, y)$ in the 2d space, find the heaviest dot whose horizontal coordinate is $\le x$ and whose vertical coordinate is $\ge y$. If there are multiple choices, ties are resolved arbitrarily.

There exist indexing structures (e.g., [21]) that can be constructed on top of the $n$ dots using $O(n \log n)$ time and $O(n)$ space, such that by using the indexing structure, any future 2d DMQ can be answered in $O(\log n)$ time. The reduction from finding an LR to a 2d DMQ immediately gives us the QueryOne2d function in Algorithm 4 for finding one choice of an LR.
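The way a single-answer 2d DMQ can drive the enumeration of all answers (the QueryAll2d and FindAll2d functions of Algorithm 4, discussed in the subsection on finding all choices) is sketched below in Python. The DMQ here is a linear scan standing in for the $O(\log n)$ index of [21], so only the recursion pattern is illustrated, not the running time. The names `dmq` and `query_all` and the hard-coded dots are assumptions made for this example.

```python
# Useful LLRs of S = aaababaabaaabaaab (n = 17), worked out by hand.
LLRC = [(1, 5), (5, 8), (7, 13), (10, 14), (11, 17)]


def dmq(dots, x, y):
    """One heaviest dot (a, b) with a <= x and b >= y, or None if there is none.

    Linear-scan stand-in for a real 2d dominance max index.
    """
    best = None
    for a, b in dots:
        if a <= x and b >= y and (best is None or b - a > best[1] - best[0]):
            best = (a, b)
    return best


def query_all(dots, n, x, y):
    """All heaviest dots dominating (x, y), found by recursive DMQs."""
    first = dmq(dots, x, y)
    if first is None:
        return []
    target = first[1] - first[0] + 1     # weight that every answer must have
    found = []

    def find_all(qx, qy):
        dot = dmq(dots, qx, qy)
        if dot is None or dot[1] - dot[0] + 1 < target:
            return                       # nothing heavy enough in this region
        found.append(dot)
        a, b = dot
        if a - 1 >= 1:
            find_all(a - 1, qy)          # dots strictly to the left of (a, b)
        if b + 1 <= n:
            find_all(qx, b + 1)          # dots strictly above (a, b)

    find_all(x, y)
    return found


print(query_all(LLRC, 17, 11, 13))
```

Because the coordinates of the dots are strictly increasing (Fact 1), the two recursive regions are disjoint, so no dot is reported twice.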
Theorem 3. We can construct an indexing structure for a string $S$ of size $n$ using $O(n \log n)$ time and $O(n)$ space, such that by using the indexing structure any future generalized LR query can be answered in $O(\log n)$ time. If there exist multiple choices for the LR of interest, ties are resolved arbitrarily.

Proof: (1) The suffix array of $S$ can be constructed by existing algorithms using $O(n)$ time and space (e.g., [19]). After the suffix array is constructed, the rank array can be trivially created using $O(n)$ time and space. We can then use the suffix array and the rank array to construct the lcp array using another $O(n)$ time and space [20]. (2) Given the rank array and the lcp array, we can construct the LLRc array of useful LLRs using $O(n)$ time and space (Lemma 6 and Algorithm 3). (3) We then create the indexing structure for the LLRc array elements for 2d DMQ, using $O(n \log n)$ time and $O(n)$ space (e.g., [21]). By using this index, we can answer any future generalized LR query in $O(\log n)$ time, and ties are resolved arbitrarily.

A. Find all choices of any LR.
We know that $LR_x^y$ are the heaviest dots in $S_{x,y}$, if $S_{x,y}$ is not empty; otherwise, $LR_x^y$ does not exist. Upon receiving a query for $LR_x^y$, we first perform a 2d DMQ, which returns one of the heaviest dots in $S_{x,y}$. If no such dot is returned, then $LR_x^y$ does not exist. Otherwise, suppose $(x', y')$ is the dot returned; then $(x', y')$ is one of the choices for $LR_x^y$.

Because all the dots representing the LLRc array elements have both their x and y coordinates strictly increasing (Fact 1), all other choices (if any) of $LR_x^y$ must lie in the union of $S_{x'-1,y}$ and $S_{x,y'+1}$. Therefore, we can find other choices of $LR_x^y$ by the following two recursive searches: one finds one of the heaviest dots in $S_{x'-1,y}$, and the other finds one of the heaviest dots in $S_{x,y'+1}$. Each of these two recursive searches is again a 2d DMQ.

For each recursive search: (1) If the weight of the heaviest dot it finds is equal to $y' - x' + 1$, the length of $LR_x^y$, it returns the found dot as another choice of $LR_x^y$ and then launches its own two new recursive searches, similar to what its caller has done, in order to find other choices for $LR_x^y$; (2) otherwise, it stops and returns to its caller.

Function QueryAll2d in Algorithm 4 shows the pseudocode for finding all choices of $LR_x^y$.

Example 1 (Figure 1). Search A is for $LR_x^y$, the query drawn as polyline A in Figure 1. That is, it finds all the heaviest dots in $S_{x,y}$, which include dot $(7, 13)$ and dot $(11, 17)$. Suppose the 2d DMQ launched by search A returns dot $(7, 13)$, which has a weight of 7 and is one choice for $LR_x^y$. The next two recursive searches launched by search A will be search B, looking for one of the heaviest dots in $S_{x,14}$, and search C, looking for one of the heaviest dots in $S_{6,y}$.

Search B will return the heaviest dot $(11, 17)$ from $S_{x,14}$, whose weight is equal to 7, so the dot $(11, 17)$ is another choice of $LR_x^y$. Search B will then launch its own two new recursive searches for one heaviest dot in each of $S_{10,14}$ and $S_{x,18}$. (These two searches are not shown in Figure 1 for conciseness.) The search in $S_{10,14}$ returns dot $(10, 14)$, whose weight is less than 7, so the search stops and returns to its caller. The search in $S_{x,18}$ finds nothing, so it stops and returns to its caller. After all its recursive searches return, search B returns to its caller, which is search A.

Search C finds nothing in $S_{6,y}$, so it stops and returns to its caller, which is search A.

At this point, all the work of search A is finished, and we have found all the choices for $LR_x^y$, which are $S[7..13]$ and $S[11..17]$ (or $LLRc[3]$ and $LLRc[5]$, equivalently).

Clearly, the same 2d DMQ index is used in finding all choices of an LR query, and there are no more than $2 \cdot occ + 1$ instances of 2d DMQ in the finding of all choices of an LR, where $occ$ is the number of choices of the LR. Because each 2d DMQ takes $O(\log n)$ time, we get the following theorem.

Theorem 4.
We can construct an indexing structure for a string $S$ of size $n$ using $O(n \log n)$ time and $O(n)$ space, such that by using the indexing structure, we can find all choices of any LR in $O(occ \cdot \log n)$ time, where $occ$ is the number of choices of the LR being queried for.

VIII. AN INDEX OF $O(occ)$ QUERY TIME

In this section, we present the optimal indexing structure for generalized LR finding. It is again based on the intuition derived from the geometric perspective on the relationship between useful LLRs and LR queries (Section VI).

Recall that the answer for an $LR_x^y$ query is the heaviest dot(s) from $S_{x,y}$, if $S_{x,y}$ is not empty. Due to Fact 1, $S_{x,y}$ corresponds to a continuous chunk of the LLRc array, if $S_{x,y}$ is not empty. Therefore, searching for one heaviest dot in $S_{x,y}$ becomes searching for one heaviest element within a continuous chunk of the LLRc array, which is nothing but the range minimum query on the array LLRc.

Range minimum query (RMQ).
Given an array $A[1..n]$ of $n$ comparable elements, find the index of the smallest element within $A[i..j]$, for any given $i$ and $j$, $1 \le i \le j \le n$. If there are multiple choices, ties are resolved arbitrarily.

There exist indexing structures (e.g., [22], [23]) that can be constructed on top of the array $A$ using $O(n)$ time and space, such that any future RMQ can be answered in $O(1)$ time.

The next issue is: upon receiving a query for $LR_x^y$, for some $x$ and $y$, $1 \le x \le y \le n$, how do we find the left and right boundaries of the continuous chunk of LLRc over which we will perform an RMQ? (We should actually perform a range maximum query, which however can be trivially reduced to RMQ by viewing each array element as the negative of its actual value.) Due to Fact 1 and with the aid of the geometric perspective of the useful LLRs, we can observe that the left boundary of the chunk only depends on the value of $y$, whereas the right boundary of the chunk only depends on the value of $x$. Intuitively, if one sweeps a horizontal line upward starting from position $y$ (inclusive), the LLRc array index of the first dot that the line hits is the left boundary of the RMQ's range. Similarly, if one sweeps a vertical line leftward starting from position $x$ (inclusive), the LLRc array index of the first dot that the line hits is the right boundary of the RMQ's range. The range for RMQ is invalid if any one of the following three possibilities happens: 1) no dot is hit by the horizontal line; 2) no dot is hit by the vertical line; 3) the index of the left boundary of the range is larger than the index of the right boundary of the range. An invalid RMQ range means that $LR_x^y$ does not exist. See Figure 1 for examples.

More precisely, given the values of $x$ and $y$ from the query for $LR_x^y$, the left boundary $L_y$ and the right boundary $R_x$ of the range for RMQ can be determined as follows:

\[ L_y = \begin{cases} \min\{i \mid LLRc[i].end \ge y\}, & \text{if } \{i \mid LLRc[i].end \ge y\} \ne \emptyset \\ -1, & \text{otherwise} \end{cases} \]

\[ R_x = \begin{cases} \max\{i \mid LLRc[i].start \le x\}, & \text{if } \{i \mid LLRc[i].start \le x\} \ne \emptyset \\ -1, & \text{otherwise} \end{cases} \]

Further, we can pre-compute $L_y$ and $R_x$, for every $x = 1, 2, \ldots, n$ and $y = 1, 2, \ldots, n$, and save the results for future reference. Algorithm 5 shows the procedure for computing the $L$ and $R$ arrays, which clearly uses $O(n)$ time and space.

Algorithm 5: Compute L_i and R_i for i = 1, 2, ..., n.
  Input: The LLRc array.
  Output: The L and R arrays.

  for i = 1 ... n do
      L_i ← -1; R_i ← -1;              // Initialization.
  i ← 1;
  for y = 1 ... n do
      if y ≤ LLRc[i].end then
          L_y ← i;
      else if i < LLRc.size then
          i ← i + 1; L_y ← i;
      else
          break;
  i ← LLRc.size;
  for x = n down to 1 do
      if x ≥ LLRc[i].start then
          R_x ← i;
      else if i > 1 then
          i ← i - 1; R_x ← i;
      else
          break;

Lemma 7.
Algorithm 5 computes $L_1, L_2, \ldots, L_n$ and $R_1, R_2, \ldots, R_n$ using $O(n)$ time and space.

Now we are ready to present the algorithm for finding one choice of a generalized LR query. Algorithm 6 (through Line 7, i.e., the QueryOneRMQ function) gives the pseudocode. After the array LLRc is created, we compute the L and R arrays from the LLRc array (Algorithm 5). Then we create the RMQ structure for the LLRc array using existing techniques (e.g., [22], [23]), where the weight of each array element is defined as the length of the corresponding LLR (or, from the geometric perspective, the weight of the 2d dot representing that LLR). Upon receiving a query for $LR_x^y$, function QueryOneRMQ(x, y) performs an RMQ over the range $LLRc[L_y..R_x]$, if $1 \le L_y \le R_x \le n$; otherwise, it returns $(-1, -1)$, meaning $LR_x^y$ does not exist. The answer returned by the RMQ is one of the choices for $LR_x^y$. If there exist multiple choices for $LR_x^y$, ties are resolved arbitrarily, depending on which heaviest element in the range is returned by the RMQ.

Theorem 5.
We can construct an indexing structure for a string $S$ of size $n$ using $O(n)$ time and space, such that any future generalized LR query can be answered in $O(1)$ time. If there exist multiple choices for the LR being queried for, ties are resolved arbitrarily.

Proof: (1) The suffix array of $S$ can be constructed by existing algorithms using $O(n)$ time and space (e.g., [19]). After the suffix array is constructed, the rank array can be trivially created using $O(n)$ time and space. We can then use the suffix array and the rank array to construct the lcp array using another $O(n)$ time and space [20]. (2) Given the rank array and the lcp array, we can construct the LLRc array of useful LLRs using $O(n)$ time and space (Lemma 6 and Algorithm 3). (3) Given the LLRc array, we can compute the L and R arrays using another $O(n)$ time and space (Lemma 7 and Algorithm 5). (4) We then create the RMQ structure for the LLRc array using another $O(n)$ time and space, using existing techniques (e.g., [22], [23]). So, the total time and space cost for building the indexing structure is $O(n)$. By using this RMQ indexing structure and the pre-computed L and R arrays, we can answer any future generalized LR query in $O(1)$ time (the QueryOneRMQ function in Algorithm 6). If there exist multiple choices for the LR being searched for, ties are resolved arbitrarily, as determined by the RMQ structure.

Algorithm 6: Find LR using RMQ.
Input: The lcp and rank arrays of the string S.
  Compute the LLRc array;                           // Algorithm 3
  Compute the L and R arrays from the LLRc array;   // Algorithm 5
  Construct the RMQ structure for the LLRc array;   // [22], [23]

  /* Find one choice of LR_x^y. */
  QueryOneRMQ(x, y):
    if L_y ≠ −1 and R_x ≠ −1 and L_y ≤ R_x then
      return LLRc[RMQ(LLRc[L_y..R_x])];
    else return (−1, −1);                           // LR_x^y does not exist.

  /* Find all choices of LR_x^y. */
  QueryAllRMQ(x, y):
    if L_y ≠ −1 and R_x ≠ −1 and L_y ≤ R_x then
      m ← RMQ(LLRc[L_y..R_x]);
      weight ← LLRc[m].end − LLRc[m].start + 1;     // |LR_x^y|
      FindAllRMQ(L_y, R_x, weight);                 // Recursive searches start.
    else return (−1, −1);                           // LR_x^y does not exist.

  FindAllRMQ(ℓ, r, weight):                         // Helper function
    m ← RMQ(LLRc[ℓ..r]);
    if LLRc[m].end − LLRc[m].start + 1 < weight then return;   // Recursion exits.
    Print LLRc[m];                                  // One choice of LR_x^y is found.
    if ℓ ≤ m − 1 then FindAllRMQ(ℓ, m − 1, weight); // New recursive search.
    if r ≥ m + 1 then FindAllRMQ(m + 1, r, weight); // New recursive search.
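The pipeline of Theorem 5 (LLRc array, then the L and R arrays, then an RMQ structure, then a constant-time lookup) can be illustrated with the following minimal Python sketch. It rests on several assumptions that are not from the paper: the function and class names are hypothetical, the LLRc array is a made-up list of (start, end) pairs with strictly increasing starts and ends, and a sparse table (which takes $O(n \log n)$ time and space to build) is swapped in for the $O(n)$-construction RMQ structures of [22], [23]. The sparse table answers range-maximum queries directly, so no negation of weights is needed.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end): 1-based, inclusive string positions


def compute_lr_arrays(llrc: List[Interval], n: int) -> Tuple[List[int], List[int]]:
    """Return 1-based arrays L and R (slot 0 unused); -1 means 'undefined'.

    llrc must be sorted with strictly increasing starts and ends.
    """
    L = [-1] * (n + 1)
    R = [-1] * (n + 1)
    i = 0  # 0-based cursor; the stored boundaries are 1-based LLRc indices
    for y in range(1, n + 1):
        while i < len(llrc) and llrc[i][1] < y:
            i += 1
        if i == len(llrc):
            break  # no LLR ends at or after y (nor after any larger y)
        L[y] = i + 1
    i = len(llrc) - 1
    for x in range(n, 0, -1):
        while i >= 0 and llrc[i][0] > x:
            i -= 1
        if i < 0:
            break  # no LLR starts at or before x (nor before any smaller x)
        R[x] = i + 1
    return L, R


class SparseTableArgmax:
    """Range-argmax in O(1) per query after O(n log n) preprocessing."""

    def __init__(self, weights: List[int]) -> None:
        self.w = weights
        n = len(weights)
        self.table = [list(range(n))]  # table[j][i]: argmax of w[i .. i + 2^j - 1]
        j = 1
        while (1 << j) <= n:
            prev, half = self.table[-1], 1 << (j - 1)
            row = []
            for i in range(n - (1 << j) + 1):
                a, b = prev[i], prev[i + half]
                row.append(a if self.w[a] >= self.w[b] else b)
            self.table.append(row)
            j += 1

    def query(self, lo: int, hi: int) -> int:
        """0-based index of a heaviest element in w[lo..hi] (inclusive)."""
        j = (hi - lo + 1).bit_length() - 1
        a, b = self.table[j][lo], self.table[j][hi - (1 << j) + 1]
        return a if self.w[a] >= self.w[b] else b


def build_index(llrc: List[Interval], n: int):
    L, R = compute_lr_arrays(llrc, n)
    rmq = SparseTableArgmax([end - start + 1 for start, end in llrc])
    return L, R, rmq


def query_one(llrc: List[Interval], index, x: int, y: int) -> Interval:
    """One longest repeat covering positions x..y, or (-1, -1) if none exists."""
    L, R, rmq = index
    ly, rx = L[y], R[x]
    if ly == -1 or rx == -1 or ly > rx:
        return (-1, -1)
    return llrc[rmq.query(ly - 1, rx - 1)]


# Hypothetical LLRc array for a string of length 14 (not taken from Figure 1).
llrc = [(1, 2), (2, 4), (3, 9), (5, 10), (6, 12), (9, 13)]
index = build_index(llrc, n=14)
print(query_one(llrc, index, 6, 8))  # (3, 9)
print(query_one(llrc, index, 1, 3))  # (-1, -1)
```

In this sketch ties go to the leftmost heaviest element, which is one admissible way of resolving them arbitrarily; for the interval [6, 8] the element (6, 12) has the same weight and would be an equally valid answer.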
A. Find all choices of any LR.
Upon receiving a query for $LR_x^y$, we first perform an RMQ over the range $LLRc[L_y..R_x]$ if such a range exists; otherwise, $LR_x^y$ does not exist, and we stop. Suppose the range $LLRc[L_y..R_x]$ is valid and its RMQ returns $m$, the array index of the heaviest element in the range. Then $LLRc[m]$ is one of the choices for $LR_x^y$, and $|LR_x^y| = LLRc[m].end - LLRc[m].start + 1$. If $LR_x^y$ has other choices, those choices must lie in the union of the ranges $LLRc[L_y..m-1]$ and $LLRc[m+1..R_x]$. We can find those choices of $LR_x^y$ by recursively performing an RMQ on each of those two ranges. The recursion exits if the element returned by the RMQ has a weight smaller than $|LR_x^y|$ or the range for the RMQ is invalid. The QueryAllRMQ function in Algorithm 6 shows the pseudocode of this procedure for finding all choices of an LR query.
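The recursive search just described can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's implementation: the function names are hypothetical, the LLRc array is made up, a linear-scan `rmq_max` stands in for the $O(1)$ RMQ structure (so the sketch does not achieve the stated time bound), and a non-existent LR is reported as an empty list rather than $(-1, -1)$.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end): 1-based, inclusive string positions


def weight(iv: Interval) -> int:
    return iv[1] - iv[0] + 1


def rmq_max(llrc: List[Interval], lo: int, hi: int) -> int:
    """Stand-in for the O(1) RMQ: 1-based index of a heaviest LLRc[lo..hi]."""
    return max(range(lo, hi + 1), key=lambda i: weight(llrc[i - 1]))


def find_all_rmq(llrc: List[Interval], lo: int, hi: int,
                 target: int, out: List[Interval]) -> None:
    """Collect every element of LLRc[lo..hi] whose weight equals target."""
    m = rmq_max(llrc, lo, hi)
    if weight(llrc[m - 1]) < target:
        return  # nothing heavy enough in this range: recursion exits
    out.append(llrc[m - 1])  # one choice is found
    if lo <= m - 1:
        find_all_rmq(llrc, lo, m - 1, target, out)
    if hi >= m + 1:
        find_all_rmq(llrc, m + 1, hi, target, out)


def query_all_rmq(llrc: List[Interval], l_y: int, r_x: int) -> List[Interval]:
    """All choices of the LR whose RMQ range is LLRc[l_y..r_x] (1-based)."""
    if l_y == -1 or r_x == -1 or l_y > r_x:
        return []  # the LR does not exist
    m = rmq_max(llrc, l_y, r_x)
    out: List[Interval] = []
    find_all_rmq(llrc, l_y, r_x, weight(llrc[m - 1]), out)
    return out


# Hypothetical LLRc array; entries 3 and 5 tie with weight 7.
llrc = [(1, 2), (2, 4), (3, 9), (5, 10), (6, 12), (9, 13)]
print(query_all_rmq(llrc, 3, 5))  # [(3, 9), (6, 12)]
```

Because every successful call reports a choice before recursing, the recursion depth can grow with the number of choices; an explicit stack would be a reasonable alternative in a language with a small recursion limit.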
Example 2 (Figure 1). Given the LLRc array of the example string in Figure 1, Algorithm 5 computes the L and R arrays.

[Table: $i$, $L_i$, $R_i$ for the example string of Figure 1.]

Upon receiving the query $LR_x^y$, we first use the L and R arrays to retrieve the range $[L_y, R_x] = [3, 5]$, which is a valid range for RMQ. Then we perform $RMQ(LLRc[3..5])$ of Search A, and either 3 or 5 can be returned, because both $LLRc[3]$ and $LLRc[5]$ are the heaviest elements in the range $LLRc[3..5]$. Suppose 3 is returned and is saved in $m$; then we get $LLRc[3]$ as one choice for $LR_x^y$, and $|LR_x^y| = |LLRc[3]| = 7$.

Then, we find other choices for $LR_x^y$ by performing a recursive search on each of the ranges $[L_y, m-1] = [3, 2]$ and $[m+1, R_x] = [4, 5]$. The first range is invalid, so the search exits (meaning Search C in Figure 1 will not be performed). The search on the second range $[4, 5]$ (corresponding to Search B in Figure 1), which is valid, launches $RMQ(LLRc[4..5])$. The RMQ returns 5. Since $|LLRc[5]| = |LR_x^y| = 7$, $LLRc[5]$ is another choice for $LR_x^y$.

Then, the search on the range $[4, 5]$ launches its own two recursive searches on the ranges $[4, 5-1] = [4, 4]$ and $[5+1, 5] = [6, 5]$. The search on the first range finds that the heaviest element's weight is less than $|LR_x^y|$, so the search stops. Because the second range is invalid, the recursive search on that range stops immediately.

At this point, all choices for $LR_x^y$, which are $LLRc[3]$ and $LLRc[5]$, have been found.
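The search order of Example 2 can be replayed with a short Python trace. The sketch below is illustrative only: the function name is hypothetical, the weight array is made up except that entries 3 and 5 both have weight 7 and entry 4 is lighter (as in the example), the RMQ is a linear scan that returns the leftmost heaviest element, and an explicit stack replaces the recursion.

```python
from typing import List, Tuple


def trace_find_all(weights: List[int], lo: int, hi: int
                   ) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Return (choices, visited) for the 1-based, inclusive range lo..hi.

    choices: 1-based indices of all heaviest elements in weights[lo..hi].
    visited: the ranges on which a search (one RMQ each) was launched.
    """
    def rmq(l: int, r: int) -> int:
        # Linear-scan stand-in for the O(1) RMQ; leftmost element wins ties.
        return max(range(l, r + 1), key=lambda i: weights[i - 1])

    target = weights[rmq(lo, hi) - 1]  # weight of the LR being searched for
    choices: List[int] = []
    visited: List[Tuple[int, int]] = []
    stack = [(lo, hi)]
    while stack:
        l, r = stack.pop()
        visited.append((l, r))
        m = rmq(l, r)
        if weights[m - 1] < target:
            continue  # heaviest element is too light: this search stops
        choices.append(m)
        # Push the right range first so the left range is searched first,
        # mirroring the order of the recursive calls. Invalid ranges are
        # never pushed.
        if r >= m + 1:
            stack.append((m + 1, r))
        if l <= m - 1:
            stack.append((l, m - 1))
    return choices, visited


# Hypothetical weights of LLRc[1..6]; only the 7s at positions 3 and 5 and
# the lighter entry at position 4 mirror Example 2.
weights = [2, 3, 7, 6, 7, 5]
choices, visited = trace_find_all(weights, 3, 5)
print(choices)  # [3, 5]
print(visited)  # [(3, 5), (4, 5), (4, 4)]
```

The three visited ranges correspond to Search A on [3, 5], Search B on [4, 5], and the search on [4, 4] that stops immediately; the invalid range [3, 2] (Search C) is never visited.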
Clearly, the same indexing structure is used by all RMQs in the search for all choices of $LR_x^y$. Further, there are no more than $2 \cdot occ + 1$ RMQs in the finding of all choices of one LR, where $occ$ is the number of choices of the LR. Because each RMQ takes $O(1)$ time, we get the following theorem.

Theorem 6. We can construct an indexing structure for a string $S$ of size $n$ using $O(n)$ time and space, such that by using the indexing structure, we can find all choices of any generalized LR in $O(occ)$ time, where $occ$ is the number of choices of the LR being queried for.

IX. IMPLEMENTATION AND EXPERIMENTS
We implement our proposals in C++, using the library binary of the implementation of the DMQ and RMQ structures from [1]. Our implementation is generic in that it does not assume the alphabet size of the underlying string, and thus supports LR queries over different types of strings.

We compare the performance of our proposals with the prior works, including the optimal $O(n)$ time and space solution from [12] and the suboptimal sequential algorithm presented in [14]. Note that all prior works can only answer point queries. All programs involved in the experiments use the same libdivsufsort library (https://code.google.com/p/libdivsufsort) for the suffix array construction, and are compiled by gcc 4.7.2 with the -O3 option.

We conduct our experiments on a GNU/Linux machine with kernel version 3.2.51-1. The computer is equipped with an Intel Xeon 2.40GHz E5-2609 CPU with 10MB Smart Cache and has 16GB RAM. All experiments are conducted on real-world datasets including the DNA and Protein strings, downloaded from the Pizza&Chili Corpus. The datasets we use are the two 100 MB DNA and Protein pure ASCII text files, each of which thus represents a string of 100 × 1024 × 1024 = 104,857,600 characters. Any other shorter strings involved in our experiments are prefixes of certain lengths, cut from the 100 MB strings.

Fig. 2. Peak memory usage of different proposals for DNA and Protein strings of different sizes. [Two panels, DNA and Protein; x-axis: sequence size in MBs; y-axis: peak memory usage in MBs; series: [12], [14], RMQ-based, DMQ-based.]

Fig. 3. Indexing structure construction time for DNA and Protein strings of different sizes. [Two panels, DNA and Protein; x-axis: sequence size in MBs; y-axis: time cost in seconds; series: SA, Rank, LCP; RMQ; DMQ.]
A. Space
Here, we measure the peak memory usage of different proposals, using the Linux command /usr/bin/time -f "%M", which captures the maximum resident set size of a process during its lifetime. We do not save the output in RAM, in order to focus on the comparison of the memory usage of the algorithms. It is also because practitioners often flush the outputs directly to disk files for future reuse.

Figure 2 shows the peak memory usage of different proposals that process DNA and protein strings of different sizes. It is worth noting that, by design, the memory usage of each proposal is independent of the query type, such as finding one choice vs. all choices of an LR, or point query vs. interval query. We have the following main observations:
– All proposals show the linearity of their space usage over string size.
– Our DMQ-based proposal uses much more memory space than the other proposals. This is mainly caused by the high space demand of the DMQ structure.
– Our RMQ-based proposal uses nearly the same amount of memory space as that of prior works, while significantly improving the usability of the technique by providing the functionality of interval queries.

(Footnote: Pizza&Chili Corpus, http://pizzachili.dcc.uchile.cl/texts.html)

B. Time
Figure 3 shows the construction time of the indexing structures used by different proposals. Note that all proposals need to construct the suffix array, rank array, and the lcp array of the given string, and our proposals further use these auxiliary arrays to construct the DMQ and RMQ structures for interval queries. The following are the main observations:
– The construction of the DMQ structure takes much more time than that of the auxiliary arrays and the RMQ structure.
– Both the auxiliary arrays and the RMQ structure clearly show linearity in their construction time over string size.
– The construction of the RMQ structure takes less time than the construction of the auxiliary arrays, making our RMQ-based proposal practical while supporting interval queries.

Fig. 4. Query time of different proposals for DNA and Protein strings of different sizes. [Four panels: DNA, find one choice for each LR; DNA, find all choices for each LR; Protein, find one choice for each LR; Protein, find all choices for each LR. x-axis: sequence size in MBs; y-axis: time cost in seconds; series: [12] (one-choice panels only), [14], and RMQ with interval size = 1, 5, 10, 15, 20.]

Figure 4 shows the time cost of various types of query. Our DMQ-based proposal is so slow in query response that we do not include it in the figure. For point queries, we plot the total time cost for all the point queries over all $n$ string positions, where $n$ is the string size. For interval queries with interval size $\delta$, we plot the total time cost for all the interval queries over all $n - \delta + 1$ intervals of the string. Note that only point queries are involved in the experiments with the proposals from [12] and [14], because they do not support interval queries. The two figures on the left show the case for finding only one choice for each LR, whereas the two on the right show the case for finding all choices for each LR. Because the proposal from [12] does not support the finding of all choices, it is not included in the two figures on the right side. The following are the main observations:
– All proposals show clear linearity of the total query time cost, meaning an amortized $O(1)$ time cost for each query.
– In the setting of finding one choice for each LR (the two figures on the left of Figure 4), our RMQ-based proposal is the fastest regarding the per-query response time, for both point queries and interval queries. Further, our RMQ-based proposal's interval query response becomes even faster when the interval size increases. That is because a longer interval is covered by fewer repeats, reducing the size of the search space for finding the LR covering the interval.
– In the setting of finding all choices for each LR (the two figures on the right side of Figure 4):
  • For point queries, our RMQ-based proposal is a little slower than [14], for the following reason. On average, an LR point query returns more choices than an interval query. Our technique needs to make a query to the index for finding every single choice, whereas the technique in [14] only needs one extra "walk" for finding all choices for a particular LR point query. Even though our technique is faster than [14] for finding one choice (the two figures on the left), when a particular point query has many choices, our technique can become slower in finding all choices.
  • As the interval size increases, our RMQ-based proposal becomes faster, because a longer interval on average has fewer choices for its LR, so our technique makes fewer queries to its index. Our technique's interval query can be even faster than the point query of [14] in finding all choices when the interval size increases. For example, this happens beyond a certain interval size for the DNA string (top-right figure) and for the protein string (bottom-right figure).

X. CONCLUSION
We generalized the longest repeat query on a string from point query to interval query and proposed a solution for interval queries that is optimal in both time and space. Our approach is different from prior work, which can only handle point queries. Using the insight from [1], we proposed an indexing structure that can be built on top of the string using time and space linear in the string size, such that any future interval query can be answered in $O(1)$ time. We implemented our proposals without assuming the alphabet size of the string, making them useful for different types of strings. An interesting direction for future work is to parallelize our proposal so as to take advantage of modern multi-core and multi-processor computing platforms, such as general-purpose graphics processing units.

REFERENCES

[1] X. Hu, J. Pei, and Y. Tao, “Shortest unique queries on strings,” in Proceedings of International Symposium on String Processing and Information Retrieval (SPIRE), 2014, pp. 161–172.
[2] D. Gusfield, Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997.
[3] E. H. McConkey, Human Genetics: The Molecular Revolution. Boston, MA: Jones and Bartlett, 1993.
[4] X. Liu and L. Wang, “Finding the region of pseudo-periodic tandem repeats in biological sequences,” Algorithms for Molecular Biology, vol. 1, no. 1, p. 2, 2006.
[5] H. M. Martinez, “An efficient method for finding repeats in molecular sequences,” Nucleic Acids Research, vol. 11, no. 13, pp. 4629–4634, 1983.
[6] V. Becher, A. Deymonnaz, and P. A. Heiber, “Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome,” Bioinformatics, vol. 25, no. 14, pp. 1746–1753, 2009.
[7] M. O. Kulekci, J. S. Vitter, and B. Xu, “Efficient maximal repeat finding using the Burrows-Wheeler transform and wavelet tree,” IEEE Transactions on Computational Biology and Bioinformatics (TCBB), vol. 9, no. 2, pp. 421–429, 2012.
[8] T. Beller, K. Berger, and E. Ohlebusch, “Space-efficient computation of maximal and supermaximal repeats in genome sequences,” in Proceedings of the 19th International Conference on String Processing and Information Retrieval (SPIRE), 2012, pp. 99–110.
[9] A. Bakalis, C. S. Iliopoulos, C. Makris, S. Sioutas, E. Theodoridis, A. K. Tsakalidis, and K. Tsichlas, “Locating maximal multirepeats in multiple strings under various constraints,” Computer Journal, vol. 50, no. 2, pp. 178–185, 2007.
[10] C. S. Iliopoulos, W. F. Smyth, and M. Yusufu, “Faster algorithms for computing maximal multirepeats in multiple sequences,” Fundamenta Informaticae, vol. 97, no. 3, pp. 311–320, 2009.
[11] L. Ilie and W. F. Smyth, “Minimum unique substrings and maximum repeats,” Fundamenta Informaticae, vol. 110, no. 1-4, pp. 183–195, 2011.
[12] A. M. İleri, M. O. Külekci, and B. Xu, “On longest repeat queries,” http://arxiv.org/abs/1501.06259.
[13] T. Schnattinger, E. Ohlebusch, and S. Gog, “Bidirectional search in a string with wavelet trees and bidirectional matching statistics,” Information and Computation, vol. 213, pp. 13–22, Apr. 2012.
[14] Y. Tian and B. Xu, “On longest repeat queries using GPU,” in Proceedings of the 20th International Conference on Database Systems for Advanced Applications (DASFAA), 2015, pp. 316–333.
[15] J. Pei, W. C. H. Wu, and M. Y. Yeh, “On shortest unique substring queries,” in Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE), 2013, pp. 937–948.
[16] K. Tsuruta, S. Inenaga, H. Bannai, and M. Takeda, “Shortest unique substrings queries in optimal time,” in Proceedings of International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), 2014, pp. 503–513.
[17] A. M. İleri, M. O. Külekci, and B. Xu, “Shortest unique substring query revisited,” in Proceedings of the 25th Annual Symposium on Combinatorial Pattern Matching (CPM), 2014, pp. 172–181.
[18] A. M. İleri, M. O. Külekci, and B. Xu, “A simple yet time-optimal and linear-space algorithm for shortest unique substring queries,” Theoretical Computer Science, vol. 562, no. 0, pp. 621–633, 2015.
[19] P. Ko and S. Aluru, “Space efficient linear time construction of suffix arrays,” Journal of Discrete Algorithms, vol. 3, no. 2-4, pp. 143–156, 2005.
[20] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park, “Linear-time longest-common-prefix computation in suffix arrays and its applications,” in Symposium on Combinatorial Pattern Matching, 2001, pp. 181–192.
[21] C. Sheng and Y. Tao, “New results on two-dimensional orthogonal range aggregation in external memory,” in Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2011, pp. 129–139.
[22] J. Fischer and V. Heun, “Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE,” in Proceedings of the 17th Annual Conference on Combinatorial Pattern Matching (CPM), 2006, pp. 36–48.
[23] D. Harel and R. E. Tarjan, “Fast algorithms for finding nearest common ancestors,”