On Stabbing Queries for Generalized Longest Repeat

Abstract

A longest repeat query on a string, motivated by its applications in many subfields including computational biology, asks for the longest repetitive substring(s) covering a particular string position (point query). In this paper, we extend the longest repeat query from point query to \emph{interval query}, allowing the search for longest repeat(s) covering any position interval, and thus significantly improve the usability of the solution. Our method for interval query takes a different approach using the insight from a recent work on \emph{shortest unique substrings} [1], as the prior work's approach for point query becomes infeasible in the setting of interval query. Using the critical insight from [1], we propose an indexing structure, which can be constructed in the optimal O(n) time and space for a string of size n, such that any future interval query can be answered in O(1) time. Further, our solution can find \emph{all} longest repeats covering any given interval using optimal O(occ) time, where occ is the number of longest repeats covering that given interval, whereas the prior O(n)-time and space work can find only one candidate for each point query. Experiments with real-world biological data show that our proposal is competitive with prior works in both time and space, while providing the new functionality of interval queries, as opposed to the point queries provided by prior works.


Bojian Xu
Department of Computer Science, Eastern Washington University, WA 99004, USA.

Keywords: string, repeats, longest repeats, stabbing query

I. INTRODUCTION

Repetitive structures and regularity finding in genomes and proteins is important as these structures play important roles in the biological functions of genomes and proteins [2]. One of the well-known features of DNA is its repetitive structure, especially in the genomes of eukaryotes. For example, overall about one-third of the whole human genome consists of repeated substrings [3]; about 10–25% of all known proteins have some form of repetitive structures [4]. In addition, a number of significant problems in molecular string analysis can be reduced to repeat finding [5]. Therefore, it is of great interest for biologists to find such repeats in order to understand their biological functions and solve other problems.

There has been an extensive body of work on repeat finding in the communities of bioinformatics and stringology. The notion of maximal repeat and super maximal repeat [2], [6], [7], [8] captures all the repeats of the whole string in a space-efficient manner. Maximal repeat finding over multiple strings and its duality with minimum unique substrings were also understood [9], [10], [11]. We refer readers to [2] (Section 7.11) for the discussion and further pointers to other types of repetitive structures, such as palindrome and tandem repeat. However, none of these notions of repeats tracks the locality of each repeat, and thus it is difficult for them to support position-specific queries (stabbing queries) in an efficient manner.

For this reason, the longest repeat query was recently proposed; it asks for the longest repetitive substring(s) that covers a particular string position [12], [13], [14]. Because any substring of a repetitive substring is also repetitive, the longest repeat query effectively provides a "stabbing" tool for finding most of the repeats that cover any particular string position. The algorithm by Schnattinger et al. [13] for computing bidirectional matching statistics can be used to compute the rightmost longest repeat covering every string position, whereas the study by İleri et al. [12] can find the leftmost longest repeat for every string position. Both solutions use optimal $O(n)$ time and space for finding the longest repeat for all the $n$ string positions. By storing the pre-computed longest repeats of every position, they are able to answer any future longest repeat query in $O(1)$ time, and thus achieve the amortized $O(1)$ time cost in finding the longest repeat of any arbitrary string position. Since it is not clear how to parallelize the optimal algorithms in [12], [13], the recent study in [14] proposed a time sub-optimal but parallelizable algorithm, so as to take advantage of modern multi-processor computing platforms such as general-purpose graphics processing units.

A preliminary version of this work appeared as a regular paper in the Proceedings of IEEE International Conference on Bioinformatics and Biomedicine (BIBM), November 9–12, 2015, Washington D.C., USA.

II. PROBLEM STATEMENT

We consider a string $S[1..n]$, where each character $S[i]$ is drawn from an alphabet $\Sigma = \{1, 2, \ldots, \sigma\}$. A substring $S[i..j]$ of $S$ represents $S[i]S[i+1] \ldots S[j]$ if $1 \le i \le j \le n$, and is an empty string if $i > j$. String $S[i'..j']$ is a proper substring of another string $S[i..j]$ if $i \le i' \le j' \le j$ and $j' - i' < j - i$. The length of a non-empty substring $S[i..j]$, denoted as $|S[i..j]|$, is $j - i + 1$. We define the length of an empty string as zero. A prefix of $S$ is a substring $S[1..i]$ for some $i$, $1 \le i \le n$. A proper prefix $S[1..i]$ is a prefix of $S$ where $i < n$. A suffix of $S$ is a substring $S[i..n]$ for some $i$, $1 \le i \le n$. A proper suffix $S[i..n]$ is a suffix of $S$ where $i > 1$. We say character $S[i]$ occupies the string position $i$. We say substring $S[i..j]$ covers the position interval $[x..y]$ of $S$, if $i \le x \le y \le j$. In the case $x = y$, we say substring $S[i..j]$ covers the position $x$ (or $y$) of string $S$. For two strings $A$ and $B$, we write $A = B$ (and say $A$ is equal to $B$), if $|A| = |B|$ and $A[i] = B[i]$ for $i = 1, 2, \ldots, |A|$. We say $A$ is lexicographically smaller than $B$, denoted as $A < B$, if (1) $A$ is a proper prefix of $B$, or (2) $A[1] < B[1]$, or (3) there exists an integer $k > 1$ such that $A[i] = B[i]$ for all $1 \le i \le k-1$ but $A[k] < B[k]$. A substring $S[i..j]$ of $S$ is unique, if there does not exist another substring $S[i'..j']$ of $S$, such that $S[i..j] = S[i'..j']$ but $i \ne i'$. A character $S[i]$ is a singleton, if it is unique. A substring is a repeat if it is not unique.

Definition 1. A longest repeat (LR) covering string position interval $[x..y]$, denoted as $LR_x^y$, is a repeat substring $S[i..j]$, such that: (1) $i \le x \le y \le j$, and (2) there does not exist another repeat substring $S[i'..j']$, such that $i' \le x \le y \le j'$ and $j' - i' > j - i$.
Obviously, for any string position interval $[x..y]$, if $S[x..y]$ is not unique, $LR_x^y$ must exist, because at least $S[x..y]$ itself is a repeat. Further, there might be multiple choices for $LR_x^y$. For example, if $S = abcabcddbca$, then $LR_2^3$ can be either $S[1..3] = abc$ or $S[2..4] = bca$.

Problem (generalized stabbing LR query). Given a string position interval $[x..y]$, $1 \le x \le y \le n$, find all choices of $LR_x^y$ or the fact that it does not exist.

We call the generalized stabbing LR query the interval query, which includes the point query as a special case where $x = y$. All prior works [13], [12], [14] only studied point query. Our goal is to find an efficient mechanism for finding the longest repeats of every possible string position interval.
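To make Definition 1 and the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the method of this paper: it simply enumerates every substring covering $[x..y]$ and keeps the longest ones that occur at least twice, so it is far slower than the solutions discussed later. The function names `is_repeat` and `longest_repeats` are hypothetical, chosen only for illustration; positions are 1-based as in the text.

```python
def is_repeat(s, i, j):
    """True if s[i..j] (1-based, inclusive) occurs at least twice in s."""
    sub = s[i - 1:j]
    first = s.find(sub)                    # always found: sub is a substring of s
    return s.find(sub, first + 1) != -1    # a second (possibly overlapping) occurrence


def longest_repeats(s, x, y):
    """All (i, j) such that s[i..j] is a longest repeat covering [x..y].

    Returns an empty list if no repeat covers the interval.
    """
    n = len(s)
    best, answers = 0, []
    for i in range(1, x + 1):              # candidate start positions, i <= x
        for j in range(y, n + 1):          # candidate end positions, j >= y
            if not is_repeat(s, i, j):
                continue
            length = j - i + 1
            if length > best:              # strictly longer: restart the answer list
                best, answers = length, [(i, j)]
            elif length == best:           # tie: another choice of the LR
                answers.append((i, j))
    return answers


if __name__ == "__main__":
    print(longest_repeats("abcabcddbca", 2, 3))
```

On the example string $S = abcabcddbca$ and the interval $[2..3]$, the sketch reports the two choices $S[1..3]$ and $S[2..4]$; for the interval $[7..8]$ it reports nothing, because $S[7..8] = dd$ is unique.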

III. PRIOR WORK AND OUR CONTRIBUTION

In addition to the related work discussed in Section I, there has recently been a sequence of work on finding shortest unique substrings (SUS) [15], [16], [17], [18], [1], of which Hu et al. [1] studied the generalized version of SUS finding: Given a string position interval $[x..y]$, $1 \le x \le y \le n$, find $SUS_x^y$, the shortest unique substring that covers the string position interval $[x..y]$, or the fact that such $SUS_x^y$ does not exist.

To the best of our knowledge, no efficient reduction from LR finding to SUS finding is known as of now. That is, given a set of SUSes covering a set of position intervals respectively, it is not clear how to find the set of LRs that cover that same set of position intervals respectively, by only using the string $S$, the given set of SUSes, and linear (in the set size) time cost for the reduction. The reason behind the hardness of obtaining such an efficient reduction is that simply chopping off one ending character of an SUS does not necessarily produce an LR.

For example, suppose $S = aa \ldots aba \ldots aa$ consists of $2n+1$ characters, where every character is $a$ except the middle one, which is $b$. Clearly, $SUS_{n-1}^{n} = S[n-1..n+1] = aab$, whereas $LR_{n-1}^{n} = S[1..n]$. Given $SUS_{n-1}^{n}$ and $S$ itself, it is not clear how to find $LR_{n-1}^{n} = S[1..n]$ using $O(1)$ time without involving other auxiliary data structures (otherwise, the reduction, which is still unknown, can become so complex that it is no better than a self-contained solution for finding LR, which is what this paper presents).

Due to the overall importance of repeat finding in bioinformatics and the lack of an efficient reduction from LR finding to SUS finding, it is our belief that providing and implementing a complete solution for generalized LR finding will be beneficial to the community. In summary, we make the following contributions.

1. We generalize the longest repeat query from point query to interval query, allowing the search for the longest repeat(s) covering any interval of string positions, and thus significantly improve the usability of the solution.

2. Because there are at most $n$ point queries for a string of size $n$, all prior works pre-compute and save the results of every possible point query, such that any future point query can be answered in $O(1)$ time. However, in the setting of interval queries, there are $\binom{n}{2} + n = \Theta(n^2)$ distinct intervals. It becomes impossible, under the $O(n)$ time and space budget, to achieve the amortized $O(1)$ query response time by pre-computing and storing the longest repeats covering each of the $\Theta(n^2)$ intervals. Therefore, a different approach is needed. Our approach uses the insight from the work by Hu et al. [1] that leads us to an indexing structure, which can be constructed using optimal $O(n)$ time and space, such that, by using this indexing structure, any future interval query can still be answered in $O(1)$ time. The $O(n)$ time and space costs are optimal because reading and saving the input string already needs $O(n)$ time and space.

3. Our work can find all longest repeats covering any given interval using optimal $O(occ)$ time, where $occ$ is the number of the longest repeats covering that interval. In contrast, the work in [12] and [13] can only find the leftmost and the rightmost candidate, respectively, and only supports point queries. The algorithm in [14] can find all longest repeats covering a string position, but their parallelizable sequential algorithm is sub-optimal in the time cost ($O(n^2)$, indeed) and only supports point queries as well.

4. We provide a generic implementation of our solution without assuming the alphabet size, making the software useful for the analysis of different types of strings. An experimental study with real-world biological data shows that our proposal is competitive with prior works, in both time and space, while additionally supporting interval queries.

IV. PREPARATION

The suffix array $SA[1..n]$ of the string $S$ is a permutation of $\{1, 2, \ldots, n\}$, such that for any $i$ and $j$, $1 \le i < j \le n$, we have $S[SA[i]..n] < S[SA[j]..n]$. That is, $SA[i]$ is the start position of the $i$th suffix in the sorted order of all the suffixes of $S$. The rank array $Rank[1..n]$ is the inverse of the suffix array. That is, $Rank[i] = j$ iff $SA[j] = i$. The longest common prefix (lcp) array $LCP[1..n+1]$ is an array of $n+1$ integers, such that for $i = 2, 3, \ldots, n$, $LCP[i]$ is the length of the lcp of the two suffixes $S[SA[i-1]..n]$ and $S[SA[i]..n]$. We set $LCP[1] = LCP[n+1] = 0$. The following table shows the suffix array and the lcp array of an example string $S = mississippi$.

    i    LCP[i]   SA[i]   suffix
    1    0        11      i
    2    1        8       ippi
    3    1        5       issippi
    4    4        2       ississippi
    5    0        1       mississippi
    6    0        10      pi
    7    1        9       ppi
    8    0        7       sippi
    9    2        4       sissippi
    10   1        6       ssippi
    11   3        3       ssissippi
    12   0        –       –

(In the literature, the lcp array is often defined as an array of $n$ integers. We include an extra zero at $LCP[n+1]$ as a sentinel to simplify the description of our upcoming algorithms.)

Definition 2. The left-bounded longest repeat (LLR) starting at position $k$, denoted as $LLR_k$, is a repeat $S[k..j]$, such that either $j = n$ or $S[k..j+1]$ is unique.

Clearly, for any string position $k$, if $S[k]$ is not a singleton, $LLR_k$ must exist, because at least $S[k]$ itself is a repeat. Further, if $LLR_k$ does exist, it must have only one choice, because $k$ is a fixed string position and the length of $LLR_k$ must be as long as possible.

Lemma 1 shows that, by using the rank array and the lcp array of the string $S$, it is easy to calculate any $LLR_i$ if it exists or to detect the fact that it does not exist.

Lemma 1 ([12]). For $i = 1, 2, \ldots, n$:

$$LLR_i = \begin{cases} S[i..i+L_i-1], & \text{if } L_i > 0 \\ \text{does not exist}, & \text{if } L_i = 0 \end{cases}$$

where $L_i = \max\{LCP[Rank[i]], LCP[Rank[i]+1]\}$.

Observe that an LLR can be a substring (proper suffix, indeed) of another LLR. For example, suppose $S = ababab$, then $LLR_4 = S[4..6] = bab$, which is a substring of $LLR_3 = S[3..6] = abab$. Formally, neighboring LLRs are related as stated in Lemma 2.
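As a rough illustration of Lemma 1, the following Python sketch builds the suffix, rank, and lcp arrays naively (by sorting the suffixes directly, which is an assumption made for brevity and not the linear-time constructions of [19], [20]) and then derives every $|LLR_i|$. The function names `build_arrays` and `llr_lengths` are hypothetical; index 0 of each list is unused padding so that the indices match the 1-based notation of the text.

```python
def build_arrays(s):
    """Return (sa, rank, lcp) for s, all 1-based with index 0 unused.

    lcp has n + 2 slots so that lcp[1] = lcp[n + 1] = 0, as in the text.
    Naive construction: simple to read, not linear time.
    """
    n = len(s)
    # sa[r] is the 1-based start position of the r-th smallest suffix.
    sa = [0] + sorted(range(1, n + 1), key=lambda p: s[p - 1:])
    rank = [0] * (n + 1)
    for r in range(1, n + 1):
        rank[sa[r]] = r                    # rank is the inverse of sa
    lcp = [0] * (n + 2)
    for r in range(2, n + 1):
        a, b = s[sa[r - 1] - 1:], s[sa[r] - 1:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[r] = k                         # lcp of the (r-1)-th and r-th suffixes
    return sa, rank, lcp


def llr_lengths(s):
    """L[i] = |LLR_i| for i = 1..n (0 means LLR_i does not exist); L[0] unused."""
    n = len(s)
    _, rank, lcp = build_arrays(s)
    return [0] + [max(lcp[rank[i]], lcp[rank[i] + 1]) for i in range(1, n + 1)]


if __name__ == "__main__":
    sa, rank, lcp = build_arrays("mississippi")
    print(sa[1:], lcp[1:])
    print(llr_lengths("ababab"))
```

For $S = mississippi$ the sketch reproduces the table above, and for $S = ababab$ it gives $|LLR_3| = 4$ and $|LLR_4| = 3$, in line with the example.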

Lemma 2 ([14]). $|LLR_i| \le |LLR_{i+1}| + 1$.

Definition 3. We say an LLR is useless if it is a substring of another LLR; otherwise, it is useful.

Lemma 3. Any existing longest repeat $LR_x^y$, $1 \le x \le y \le n$, must be a useful LLR.

Proof: (1) We first prove $LR_x^y$ must be an LLR. Assume that $LR_x^y = S[i..j]$ is not an LLR. Note that $S[i..j]$ is a repeat starting from position $i$. If $S[i..j]$ is not an LLR, it means $S[i..j]$ can be extended to some position $j' > j$, so that $S[i..j']$ is still a repeat and also covers the position interval $[x..y]$. That is, $|S[i..j']| > |S[i..j]|$. However, the contradiction is that $S[i..j]$ is already the longest repeat covering the position interval $[x..y]$. (2) Further, $LR_x^y$ must be a useful LLR, because if it is a useless LLR, it means there exists another LLR that covers the position interval $[x..y]$ but is longer than $LR_x^y$, which contradicts the fact that $LR_x^y$ is the longest repeat that covers the interval $[x..y]$.

V. LR FINDING FOR ONE INTERVAL

In this section, we propose an algorithm that takes as input a string position interval and returns the LR(s) covering that interval. The algorithm spends $O(n)$ time and space per query but does not need any indexing data structure. We present this algorithm here in case practitioners have only a small number of interval queries of interest, for which this lightweight algorithm will suffice. We start with finding the leftmost LR covering the given interval and give a trivial extension at the end for finding all LRs covering the given interval.

Lemma 4. For any $i$, $j$, $x$, and $y$, $1 \le i < j \le x \le y \le n$: If $LLR_j$ does not exist, or exists but does not cover the interval $[x..y]$, then $LLR_i$ does not exist or does not cover $[x..y]$.

Proof: We prove the lemma by contradiction. (1) Assume it is possible that when $LLR_j$ does not cover the interval $[x..y]$, $LLR_i$ can still cover $[x..y]$. Say, $LLR_i = S[i..k]$ for some $k \ge y$. It follows that $S[j..k]$ is also a repeat and covers $[x..y]$, which is a contradiction, because $LLR_j$, the longest repeat starting from string location $j$, does not cover $[x..y]$. (2) Assume it is possible that when $LLR_j$ does not exist, $LLR_i$ can still cover $[x..y]$. Say, $LLR_i = S[i..k]$ for some $k \ge y$. It follows that $S[j..k]$ is also a repeat and covers $[x..y]$, which is a contradiction, because $LLR_j$ does not exist at all, i.e., $S[j]$ is a singleton.

By Lemma 3, we know any LR must be an LLR, so we can find $LR_x^y$ covering a given interval $[x..y]$ by simply checking each $LLR_i$, $i \le x$, and picking the longest one that covers the interval $[x..y]$. Ties are resolved by picking the leftmost choice. Because of Lemma 4, early stop is possible to make the procedure faster in practice by checking every $LLR_i$ in the decreasing order of the value of $i = x, x-1, \ldots, 1$: the search will stop whenever we see an $LLR_i$ that does not cover the interval $[x..y]$ or does not exist at all. Algorithm 1 shows the pseudocode, which returns $(start, end)$, representing the start and ending positions of $LR_x^y$, respectively. If $LR_x^y$ does not exist, $(-1, -1)$ is returned.

Algorithm 1: Find the leftmost LR_x^y covering a given string position interval [x..y].
Input: (1) Two integers x and y, 1 ≤ x ≤ y ≤ n, representing a string position interval [x..y]. (2) The rank array and the lcp array of the string S.
Output: The leftmost LR_x^y or the fact that LR_x^y does not exist.
    1  start ← −1; end ← −1;    // start and end positions of LR_x^y
    2  for i = x down to 1 do
    3      L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};    // |LLR_i|
    4      if L = 0 or i + L − 1 < y then break;    // Early stop
    5      else if L ≥ end − start + 1 then start ← i; end ← i + L − 1;    // Pick the leftmost one
    6  return LR_x^y ← (start, end);

Lemma 5. Given the rank array and the lcp array of the string $S$, for any string position interval $[x..y]$, Algorithm 1 can find $LR_x^y$ or the fact that it does not exist, using $O(x)$ time and $O(n)$ space. If there are multiple choices for $LR_x^y$, the leftmost one is returned.

Proof: The algorithm clearly has no more than $x$ iterations and each iteration takes $O(1)$ time, so it costs $O(x)$ time. The space cost is primarily from the rank array and the lcp array, which altogether is $O(n)$, assuming each integer in these arrays costs a constant number of memory words. If multiple LRs cover position interval $[x..y]$, the leftmost LR will be returned, as is guaranteed by Line 5 of Algorithm 1.

Theorem 1. For any position interval $[x..y]$ in the string $S$, we can find $LR_x^y$ or the fact that it does not exist using $O(n)$ time and space. If there are multiple choices for $LR_x^y$, the leftmost one is returned.

Proof: The suffix array of $S$ can be constructed by existing algorithms using $O(n)$ time and space (e.g., [19]). After the suffix array is constructed, the rank array can be trivially created using another $O(n)$ time and space. We can then use the suffix array and the rank array to construct the lcp array using another $O(n)$ time and space [20]. Given the rank array and the lcp array, the time cost of Algorithm 1 is $O(x)$ (Lemma 5). So altogether, we can find $LR_x^y$ or the fact that it does not exist using $O(n)$ time and space. If there are multiple choices for $LR_x^y$, the leftmost choice will be returned, as is claimed in Lemma 5.

Extension: find all LRs covering a given position interval. It is trivial to extend Algorithm 1 to find all the LRs covering any given position interval $[x..y]$ as follows. We can first use a procedure similar to Algorithm 1 to calculate $|LR_x^y|$. If $LR_x^y$ does exist, then we start the procedure over to re-check every $LLR_i$, $i \le x$, and return every LLR whose length is equal to $|LR_x^y|$. Due to Lemma 4, the same early stop as in Algorithm 1 can be used for practical speedup. Algorithm 2 shows the pseudocode of this procedure, which clearly spends an extra $O(x)$ time.

Algorithm 2: Find all LRs that cover a given string position interval [x..y].
Input: (1) Two integers x and y, 1 ≤ x ≤ y ≤ n, representing a string position interval [x..y]. (2) The rank array and the lcp array of the string S.
Output: All LRs that cover the position interval [x..y] or the fact that no such LR exists.
    /* Find the length of LR_x^y. */
    length ← 0;
    for i = x down to 1 do
        L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};    // |LLR_i|
        if L = 0 or i + L − 1 < y then break;    /* LLR_i does not exist or does not cover [x..y], so we can early stop. */
        else if L > length then length ← L;
    /* Find all LRs that cover position interval [x..y]. */
    if length > 0 then    // LR_x^y does exist.
        for i = x down to 1 do
            L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};    // |LLR_i|
            if L = 0 or i + L − 1 < y then break;    // Early stop
            else if L = length then Print LR_x^y ← (i, i + length − 1);
    else Print LR_x^y ← (−1, −1);    // LR_x^y does not exist.

Using Theorem 1, we have:

Theorem 2. For any position interval $[x..y]$ in the string $S$, we can find all choices of $LR_x^y$ or the fact that $LR_x^y$ does not exist, using $O(n)$ time and space.

VI. A GEOMETRIC PERSPECTIVE OF THE USEFUL LLRS AND THE LR QUERIES

In this section, we present a geometric perspective of the useful LLRs and the generalized LR queries. This perspective is sparked by the idea presented in [1], and it serves as the intuition behind the algorithms in Sections VII and VIII, which share a similar spirit with those for SUS finding in [1]. We start with the following lemma, which says the useful LLRs are easy to compute.

Lemma 6. Given the lcp and rank arrays of the string $S$, we can compute its useful LLRs in $O(n)$ time and space.

Proof: By Lemma 2, we know if $LLR_{i-1}$ exists, the right boundary of $LLR_i$ is on or after the right boundary of $LLR_{i-1}$, for any $i \ge 2$, so we can construct the array of useful LLRs in one pass as follows: we calculate each $LLR_i$ using Lemma 1, for $i = 1, 2, \ldots, n$, and eliminate (useless) $LLR_i$ if $|LLR_i| = 0$ or $|LLR_i| = |LLR_{i-1}| - 1$.

Fig. 1. The 2d geometric perspective on the useful LLRs of string $S = aaababaabaaabaaab$ and its several generalized LR queries. (A) The LLRc array saves all the useful LLRs in the strictly increasing order of their string positions: $\{(1,5), (5,8), (7,13), (10,14), (11,17)\}$, where each useful LLR is a $(start, end)$ tuple, representing the start and ending position of the LLR. By viewing the start and end positions as the x and y coordinates, all the useful LLRs of the example string can be visualized as the dark dots in the figure. (B) Four generalized LR queries are visualized by the red, blue, green, and black polylines, numbered A–D, respectively. (C) All dark dots and polylines are on or above the 45° diagonal.

Algorithm 3: The calculation of LLRc, the array of useful LLRs, saved in ascending order of their positions.
Input: The rank and lcp arrays of the string S.
    j ← 1; prev ← 0;
    for i = 1 ... n do
        L ← max{LCP[Rank[i]], LCP[Rank[i] + 1]};    // |LLR_i|
        if L > 0 and L ≥ prev then    // LLR_i is useful.
            LLRc[j] ← (i, i + L − 1); j ← j + 1;
        prev ← L;
    return LLRc;

Definition 4. LLRc is an array of useful LLRs, which are saved in the ascending order of their start position. We use $LLRc.size$ to denote the number of elements in LLRc.

Algorithm 3 shows the procedure for the LLRc array construction in $O(n)$ time and space, provided with the rank array and lcp array of $S$. Each LLRc array element is a $(start, end)$ tuple, representing the start and ending positions of the useful LLR. Because no useful LLR is a substring of another useful LLR, we have the following fact.

Fact 1. All elements in the LLRc array have both their start and ending positions in strictly increasing order. That is, for any $i$ and $j$, $1 \le i < j \le LLRc.size$: $LLRc[i].start < LLRc[j].start$ and $LLRc[i].end < LLRc[j].end$.

If we view each useful LLR's start position as the x coordinate and ending position as the y coordinate, each useful LLR can be viewed as a dot in the 2d space. All the 2d dots, representing all the useful LLRs that are saved in the LLRc array, are distributed in the 2d space from the lower-left corner toward the upper-right corner. Because of Fact 1, no two dots share the same x or y coordinates. Further, since every dot's y coordinate is no less than its x coordinate, those dots are on or above the 45° diagonal. Figure 1 shows this geometric perspective of several useful LLRs.

Algorithm 4: Find LR using 2d DMQ.
Input: The lcp and rank arrays of the string S.
    Compute the LLRc array;    // Algorithm 3
    Build the 2d DMQ index for the LLRc array elements;    // Existing technique, e.g., [21]

    /* Find one choice of LR_x^y. */
    QueryOne2d(x, y):
        return 2dDMQ(x, y);    // return (−1, −1), if S_{x,y} = ∅.

    /* Find all choices of LR_x^y. */
    QueryAll2d(x, y):
        (x′, y′) ← 2dDMQ(x, y);
        if (x′, y′) ≠ (−1, −1) then FindAll2d(x, y, y′ − x′ + 1);    // Recursive searches start.

    FindAll2d(x, y, weight):    // Helper function
        (x′, y′) ← 2dDMQ(x, y);
        if (x′, y′) = (−1, −1) or (y′ − x′ + 1 < weight) then return;    // Recursion exits.
        Print (x′, y′);    // One choice of LR_x^y is found.
        if x′ − 1 ≥ 1 then FindAll2d(x′ − 1, y, weight);    // New recursive search.
        if y′ + 1 ≤ n then FindAll2d(x, y′ + 1, weight);    // New recursive search.

Definition 5. The weight of a dot $(x, y)$, representing a useful $LLR_x = S[x..y]$, is $|LLR_x| = y - x + 1$, the length of $LLR_x$.

Definition 6. $S_{x,y} = \{(a, b) \in LLRc \mid a \le x, b \ge y\}$.

If we draw in the 2d space a ┘-shaped orthogonal polyline whose corner is located at position $(x, y)$, $S_{x,y}$ is the set of 2d dots, representing those useful LLRs, that are located on the upper-left side (inclusive) of the polyline.

Because any LR must be a useful LLR (Lemma 3), from this geometric perspective, the answer to the $LR_x^y$ query becomes the heaviest dot(s) whose horizontal coordinate is $\le x$ and whose vertical coordinate is $\ge y$. That is, $LR_x^y$ are the heaviest dots in $S_{x,y}$. If $S_{x,y}$ is empty, it means $LR_x^y$ does not exist. Figure 1 shows this geometric perspective of several generalized LR queries.

VII. AN INDEX OF O(occ · log n) QUERY TIME

As is explained in Section VI, $LR_x^y$ is the heaviest dot(s) from the set $S_{x,y}$, if $S_{x,y}$ is not empty; otherwise, $LR_x^y$ does not exist. Finding one heaviest dot from $S_{x,y}$ is nothing but the well-known 2d dominance max query.

2d dominance max query (DMQ). Given a set of $n$ dots and any position $(x, y)$ in the 2d space, find the heaviest dot whose horizontal coordinate is $\le x$ and whose vertical coordinate is $\ge y$. If there are multiple choices, ties are resolved arbitrarily.

There exist indexing structures (e.g., [21]) that can be constructed on top of the $n$ dots using $O(n \log n)$ time and $O(n)$ space, such that by using the indexing structure, any future 2d DMQ can be answered in $O(\log n)$ time. The reduction from finding an LR to a 2d DMQ immediately gives us the QueryOne2d function in Algorithm 4 for finding one choice of an LR.
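The semantics of this reduction can be sketched in Python as follows. This is only a reference illustration under simplifying assumptions: the useful LLRs are computed by brute force rather than from the rank and lcp arrays, and the dominance max query is answered by a linear scan rather than by the $O(\log n)$-time index of [21]. The names `occurs_twice`, `useful_llrs`, and `dmq` are hypothetical.

```python
def occurs_twice(s, sub):
    """True if sub occurs at least twice in s (occurrences may overlap)."""
    first = s.find(sub)
    return first != -1 and s.find(sub, first + 1) != -1


def useful_llrs(s):
    """Useful LLRs of s as 1-based (start, end) dots, in increasing start order."""
    n = len(s)
    dots, prev = [], 0
    for i in range(1, n + 1):
        # Grow the repeat starting at position i for as long as it stays a repeat.
        length = 0
        while i + length <= n and occurs_twice(s, s[i - 1:i + length]):
            length += 1
        if length > 0 and length >= prev:  # same usefulness test as in Algorithm 3
            dots.append((i, i + length - 1))
        prev = length                      # remember |LLR_i| for the next position
    return dots


def dmq(dots, x, y):
    """Heaviest dot (a, b) with a <= x and b >= y, or (-1, -1) if none exists.

    Linear scan standing in for a real 2d dominance max index; on ties the
    leftmost heaviest dot is returned.
    """
    best, best_weight = (-1, -1), 0
    for a, b in dots:
        weight = b - a + 1
        if a <= x and b >= y and weight > best_weight:
            best, best_weight = (a, b), weight
    return best


if __name__ == "__main__":
    dots = useful_llrs("aaababaabaaabaaab")
    print(dots)
    print(dmq(dots, 11, 13))
```

With the dots in hand, one choice of $LR_x^y$ is simply `dmq(dots, x, y)`; replacing the scan by a proper index changes the running time but not the answers up to tie-breaking.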

Theorem 3. We can construct an indexing structure for a string $S$ of size $n$ using $O(n \log n)$ time and $O(n)$ space, such that by using the indexing structure any future generalized LR query can be answered in $O(\log n)$ time. If there exist multiple choices for the LR of interest, ties are resolved arbitrarily.

Proof: (1) The suffix array of $S$ can be constructed by existing algorithms using $O(n)$ time and space (e.g., [19]). After the suffix array is constructed, the rank array can be trivially created using $O(n)$ time and space. We can then use the suffix array and the rank array to construct the lcp array using another $O(n)$ time and space [20]. (2) Given the rank array and the lcp array, we can construct the LLRc array of useful LLRs using $O(n)$ time and space (Lemma 6 and Algorithm 3). (3) We then create the indexing structure for the LLRc array elements for 2d DMQ, using $O(n \log n)$ time and $O(n)$ space (e.g., [21]). By using this index, we can answer any future generalized LR query in $O(\log n)$ time, and ties are resolved arbitrarily.

A. Find all choices of any LR.

We know $LR_x^y$ are the heaviest dots in $S_{x,y}$, if $S_{x,y}$ is not empty; otherwise, $LR_x^y$ does not exist. Upon receiving a query for $LR_x^y$, we first perform a 2d DMQ, which returns one of the heaviest dots in $S_{x,y}$. If no such dot is returned, then $LR_x^y$ does not exist. Otherwise, suppose $(x', y')$ is the dot returned; then $(x', y')$ is one of the choices for $LR_x^y$.

Because all the dots representing the LLRc array elements have both their x and y coordinates strictly increasing (Fact 1), all other choices (if any) of $LR_x^y$ must lie in the union of $S_{x'-1,y}$ and $S_{x,y'+1}$. Therefore, we can find other choices of $LR_x^y$ by the following two recursive searches: one will find one of the heaviest dots in $S_{x'-1,y}$, the other will find one of the heaviest dots in $S_{x,y'+1}$. Each of these two recursive searches is again a 2d DMQ.

For each recursive search: (1) If the weight of the heaviest dot it finds is equal to $y' - x' + 1$, the length of $LR_x^y$, it will return the found dot as another choice of $LR_x^y$ and will then launch its own two new recursive searches, similar to what its caller has done in order to find other choices for $LR_x^y$; (2) otherwise, it stops and returns to its caller.

Function QueryAll2d in Algorithm 4 shows the pseudocode for finding all choices of $LR_x^y$.

Example 1 (Figure 1). Search A is for $LR_x^y$, a query with $11 \le x \le y \le 13$. That is to find all heaviest dots in $S_{x,y}$, which include dot $(7, 13)$ and dot $(11, 17)$. Suppose the 2d DMQ launched by search A returns dot $(7, 13)$, which has a weight of 7 and is one choice for $LR_x^y$. The next two recursive searches launched by search A will be search B looking for one of the heaviest dots in $S_{x,14}$ and search C looking for one of the heaviest dots in $S_{6,y}$.

Search B will return the heaviest dot $(11, 17)$ from $S_{x,14}$, whose weight is equal to 7, so the dot $(11, 17)$ is another choice of $LR_x^y$. Search B will then launch its own two new recursive searches for one heaviest dot in each of $S_{10,14}$ and $S_{x,18}$. (These two searches are not shown in Figure 1 for concision.) The search in $S_{10,14}$ returns dot $(10, 14)$, whose weight is less than 7, so the search stops and returns to its caller. The search in $S_{x,18}$ finds nothing, so it stops and returns to its caller. After all its recursive searches return, search B returns to its caller, which is search A.

Search C finds nothing in $S_{6,y}$, so it stops and returns to its caller, which is search A.

At this point, all the work of search A is finished, and we have found all the choices, which are $S[7..13]$ and $S[11..17]$ (or $LLRc[3]$ and $LLRc[5]$, equivalently), for $LR_x^y$.

Clearly, the same 2d DMQ index is used in finding all choices of an LR query, and there are no more than $2 \cdot occ + 1$ instances of 2d DMQ in the finding of all choices of an LR, where $occ$ is the number of choices of the LR. Because each 2d DMQ takes $O(\log n)$ time, we get the following theorem.

Theorem 4. We can construct an indexing structure for a string $S$ of size $n$ using $O(n \log n)$ time and $O(n)$ space, such that by using the indexing structure, we can find all choices of any LR in $O(occ \cdot \log n)$ time, where $occ$ is the number of choices of the LR being queried for.

VIII. AN INDEX OF O(occ) QUERY TIME

In this section, we present the optimal indexing structure for generalized LR finding. It is again based on the intuition derived from the geometric perspective on the relationship between useful LLRs and LR queries (Section VI).

Recall that the answer for an $LR_x^y$ query is the heaviest dot(s) from $S_{x,y}$, if $S_{x,y}$ is not empty. Due to Fact 1, $S_{x,y}$ corresponds to a continuous chunk of the LLRc array, if $S_{x,y}$ is not empty. Therefore, searching for one heaviest dot in $S_{x,y}$ becomes searching for one heaviest element within a continuous chunk of the LLRc array, which is nothing but the range minimum query on the array LLRc.

Range minimum query (RMQ).

Given an array A[1..n] of n comparable elements, find the index of the smallest element within A[i..j], for any given i and j, 1 ≤ i ≤ j ≤ n. If there are multiple choices, ties are resolved arbitrarily.

There exist indexing structures (e.g., [22], [23]) that can be constructed on top of the array A using O(n) time and space, such that any future RMQ can be answered in O(1) time.

The next issue is: upon receiving a query for $LR_x^y$, for some x and y, 1 ≤ x ≤ y ≤ n, how do we find the left and right boundaries of the contiguous chunk of LLRc over which we will perform an RMQ? (We should actually perform a range maximum query, which however can be trivially reduced to RMQ by viewing each array element as the negative of its actual value.) Due to Fact 1 and with the aid of the geometric perspective of the useful LLRs, we can observe that the left boundary of the chunk only depends on the value of y, whereas the right boundary of the chunk only depends on the value of x. Intuitively, if one sweeps a horizontal line upward starting from position y (inclusive), the LLRc array index of the first dot that the line hits is the left boundary of the RMQ's range. Similarly, if one sweeps a vertical line leftward starting from position x (inclusive), the LLRc array index of the first dot that the line hits is the right boundary of the RMQ's range. The range for RMQ is invalid if any one of the following three possibilities happens: 1) no dot is hit by the horizontal line; 2) no dot is hit by the vertical line; 3) the index of the left boundary of the range is larger than the index of the right boundary of the range. An invalid RMQ range means that $LR_x^y$ does not exist. See Figure 1 for examples.

More precisely, given the values of x and y from the query for $LR_x^y$, the left boundary $L_y$ and the right boundary $R_x$ of the range for RMQ can be determined as follows:

$$L_y = \begin{cases} \min\{\, i \mid LLRc[i].end \ge y \,\}, & \text{if } \{\, i \mid LLRc[i].end \ge y \,\} \ne \emptyset \\ -1, & \text{otherwise} \end{cases}$$

$$R_x = \begin{cases} \max\{\, i \mid LLRc[i].start \le x \,\}, & \text{if } \{\, i \mid LLRc[i].start \le x \,\} \ne \emptyset \\ -1, & \text{otherwise} \end{cases}$$

Further, we can pre-compute $L_y$ and $R_x$ for every x = 1, 2, ..., n and y = 1, 2, ..., n, and save the results for future reference. Algorithm 5 shows the procedure for computing the L and R arrays, which clearly uses O(n) time and space.

Algorithm 5: Compute L_i and R_i for i = 1, 2, ..., n.
Input: The LLRc array.
Output: The L and R arrays.
    for i = 1 ... n do L_i ← −1; R_i ← −1;    // Initialization.
    i ← 1;
    for y = 1 ... n do
        if y ≤ LLRc[i].end then L_y ← i;
        else if i < LLRc.size then i ← i + 1; L_y ← i;
        else break;
    i ← LLRc.size;
    for x = n ... 1 do
        if x ≥ LLRc[i].start then R_x ← i;
        else if i > 1 then i ← i − 1; R_x ← i;
        else break;

Lemma 7.

Algorithm 5 computes $L_1, L_2, \ldots, L_n$ and $R_1, R_2, \ldots, R_n$ using O(n) time and space.

Now we are ready to present the algorithm for finding one choice of a generalized LR query. Algorithm 6 (through Line 7) gives the pseudocode. After array LLRc is created, we compute the L and R arrays using the LLRc array (Algorithm 5). Then we create the RMQ structure for the LLRc array using existing techniques (e.g., [22], [23]), where the weight of each array element is defined as the length of the corresponding LLR (or, from the geometric perspective, the weight of the 2d dot representing that LLR). Upon receiving a query for $LR_x^y$, function QueryOneRMQ(x, y) performs an RMQ over the range LLRc[$L_y$..$R_x$] if 1 ≤ $L_y$ ≤ $R_x$ ≤ n; otherwise, it returns (−1, −1), meaning $LR_x^y$ does not exist. The answer returned by the RMQ is one of the choices for $LR_x^y$. If there exist multiple choices for $LR_x^y$, ties are resolved arbitrarily, depending on which heaviest element in the range is returned by the RMQ.

Theorem 5.

We can construct an indexing structure for a string S of size n using O(n) time and space, such that any future generalized LR query can be answered in O(1) time. If there exist multiple choices for the LR being queried for, ties are resolved arbitrarily.

Proof: (1) The suffix array of S can be constructed by existing algorithms using O(n) time and space (e.g., [19]). After the suffix array is constructed, the rank array can be trivially created using O(n) time and space. We can then use the suffix array and the rank array to construct the lcp array using another O(n) time and space [20]. (2) Given the rank array and the lcp array, we can construct the LLRc array of useful LLRs using O(n) time and space (Lemma 6 and Algorithm 3). (3) Given the LLRc array, we can compute the L and R arrays using another O(n) time and space (Lemma 7 and Algorithm 5). (4) We then create the RMQ structure for the LLRc array using another O(n) time and space, using existing techniques (e.g., [22], [23]). So, the total time and space cost for building the indexing structure is O(n). By using this RMQ indexing structure and the pre-computed L and R arrays, we can answer any future generalized LR query in O(1) time (the QueryOneRMQ function in Algorithm 6). If there exist multiple choices for the LR being searched for, ties are resolved arbitrarily, as determined by the RMQ structure.

Algorithm 6: Find LR using RMQ.
Input: The lcp and rank arrays of the string S.
 1  Compute the LLRc array;                             // Algorithm 3
 2  Compute the L and R arrays from the LLRc array;     // Algorithm 5
 3  Construct the RMQ structure for the LLRc array;     // [22], [23]
    /* Find one choice of LR_x^y. */
 4  QueryOneRMQ(x, y):
 5      if L_y ≠ −1 and R_x ≠ −1 and L_y ≤ R_x then
 6          return LLRc[RMQ(LLRc[L_y..R_x])];
 7      else return (−1, −1);                           // LR_x^y does not exist.
    /* Find all choices of LR_x^y. */
 8  QueryAllRMQ(x, y):
 9      if L_y ≠ −1 and R_x ≠ −1 and L_y ≤ R_x then
10          m ← RMQ(LLRc[L_y..R_x]);
11          weight ← LLRc[m].end − LLRc[m].start + 1;   // |LR_x^y|
12          FindAllRMQ(L_y, R_x, weight);               // Recursive searches start.
13      else return (−1, −1);                           // LR_x^y does not exist.
14  FindAllRMQ(ℓ, r, weight):                           // Helper function
15      m ← RMQ(LLRc[ℓ..r]);
16      if LLRc[m].end − LLRc[m].start + 1 < weight then
17          return;                                     // Recursion exits.
18      Print LLRc[m];                                  // One choice of LR_x^y is found.
19      if ℓ ≤ m − 1 then FindAllRMQ(ℓ, m − 1, weight); // New recursive search.
20      if r ≥ m + 1 then FindAllRMQ(m + 1, r, weight); // New recursive search.
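As an illustration of the index described above, the following Python sketch computes the L and R arrays and answers a one-choice query. It is a minimal sketch under stated assumptions, not the implementation evaluated in this paper: the names compute_boundaries, build_sparse_table, range_argmax, and query_one are hypothetical, the five-element LLRc array and the string length are made up for illustration, LLRc indices are 0-based, and a sparse table (O(n log n) construction) stands in for the O(n)-time RMQ structures of [22], [23].

```python
from typing import List, Optional, Tuple

LLR = Tuple[int, int]  # (start, end): 1-based, inclusive string positions


def compute_boundaries(llrc: List[LLR], n: int) -> Tuple[List[int], List[int]]:
    """Linear sweeps in the spirit of Algorithm 5 (0-based LLRc indices).

    left[y]  = smallest i with llrc[i].end   >= y, or -1 if none exists
    right[x] = largest  i with llrc[i].start <= x, or -1 if none exists
    Slot 0 of both lists is unused so that string positions stay 1-based.
    """
    left, right = [-1] * (n + 1), [-1] * (n + 1)
    i = 0
    for y in range(1, n + 1):
        while i < len(llrc) and llrc[i][1] < y:
            i += 1
        if i == len(llrc):
            break  # no LLR ends at or after y
        left[y] = i
    i = len(llrc) - 1
    for x in range(n, 0, -1):
        while i >= 0 and llrc[i][0] > x:
            i -= 1
        if i < 0:
            break  # no LLR starts at or before x
        right[x] = i
    return left, right


def build_sparse_table(weights: List[int]) -> List[List[int]]:
    """table[k][i] = index of a maximum of weights[i : i + 2**k]."""
    table = [list(range(len(weights)))]
    k = 1
    while (1 << k) <= len(weights):
        prev, half = table[-1], 1 << (k - 1)
        row = []
        for i in range(len(weights) - (1 << k) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if weights[a] >= weights[b] else b)
        table.append(row)
        k += 1
    return table


def range_argmax(table: List[List[int]], weights: List[int], lo: int, hi: int) -> int:
    """Index of a heaviest element in weights[lo..hi] (inclusive), in O(1)."""
    k = (hi - lo + 1).bit_length() - 1
    a, b = table[k][lo], table[k][hi - (1 << k) + 1]
    return a if weights[a] >= weights[b] else b


def query_one(llrc, left, right, table, weights, x: int, y: int) -> Optional[LLR]:
    """One choice of the longest repeat covering S[x..y], or None if none exists."""
    lo, hi = left[y], right[x]
    if lo == -1 or hi == -1 or lo > hi:
        return None
    return llrc[range_argmax(table, weights, lo, hi)]


# Hypothetical input: starts and ends are both strictly increasing (Fact 1).
llrc = [(1, 3), (4, 8), (7, 13), (10, 14), (11, 17)]
n = 20
weights = [end - start + 1 for start, end in llrc]
left, right = compute_boundaries(llrc, n)
table = build_sparse_table(weights)
answer = query_one(llrc, left, right, table, weights, x=11, y=13)
```

With this hypothetical input, the query range is LLRc[2..4] in 0-based indexing, and either of its two heaviest elements is a valid answer; which one is returned depends only on how the range-maximum step breaks ties.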

A. Find all choices of any LR.

Upon receiving a query for $LR_x^y$, we first perform an RMQ over the range LLRc[$L_y$..$R_x$] if such a range exists; otherwise, $LR_x^y$ does not exist, and we stop. Suppose the range LLRc[$L_y$..$R_x$] is valid and its RMQ returns m, the array index of a heaviest element in the range; then LLRc[m] is one of the choices for $LR_x^y$, and $|LR_x^y|$ = LLRc[m].end − LLRc[m].start + 1. If $LR_x^y$ has other choices, those choices must lie in the union of the ranges LLRc[$L_y$..m − 1] and LLRc[m + 1..$R_x$]. We can find those choices of $LR_x^y$ by recursively performing an RMQ on each of those two ranges. The recursion exits if the element returned by the RMQ has a weight smaller than $|LR_x^y|$ or the range for the RMQ is invalid. The QueryAllRMQ function in Algorithm 6 shows the pseudocode of this procedure for finding all choices of an LR query.
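The recursive search just described can be sketched in Python as follows. This is a hedged illustration rather than the paper's implementation: the function name find_all_heaviest and the sample weights are hypothetical, indices are 0-based, an explicit stack replaces the recursion, and the inner range_argmax is a plain linear scan standing in for an O(1) RMQ structure. The sketch also counts the range-maximum calls on valid ranges, so the count can be compared with the number of reported choices.

```python
from typing import List, Tuple


def find_all_heaviest(weights: List[int], lo: int, hi: int) -> Tuple[List[int], int]:
    """Return (sorted indices of all maxima of weights[lo..hi], number of range-max calls)."""

    def range_argmax(l: int, r: int) -> int:
        # Stand-in for an O(1) RMQ structure: a linear scan over weights[l..r].
        return max(range(l, r + 1), key=weights.__getitem__)

    if lo > hi:
        return [], 0  # invalid range: nothing to report

    calls = 1
    m = range_argmax(lo, hi)
    target = weights[m]          # plays the role of |LR| in the text
    found = [m]
    stack = [(lo, m - 1), (m + 1, hi)]
    while stack:
        l, r = stack.pop()
        if l > r:
            continue             # invalid range: this search exits immediately
        calls += 1
        m = range_argmax(l, r)
        if weights[m] < target:
            continue             # heaviest element is too light: this search exits
        found.append(m)          # another choice is found
        stack.append((l, m - 1))
        stack.append((m + 1, r))
    return sorted(found), calls


# Hypothetical LLR lengths; the query range is the last three elements.
choices, rmq_calls = find_all_heaviest([3, 5, 7, 5, 7], 2, 4)
```

Every reported choice spawns at most two further searches, so the number of range-maximum calls stays within 2 · occ + 1, where occ is the number of choices reported.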

Example 2 (Figure 1). Given the LLRc array of the example string in Figure 1, Algorithm 5 computes the L and R arrays. Upon receiving the query $LR_x^y$ of Search A, we first use the L and R arrays to retrieve the range [$L_y$, $R_x$] = [3, 5], which is a valid range for RMQ. Then we perform RMQ(LLRc[3..5]) of Search A, and either 3 or 5 can be returned, because both LLRc[3] and LLRc[5] are the heaviest elements in the range LLRc[3..5]. Suppose 3 is returned and saved in m; then we get LLRc[3] as one choice for $LR_x^y$, and $|LR_x^y|$ = |LLRc[3]| = 7.

Then, we find other choices for $LR_x^y$ by performing a recursive search on each of the ranges [$L_y$, m − 1] = [3, 2] and [m + 1, $R_x$] = [4, 5]. The first range is invalid, so the search exits (meaning Search C in Figure 1 will not be performed). The search on the second range [4, 5] (corresponding to Search B in Figure 1), which is valid, launches RMQ(LLRc[4..5]). The RMQ returns 5. Since |LLRc[5]| = $|LR_x^y|$ = 7, LLRc[5] is another choice for $LR_x^y$.

Then, the search on the range [4, 5] launches its own two recursive searches on the ranges [4, 5 − 1] = [4, 4] and [5 + 1, 5] = [6, 5]. The search on the first range finds that the heaviest element's weight is less than $|LR_x^y|$, so the search stops. Because the second range is invalid, the recursive search on that range stops immediately.

At this point, all choices for $LR_x^y$, which are LLRc[3] and LLRc[5], have been found.
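Answers of this kind can be cross-checked against the definition with a direct scan over the LLRc array. The sketch below is not part of the proposed index and takes time linear in the array size per query; it assumes that an LLRc element covers the interval [x, y] exactly when its start is at most x and its end is at least y, and both the function name brute_force_all and the sample array are hypothetical.

```python
from typing import List, Tuple


def brute_force_all(llrc: List[Tuple[int, int]], x: int, y: int) -> List[Tuple[int, int]]:
    """All longest (start, end) entries of llrc that cover positions x..y (1-based, inclusive)."""
    covering = [(s, e) for (s, e) in llrc if s <= x and e >= y]
    if not covering:
        return []  # no repeat covers the interval
    best = max(e - s + 1 for (s, e) in covering)
    return [(s, e) for (s, e) in covering if e - s + 1 == best]


# Hypothetical useful LLRs with strictly increasing starts and ends.
sample = [(1, 3), (4, 8), (7, 13), (10, 14), (11, 17)]
reference = brute_force_all(sample, x=11, y=13)
```

Comparing such a reference answer with the output of an index-based query on random intervals is a simple way to test an implementation.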

Clearly, the same indexing structure is used by all RMQs in the search for all choices of $LR_x^y$. Further, there are no more than 2 · occ + 1 RMQs in the finding of all choices of one LR, where occ is the number of choices of the LR. Because each RMQ takes O(1) time, we get the following theorem.

Theorem 6. We can construct an indexing structure for a string S of size n using O(n) time and space, such that by using the indexing structure, we can find all choices of any generalized LR in O(occ) time, where occ is the number of choices of the LR being queried for.

IX. IMPLEMENTATION AND EXPERIMENTS

We implement our proposals in C++, using the library binary of the implementation of the DMQ and RMQ structures from [1]. Our implementation is generic in that it does not assume the alphabet size of the underlying string, and thus supports LR queries over different types of strings.

We compare the performance of our proposals with the prior works, including the optimal O(n) time and space solution from [12] and the suboptimal sequential algorithm presented in [14]. Note that all prior works can only answer point queries. All programs involved in the experiments use the same libdivsufsort library (https://code.google.com/p/libdivsufsort) for the suffix array construction, and are compiled by gcc 4.7.2 with the -O3 option.

We conduct our experiments on a GNU/Linux machine with kernel version 3.2.51-1. The computer is equipped with an Intel Xeon 2.40GHz E5-2609 CPU with 10MB Smart Cache and has 16GB RAM. All experiments are conducted on real-world datasets including the DNA and Protein strings, downloaded from the Pizza&Chili Corpus (http://pizzachili.dcc.uchile.cl/texts.html). The datasets we use are the two 100MB DNA and Protein pure ASCII text files, each of which thus represents a string of 100 × 1024 × 1024 = 104,857,600 characters. Any other shorter strings involved in our experiments are prefixes of certain lengths, cut from the 100MB strings.

Fig. 2. Peak memory usage of different proposals for DNA and Protein strings of different sizes. (Two panels: DNA and Protein. x-axis: sequence size in MBs; y-axis: peak memory usage in MBs. Series: [12], [14], RMQ-based, DMQ-based.)

Fig. 3. Indexing structure construction time for DNA and Protein strings of different sizes. (Two panels: DNA and Protein. x-axis: sequence size in MBs; y-axis: time cost in seconds. Series: SA, Rank, LCP; RMQ; DMQ.)

A. Space

Here, we measure the peak memory usage of different proposals, using the Linux command /usr/bin/time -f "%M", which captures the maximum resident set size of a process during its lifetime. We do not save the output in RAM, in order to focus on the comparison of the memory usage of the algorithms themselves. It is also because practitioners often flush the outputs directly to disk files for future reuse.

Figure 2 shows the peak memory usage of different proposals that process DNA and protein strings of different sizes. It is worth noting that, by design, the memory usage of each proposal is independent of the query type, such as finding one choice vs. all choices of an LR, or point query vs. interval query. We have the following main observations:
– All proposals show space usage that is linear in the string size.
– Our DMQ-based proposal uses much more memory than the other proposals, mainly because of the high space demand of the DMQ structure.
– Our RMQ-based proposal uses nearly the same amount of memory as prior works, while significantly improving the usability of the technique by providing the functionality of interval queries.

B. Time

Figure 3 shows the construction time of the indexing structures used by different proposals. Note that all proposals need to construct the suffix array, rank array, and lcp array of the given string, and our proposals further use these auxiliary arrays to construct the DMQ and RMQ structures for interval queries. The following are the main observations:
– The construction of the DMQ structure takes much more time than that of the auxiliary arrays and the RMQ structure.
– Both the auxiliary arrays and the RMQ structure clearly show linearity in their construction time over string size.
– The construction of the RMQ structure takes less time than the construction of the auxiliary arrays, making our RMQ-based proposal practical while supporting interval queries.

Fig. 4. Query time of different proposals for DNA and Protein strings of different sizes. (Four panels: DNA, find one choice for each LR; DNA, find all choices for each LR; Protein, find one choice for each LR; Protein, find all choices for each LR. x-axis: sequence size in MBs; y-axis: time cost in seconds. Series: [12] (one-choice panels only), [14], and RMQ with interval size = 1, 5, 10, 15, 20.)

Figure 4 shows the time cost of various types of query. Our DMQ-based proposal is so slow in query response that we do not include it in the figure. For point queries, we plot the total time cost for all the point queries over all n string positions, where n is the string size. For interval queries with interval size δ, we plot the total time cost for all the interval queries over all n − δ + 1 intervals of the string. Note that only point queries are involved in the experiments with the proposals from [12] and [14], because they do not support interval queries. The two figures on the left show the case of finding only one choice for each LR, whereas the two on the right show the case of finding all choices for each LR. Because the proposal from [12] does not support the finding of all choices, it is not included in the two figures on the right side. The following are the main observations:
– All proposals show clear linearity of the total query time cost, meaning an amortized O(1) time cost for each query.
– In the setting of finding one choice for each LR (the two figures on the left of Figure 4), our RMQ-based proposal is the fastest in per-query response time, for both point queries and interval queries. Further, our RMQ-based proposal's interval query response becomes even faster when the interval size increases. That is because a longer interval is covered by fewer repeats, reducing the size of the search space for finding the LR covering the interval.
– In the setting of finding all choices for each LR (the two figures on the right side of Figure 4):
  • For point queries, our RMQ-based proposal is a little slower than [14], for the following reason. On average, an LR point query returns more choices than an interval query. Our technique needs to make a query to the index for finding every single choice, whereas the technique in [14] only needs one extra "walk" for finding all choices for a particular LR point query. Even though our technique is faster than [14] for finding one choice (the two figures on the left), when a particular point query has many choices, our technique can become slower in finding all choices.
  • As the interval size increases, our RMQ-based proposal becomes faster, because a longer interval on average has fewer choices for its LR, so our technique makes fewer queries to its index. Our technique's interval query can be even faster than the point query of [14] in finding all choices when the interval size increases. This is the case for sufficiently large interval sizes on both the DNA string (top-right figure) and the protein string (bottom-right figure).

X. CONCLUSION

We generalized the longest repeat query on a string from point query to interval query and proposed a solution for interval queries that is optimal in both time and space. Our approach is different from prior work, which can only handle point queries. Using the insight from [1], we proposed an indexing structure that can be built on top of the string using time and space linear in the string size, such that any future interval query can be answered in O(1) time. We implemented our proposals without assuming the alphabet size of the string, making them useful for different types of strings. An interesting direction for future work is to parallelize our proposal so as to take advantage of modern multi-core and multi-processor computing platforms, such as general-purpose graphics processing units.

REFERENCES

[1] X. Hu, J. Pei, and Y. Tao, "Shortest unique queries on strings," in Proceedings of International Symposium on String Processing and Information Retrieval (SPIRE), 2014, pp. 161–172.
[2] D. Gusfield, Algorithms on strings, trees and sequences: computer science and computational biology. Cambridge University Press, 1997.
[3] E. H. McConkey, Human Genetics: The Molecular Revolution. Boston, MA: Jones and Bartlett, 1993.
[4] X. Liu and L. Wang, "Finding the region of pseudo-periodic tandem repeats in biological sequences," Algorithms for Molecular Biology, vol. 1, no. 1, p. 2, 2006.
[5] H. M. Martinez, "An efficient method for finding repeats in molecular sequences," Nucleic Acids Research, vol. 11, no. 13, pp. 4629–4634, 1983.
[6] V. Becher, A. Deymonnaz, and P. A. Heiber, "Efficient computation of all perfect repeats in genomic sequences of up to half a gigabyte, with a case study on the human genome," Bioinformatics, vol. 25, no. 14, pp. 1746–1753, 2009.
[7] M. O. Kulekci, J. S. Vitter, and B. Xu, "Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree," IEEE Transactions on Computational Biology and Bioinformatics (TCBB), vol. 9, no. 2, pp. 421–429, 2012.
[8] T. Beller, K. Berger, and E. Ohlebusch, "Space-efficient computation of maximal and supermaximal repeats in genome sequences," in Proceedings of the 19th International Conference on String Processing and Information Retrieval (SPIRE), 2012, pp. 99–110.
[9] A. Bakalis, C. S. Iliopoulos, C. Makris, S. Sioutas, E. Theodoridis, A. K. Tsakalidis, and K. Tsichlas, "Locating maximal multirepeats in multiple strings under various constraints," Computer Journal, vol. 50, no. 2, pp. 178–185, 2007.
[10] C. S. Iliopoulos, W. F. Smyth, and M. Yusufu, "Faster algorithms for computing maximal multirepeats in multiple sequences," Fundamenta Informaticae, vol. 97, no. 3, pp. 311–320, 2009.
[11] L. Ilie and W. F. Smyth, "Minimum unique substrings and maximum repeats," Fundamenta Informaticae, vol. 110, no. 1-4, pp. 183–195, 2011.
[12] A. M. İleri, M. O. Külekci, and B. Xu, "On longest repeat queries," http://arxiv.org/abs/1501.06259.
[13] T. Schnattinger, E. Ohlebusch, and S. Gog, "Bidirectional search in a string with wavelet trees and bidirectional matching statistics," Information and Computation, vol. 213, pp. 13–22, Apr. 2012.
[14] Y. Tian and B. Xu, "On longest repeat queries using GPU," in Proceedings of the 20th International Conference on Database Systems for Advanced Applications (DASFAA), 2015, pp. 316–333.
[15] J. Pei, W. C. H. Wu, and M. Y. Yeh, "On shortest unique substring queries," in Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE), 2013, pp. 937–948.
[16] K. Tsuruta, S. Inenaga, H. Bannai, and M. Takeda, "Shortest unique substrings queries in optimal time," in Proceedings of International Conference on Current Trends in Theory and Practice of Computer Science (SOFSEM), 2014, pp. 503–513.
[17] A. M. İleri, M. O. Külekci, and B. Xu, "Shortest unique substring query revisited," in Proceedings of the 25th Annual Symposium on Combinatorial Pattern Matching (CPM), 2014, pp. 172–181.
[18] A. M. İleri, M. O. Külekci, and B. Xu, "A simple yet time-optimal and linear-space algorithm for shortest unique substring queries," Theoretical Computer Science, vol. 562, no. 0, pp. 621–633, 2015.
[19] P. Ko and S. Aluru, "Space efficient linear time construction of suffix arrays," Journal of Discrete Algorithms, vol. 3, no. 2-4, pp. 143–156, 2005.
[20] T. Kasai, G. Lee, H. Arimura, S. Arikawa, and K. Park, "Linear-time longest-common-prefix computation in suffix arrays and its applications," in Symposium on Combinatorial Pattern Matching, 2001, pp. 181–192.
[21] C. Sheng and Y. Tao, "New results on two-dimensional orthogonal range aggregation in external memory," in Proceedings of the Thirtieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), 2011, pp. 129–139.
[22] J. Fischer and V. Heun, "Theoretical and practical improvements on the rmq-problem, with applications to lca and lce," in Proceedings of the 17th Annual Conference on Combinatorial Pattern Matching (CPM), 2006, pp. 36–48.
[23] D. Harel and R. E. Tarjan, "Fast algorithms for finding nearest common ancestors,"