Gapped Indexing for Consecutive Occurrences
Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Teresa Anna Steiner
Abstract
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns P₁ and P₂ and a gap range [α, β] we can quickly find the consecutive occurrences of P₁ and P₂ with distance in [α, β], i.e., pairs of occurrences immediately following each other and with distance within the range. We present data structures that use Õ(n) space and query time Õ(|P₁| + |P₂| + n^{2/3}) for existence and counting and Õ(|P₁| + |P₂| + n^{2/3} occ^{1/3}) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using Õ(n) space must use Ω̃(|P₁| + |P₂| + √n) query time. To obtain our results we develop new techniques and ideas of independent interest, including a new suffix tree decomposition and hardness of a variant of the set intersection problem.

1 Introduction

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). An important variant of this problem is the gapped string indexing problem [6, 8, 10, 14, 27, 28, 31]. Here, the goal is to compactly represent the string such that given two patterns P₁ and P₂ and a gap range [α, β] we can quickly find occurrences of P₁ and P₂ with distance in [α, β].
Searching and indexing with gaps is frequently used in computational biology applications [6, 11, 13, 14, 19, 21, 22, 32, 35, 38]. Another variant is string indexing for consecutive occurrences [9, 40]. Here, the goal is to compactly represent the string such that given a pattern P and a gap range [α, β] we can quickly find consecutive occurrences of P with distance in [α, β], i.e., pairs of occurrences immediately following each other and with distance within the range.

In this paper, we consider the natural combination of these variants, which we call gapped indexing for consecutive occurrences. Here, the goal is to compactly represent the string such that given two patterns P₁ and P₂ and a gap range [α, β] we can quickly find the consecutive occurrences of P₁ and P₂ with distance in [α, β].

We can apply standard techniques to obtain several simple solutions to the problem. To state the bounds, let n be the size of S. If we store the suffix tree for S, we can answer queries by searching for both query strings, merging the results, and removing all non-consecutive occurrences. This leads to a solution using O(n) space and Õ(|P₁| + |P₂| + occ_{P₁} + occ_{P₂}) query time, where occ_{P₁} and occ_{P₂} denote the number of occurrences of P₁ and P₂, respectively. (Here Õ and Ω̃ ignore polylogarithmic factors.) However, occ_{P₁} + occ_{P₂} may be as large as Ω(n) and much larger than the size of the output.

Alternatively, we can obtain a fast query time in terms of the output at the cost of increasing the space to Ω(n²). To do so, store for each node v in the suffix tree the set of all consecutive occurrences (i, j) where i is a position below v, in a 2D range searching data structure organized by the lexicographic order of j and the distance j − i. To answer a query, we then perform a 2D range search in the structure corresponding to P₁ using the lexicographic range in the suffix tree defined by P₂ and the gap range.
This leads to a solution for reporting queries using Õ(n²) space and Õ(|P₁| + |P₂| + occ) time, where occ is the size of the output. For existence and counting, we obtain the same bound without the occ term.

In this paper, we introduce new solutions that significantly improve the above time-space trade-offs. Specifically, we present data structures that use Õ(n) space and query time Õ(|P₁| + |P₂| + n^{2/3}) for existence and counting and Õ(|P₁| + |P₂| + n^{2/3} occ^{1/3}) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using Õ(n) space must use Ω̃(|P₁| + |P₂| + √n) query time. To obtain our results we develop new techniques and ideas of independent interest, including a new suffix tree decomposition and hardness of a variant of the set intersection problem.

Throughout the paper, let S be a string of length n. Given two patterns P₁ and P₂, a consecutive occurrence in S is a pair of occurrences (i, j), 0 ≤ i < j < |S|, where i is an occurrence of P₁ and j is an occurrence of P₂, such that no other occurrence of either P₁ or P₂ occurs in between. The distance of a consecutive occurrence (i, j) is j − i. Our goal is to preprocess S into a compact data structure that given pattern strings P₁ and P₂ and a gap range [α, β] supports the following queries:

• Exists(P₁, P₂, α, β): determine if there is a consecutive occurrence of P₁ and P₂ with distance within the range [α, β].

• Count(P₁, P₂, α, β): return the number of consecutive occurrences of P₁ and P₂ with distance within the range [α, β].

• Report(P₁, P₂, α, β): report all consecutive occurrences of P₁ and P₂ with distance within the range [α, β].

We present new data structures with the following bounds:

Theorem 1.
Given a string of length n, we can

(i) construct an O(n) space data structure that supports Exists(P₁, P₂, α, β) and Count(P₁, P₂, α, β) queries in O(|P₁| + |P₂| + n^{2/3} log^ε n) time for constant ε > 0, or

(ii) construct an O(n log n) space data structure that supports Report(P₁, P₂, α, β) queries in O(|P₁| + |P₂| + n^{2/3} occ^{1/3} log n log log n) time, where occ is the size of the output.

Hence, ignoring polylogarithmic factors, Theorem 1 achieves Õ(n) space and query time Õ(|P₁| + |P₂| + n^{2/3}) for existence and counting and Õ(|P₁| + |P₂| + n^{2/3} occ^{1/3}) for reporting. Compared to the above mentioned simple suffix tree approach that finds all occurrences of the query strings and merges them, we match the Õ(n) space while reducing the dependency on n in the query time from worst-case Ω(|P₁| + |P₂| + n) to Õ(|P₁| + |P₂| + n^{2/3}) for Exists and
Count queries and Õ(|P₁| + |P₂| + n^{2/3} occ^{1/3}) for Report queries.

We complement Theorem 1 with a conditional lower bound based on the set intersection problem. Specifically, we use the Strong SetDisjointness Conjecture from [20] to obtain the following result:
Theorem 2.
Assuming the Strong SetDisjointness Conjecture, any data structure on a string S of length n that supports Exists queries in O(n^δ + |P₁| + |P₂|) time, for δ ∈ [0, 1/2], requires Ω̃(n^{2−2δ−o(1)}) space. This bound also holds if we limit the queries to only support ranges of the form [0, β], and even if the bound β is known at preprocessing time.

With δ = 1/2, Theorem 2 implies that any near linear space solution must have query time Ω̃(|P₁| + |P₂| + √n). Thus, Theorem 1 is optimal within a factor roughly n^{1/6}. On the other hand, with δ = 0, Theorem 2 implies that any solution with optimal Õ(|P₁| + |P₂|) query time must use Ω̃(n^{2−o(1)}) space. Note that this matches the trade-off achieved by the above mentioned simple solution that combines suffix trees with two-dimensional range searching data structures.

Finally, note that Theorem 2 holds even when the gap range is of the form [0, β]. As a simple extension of our techniques, we show how to improve our solution from Theorem 1 to match Theorem 2 in this special case.

To obtain our results we develop new techniques and show new interesting properties of consecutive occurrences. We first consider
Exists and Count queries. The key idea is to split gap ranges into large and small distances. For large distances there can only be a limited number of consecutive occurrences, and we show how these can be efficiently handled using a segmentation of the string. For small distances, we cluster the suffix tree and store precomputed answers for selected pairs of nodes. Since the number of distinct distances is small, we obtain an efficient bound on the space.

We extend our solution for Exists and Count queries to handle Report queries. To do so, we develop a new decomposition of suffix trees, called the induced suffix tree decomposition, that recursively divides the suffix tree in half by index in the string. Hence, the decomposition is a balanced binary tree, where every node stores the suffix tree of a substring of S. We show how to traverse this structure to efficiently recover the consecutive occurrences.

For our conditional lower bound we show a reduction based on the set intersection problem. Along the way we show that set intersection remains hard even if all elements in the instance have the same frequency.

As mentioned, string indexing for gaps and consecutive occurrences are the most closely related lines of work to this paper. Another related area is document indexing, where the goal is to preprocess a collection of strings, called documents, to report those documents that contain patterns subject to various constraints. For a comprehensive overview of this area see the survey by Navarro [36].

A well studied line of work within document indexing is document indexing for top-k queries [12, 23, 24, 25, 26, 33, 34, 37, 39, 42, 43]. The goal is to efficiently report the top-k documents of smallest weight, where the weight is a function of the query. Specifically, the weight can be the distance of a pair of occurrences of the same or two different query patterns [25, 33, 37, 42]. The techniques for top-k indexing (see e.g. Hon et al. [25]) can be adapted to efficiently solve gapped indexing for consecutive occurrences in the special case when the gap range is of the form [0, β]. However, since these techniques heavily exploit that the goal is to find the top-k closest occurrences, they do not generalize to general gap ranges.

There are several results on conditional lower bounds for pattern matching and string indexing [4, 5, 20, 29, 30]. Notably, Ferragina et al.
[16] and Cohen and Porat [15] reduce the two dimensional substring indexing problem to set intersection (though the goal was to prove an upper, not a lower, bound). In the two dimensional substring indexing problem the goal is to preprocess pairs of strings such that given two patterns we can output the pairs that contain one pattern each. Larsen et al. [30] prove a conditional lower bound for the document version of indexing for two patterns, i.e., finding all documents containing both of two query patterns. Goldstein et al. [20] show that similar lower bounds can be achieved via conjectured hardness of set intersection. Thus, there are several results linking indexing for two patterns and set intersection. Our reduction is still quite different, since we need a translation from intersection to distance.

The paper is organized as follows. In Section 2 we define notation and recall some useful results. In Section 3 we show how to answer Exists and Count queries, proving Theorem 1(i). In Section 4 we show how to answer Report queries, proving Theorem 1(ii). In Section 5 we prove the lower bound, proving Theorem 2. Finally, in Section 6 we apply our techniques to solve the variant where α = 0.

2 Preliminaries
Strings. A string S of length n is a sequence S[0]S[1]...S[n − 1] of characters from an alphabet Σ. A contiguous subsequence S[i, j] = S[i]S[i + 1]...S[j] is a substring of S. The substrings of the form S[i, n − 1] are the suffixes of S. The suffix tree [44] is a compact trie of all suffixes of S$, where $ is a symbol not in the alphabet that is lexicographically smaller than any letter in the alphabet. Each leaf is labelled with the index i of the suffix S[i, n − 1] it corresponds to. Using perfect hashing [18], the suffix tree can be stored in O(n) space and solve the string indexing problem (i.e., find and report all occurrences of a pattern P) in O(m + occ) time, where m is the length of P and occ is the number of times P occurs in S.

For any node v in the suffix tree, we define str(v) to be the string found by concatenating all labels on the path from the root to v. The locus of a string P, denoted locus(P), is the minimum depth node v such that P is a prefix of str(v). The suffix array stores the suffix indices of S$ in lexicographic order. We identify each leaf in the suffix tree with the suffix index it represents. The suffix tree has the property that the leaves below any node represent suffixes that appear in consecutive order in the suffix array. For any node v in the suffix tree, range(v) denotes the range that v spans in the suffix array. The inverse suffix array is the inverse permutation of the suffix array, that is, an array where the ith element is the index of suffix i in the suffix array.

Orthogonal range successor.
The orthogonal range successor problem is to preprocess an array A[0, ..., n − 1] into a data structure that efficiently supports the following queries:

• RangeSuccessor(a, b, x): return the successor of x in A[a, ..., b], that is, the minimum y > x such that there is an i ∈ [a, b] with A[i] = y.

• RangePredecessor(a, b, x): return the predecessor of x in A[a, ..., b], that is, the maximum y < x such that there is an i ∈ [a, b] with A[i] = y.

3 Existence and Counting

In this section we give a data structure that can answer Exists and Count queries. The main idea is to split the query interval into "large" and "small" distances. For large distances we exploit that there can only be a small number of consecutive occurrences, and we check them with a simple segmentation of S. For small distances we cluster the suffix tree and precompute answers for selected pairs of nodes.

We first show how to use orthogonal range successor queries to find consecutive occurrences. Then we define the clustering scheme used for the suffix tree and give the complete data structure.

3.1 Finding Consecutive Occurrences

Assume we have found the loci of P₁ and P₂ in the suffix tree. Then we can answer the following queries in a constant number of orthogonal range successor queries:

• FindConsecutive_{P₁}(i): given an occurrence i of P₁, return the consecutive occurrence (i, j) of P₁ and P₂, if it exists, and No otherwise.

• FindConsecutive_{P₂}(j): given an occurrence j of P₂, return the consecutive occurrence (i, j) of P₁ and P₂, if it exists, and No otherwise.

Given a query FindConsecutive_{P₁}(i), we answer as follows. Compute j = RangeSuccessor(range(locus(P₂)), i) to get the closest occurrence of P₂ after i. Compute i′ = RangePredecessor(range(locus(P₁)), j) to get the closest occurrence of P₁ before j. If i = i′ then no other occurrence of P₁ exists between i and j and they are consecutive. In that case we return (i, j). Otherwise, we return No. Similarly, we can answer FindConsecutive_{P₂}(j) by first doing a RangePredecessor and then a RangeSuccessor query. Thus, given the loci of both patterns and a specific occurrence of either P₁ or P₂, we can in a constant number of RangeSuccessor and RangePredecessor queries find the corresponding consecutive occurrence, if it exists.
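The procedure above can be sketched in a few lines. The following Python sketch is illustrative only: the naive suffix array construction and the linear-scan range queries stand in for the efficient structures from the preliminaries, and the helper names are ours, not the paper's.

```python
def suffix_array(s):
    # Naive construction: sort suffix start positions by the suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])

def range_successor(A, a, b, x):
    # Minimum y > x among A[a..b]; None if no such value exists.
    ys = [y for y in A[a:b + 1] if y > x]
    return min(ys) if ys else None

def range_predecessor(A, a, b, x):
    # Maximum y < x among A[a..b]; None if no such value exists.
    ys = [y for y in A[a:b + 1] if y < x]
    return max(ys) if ys else None

def find_consecutive_p1(A, r1, r2, i):
    # FindConsecutive_{P1}(i): given an occurrence i of P1, return the
    # consecutive occurrence (i, j) of P1 and P2, or None. Here A is the
    # suffix array, and r1, r2 are the inclusive suffix array ranges
    # range(locus(P1)) and range(locus(P2)).
    j = range_successor(A, r2[0], r2[1], i)          # closest P2 occurrence after i
    if j is None:
        return None
    i_prime = range_predecessor(A, r1[0], r1[1], j)  # closest P1 occurrence before j
    return (i, j) if i_prime == i else None
```

For example, for S = "abab" we get A = suffix_array("abab") = [2, 0, 3, 1]; with P₁ = "a" (suffix array range [0, 1]) and P₂ = "b" (range [2, 3]), find_consecutive_p1(A, (0, 1), (2, 3), 0) returns the consecutive occurrence (0, 1).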
3.2 Data Structure

To build the data structure we will use a cluster decomposition of the suffix tree.
Cluster Decomposition
A cluster decomposition of a tree T is defined as follows: For a connected subgraph C ⊆ T, a boundary node is a node v ∈ C such that either v is the root of T, or v has an edge leaving C, that is, there exists an edge (v, u) in the tree T such that u ∈ T \ C. A cluster is a connected subgraph C of T with at most two boundary nodes. A cluster with one boundary node is called a leaf cluster. A cluster with two boundary nodes is called a path cluster. For a path cluster C, the two boundary nodes are connected by a unique path. We call this path the spine of C. A cluster partition is a partition of T into clusters, i.e., a set CP of clusters such that ∪_{C ∈ CP} V(C) = V(T), ∪_{C ∈ CP} E(C) = E(T), and no two clusters in CP share any edges. Here, E(G) and V(G) denote the edge and vertex set of a (sub)graph G, respectively. We need the next lemma, which follows from well-known tree decompositions [1, 2, 3, 17] (see Bille and Gørtz [7] for a direct proof).

Lemma 3.
Given a tree T with n nodes and a parameter τ, there exists a cluster partition CP such that |CP| = O(n/τ) and every C ∈ CP has at most τ nodes. Furthermore, such a partition can be computed in O(n) time.

Data Structure
We build a clustering of the suffix tree of S as in Lemma 3, with cluster size at most τ, where τ is some parameter satisfying 0 < τ ≤ n. Then the counting data structure consists of:

• The suffix tree of S, with some additional information for each node. For each node v we store:
  – The range v spans in the suffix array, i.e., range(v).
  – A bit that indicates if v is on a spine.
  – If v is on a spine, a pointer to the lower boundary node of the spine.
  – If v is a leaf, the local rank of v, that is, the rank of v in the text order of the leaves in the cluster that contains v. Note that this is at most τ.

• The inverse suffix array of S.

• A range successor data structure on the suffix array of S.

• An array M(u, v) of length ⌊n/τ⌋ + 1 for every pair of boundary nodes (u, v). For 1 ≤ x ≤ ⌊n/τ⌋, M(u, v)[x] is the number of consecutive occurrences (i, j) of str(u) and str(v) with distance at most x. We set M(u, v)[0] = 0.

Denote M(u, v)[α, β] = M(u, v)[β] − M(u, v)[α − 1]. Then M(u, v)[α, β] is the number of consecutive occurrences of str(u) and str(v) with a distance in [α, β].

Space Analysis.
We store a constant amount per node in the suffix tree. The suffix tree and inverse suffix array occupy O(n) space. For the orthogonal range successor data structure we use the data structure of Nekrich and Navarro [41], which uses O(n) space and O(log^ε n) query time, for constant ε > 0. There are O(n²/τ²) pairs of boundary nodes and for each pair we store an array of length O(n/τ). Therefore the total space consumption is O(n + n³/τ³).

3.3 Query Algorithm

We now show how to count the consecutive occurrences (i, j) with a distance in the interval, i.e., α ≤ j − i ≤ β. We call each such pair a valid occurrence. To answer a query we split the query interval [α, β] into two: [α, ⌊n/τ⌋] and [⌊n/τ⌋ + 1, β], and handle these separately.

Distances larger than ⌊n/τ⌋. We start by finding the loci of P₁ and P₂ in the suffix tree. As shown in Section 3.1, this allows us to find the consecutive occurrence containing a given occurrence of either P₁ or P₂. We implicitly partition the string S into segments of (at most) ⌊n/τ⌋ characters by calculating τ segment boundaries. Segment i, for 0 ≤ i < τ, contains the characters S[i · ⌊n/τ⌋, (i + 1) · ⌊n/τ⌋ − 1], and segment τ (if it exists) contains the characters S[τ · ⌊n/τ⌋, n − 1]. We find the last occurrence of P₁ in each segment by performing a series of RangePredecessor queries, starting from the beginning of the last segment. Each time an occurrence i is found, we perform the next query from the segment boundary to the left of i, continuing until the start of the string is reached. For each occurrence i of P₁ found in this way, we use FindConsecutive_{P₁}(i) to find the consecutive occurrence (i, j) if it exists. We check each of them, discard any with distance ≤ ⌊n/τ⌋, and count how many are valid.

Distances at most ⌊n/τ⌋. In this part, we only count valid occurrences with distance ≤ ⌊n/τ⌋. Consider the loci of P₁ and P₂ in the suffix tree. Let C_i denote the cluster that contains locus(P_i) for i = 1, 2. There are two main cases.
At least one locus is not on a spine
If either locus is in a small subtree hanging off a spine in a cluster or in a leaf cluster, we directly find all consecutive occurrences as follows: If locus(P₁) is in a small subtree, then we use FindConsecutive_{P₁}(i) on each leaf i below locus(P₁) to find all consecutive occurrences, count the valid occurrences, and terminate. If only locus(P₂) is in a small subtree, then we use FindConsecutive_{P₂}(j) for each leaf j below locus(P₂), count the valid occurrences, and terminate.

Both loci are on the spine
If neither locus is in a small subtree, then both are on a spine. Let b₁ and b₂ denote the lower boundary nodes of the clusters C₁ and C₂, respectively. There are two types of consecutive occurrences (i, j):

(i) Occurrences where i or j is inside C₁ resp. C₂.

(ii) Occurrences below the boundary nodes, that is, i is below b₁ and j is below b₂.

See Figure 1(a). We describe how to count the different types of occurrences next.

Type (i) occurrences
To find the valid occurrences (i, j) where either i ∈ C₁ or j ∈ C₂ we do as follows. First we find all the consecutive occurrences (i, j) where i is a leaf in C₁ by computing FindConsecutive_{P₁}(i) for all leaves i below locus(P₁) in C₁. We count all valid occurrences we find in this way. Then we find all remaining consecutive occurrences (i, j) where j is a leaf in C₂ by computing FindConsecutive_{P₂}(j) for all leaves j below locus(P₂) in C₂. If FindConsecutive_{P₂}(j) returns a valid occurrence (i, j), we use the inverse suffix array to check if the leaf i is below b₁. This can be done by checking whether i's position in the suffix array is in range(b₁). If i is below b₁ we count the occurrence, otherwise we discard it.

Type (ii) occurrences
Next, we count the consecutive occurrences (i, j) where both i and j are below b₁ and b₂, respectively. We will use the precomputed table, but we have to be careful not to overcount. By its construction, M(b₁, b₂)[α, min(⌊n/τ⌋, β)] is the number of consecutive occurrences (i′, j′) of str(b₁) and str(b₂) where α ≤ j′ − i′ ≤ min(⌊n/τ⌋, β). However, not all of these occurrences (i′, j′) are necessarily consecutive occurrences of P₁ and P₂, as there could be an occurrence of P₁ in C₁ or P₂ in C₂ which is between i′ and j′. We call such a pair (i′, j′) a false occurrence. See Figure 1(b). We proceed as follows.

1. Set c = M(b₁, b₂)[α, min(⌊n/τ⌋, β)].

2. Construct the lists L_i containing the leaves in C_i that are below locus(P_i), sorted by text order, for i = 1, 2. We can obtain the lists as follows. Let [a, b] be the range of locus(P_i) and [a′, b′] = range(b_i). Sort the leaves in [a, a′ − 1] ∪ [b′ + 1, b] using their local rank.

3. Until both lists are empty, iteratively pick and remove the smallest element e from the start of either list. There are two cases.

• e is an element of L₁.
  – Compute j′ = RangeSuccessor(range(b₂), e) to get the closest occurrence of str(b₂) after e.
  – Compute i′ = RangePredecessor(range(b₁), j′) to get the closest occurrence of str(b₁) before j′.

• e is an element of L₂.
  – Compute i′ = RangePredecessor(range(b₁), e) to get the previous occurrence i′ of str(b₁).
  – Compute j′ = RangeSuccessor(range(b₂), i′) to get the following occurrence j′ of str(b₂).

If α ≤ j′ − i′ ≤ min(⌊n/τ⌋, β) and i′ < e < j′, decrement c by one. We skip any subsequent occurrences that are also inside (i′, j′). As the lists are sorted by text order, all occurrences that are within the same consecutive occurrence (i′, j′) are handled in sequence.

Finally, we add the counts of the different types of occurrences.

Figure 1: (a) Any consecutive occurrence (i, j) of P₁ and P₂ is either also a consecutive occurrence of str(b₁) and str(b₂), or i or j is within the cluster. The suffix array is shown at the bottom with the corresponding ranges marked. (b) Example of a false occurrence. Here (i′, j′) is a consecutive occurrence of str(b₁) and str(b₂), but not a consecutive occurrence of P₁ and P₂ due to i. The string S is shown at the bottom with the positions of the occurrences marked.

Correctness.
Consider a consecutive occurrence (i, j) where j − i > ⌊n/τ⌋. Such a pair must span a segment boundary, i.e., i and j cannot be in the same segment. As (i, j) is a consecutive occurrence, i is the last occurrence of P₁ in its segment and j is the first occurrence of P₂ in its segment. With the RangePredecessor queries we find exactly the occurrences of P₁ that are the last in their segment. We thus check and count all valid occurrences of large distance in the initial pass of the segments.

If either locus is in a small subtree we use FindConsecutive_{P₁}(·) or FindConsecutive_{P₂}(·) on the leaves below that locus, which by the arguments in Section 3.1 will find all consecutive occurrences.

Otherwise, both loci are on a spine. To count type (i) occurrences we use FindConsecutive_{P₁}(i) for all leaves i below locus(P₁) in C₁ and FindConsecutive_{P₂}(j) for all leaves j below locus(P₂) in C₂. However, any valid occurrence (i, j) where both i ∈ C₁ and j ∈ C₂ is found by both operations. Therefore, whenever we find a valid occurrence (i, j) via i = FindConsecutive_{P₂}(j) for j ∈ C₂, we only count the occurrence if i is below b₁. Thus we count all type (i) occurrences exactly once.

To count type (ii) occurrences we start with c = M(b₁, b₂)[α, min(⌊n/τ⌋, β)], which is the number of consecutive occurrences (i′, j′) of str(b₁) and str(b₂) where α ≤ j′ − i′ ≤ min(⌊n/τ⌋, β). Each (i′, j′) is either also a consecutive occurrence of P₁ and P₂, or there exists an occurrence of P₁ or P₂ between i′ and j′. Let (i′, j′) be a false occurrence and let w.l.o.g. i be an occurrence of P₁ with i′ < i < j′. Then i is a leaf in C₁, since (i′, j′) is a consecutive occurrence of str(b₁) and str(b₂). In step 3 we check for each leaf inside the clusters below the loci whether it is between a consecutive occurrence (i′, j′) of str(b₁) and str(b₂) and whether α ≤ j′ − i′ ≤ min(⌊n/τ⌋, β). In that case (i′, j′) is a false occurrence and we adjust the count c. As (i′, j′) can have multiple occurrences of P₁ and P₂ inside it, we skip subsequent occurrences inside (i′, j′). After adjusting for false occurrences, c is the number of type (ii) occurrences.

Time Analysis.
We find the loci in O(|P₁| + |P₂|) time. Then we perform a number of range successor and find consecutive queries. The time for a find consecutive query is bounded by the time to do a constant number of range successor queries. To count the large distances we check at most τ segment boundaries and thus perform O(τ) range successor and find consecutive queries. If either locus is not on a spine, we check the leaves below that locus. There are at most τ such leaves due to the clustering. To count type (i) occurrences we check the leaves below the loci and inside the clusters. There are at most 2τ such leaves in total. To count type (ii) occurrences we check two lists constructed from the leaves inside the clusters below the loci. There are again at most 2τ such leaves in total. For each of these O(τ) leaves we use a constant number of range successor and find consecutive queries. Thus the time for this part is bounded by the time to perform O(τ) range successor queries.

Using the data structure of Nekrich and Navarro [41], each range successor query takes O(log^ε n) time, so the total time for these queries is O(τ log^ε n). For type (ii) occurrences we sort two lists of size at most τ from a universe of size τ, which we can do in O(τ) time. Thus, the total query time is O(|P₁| + |P₂| + τ log^ε n). Setting τ = Θ(n^{2/3}) we get a data structure that uses O(n + n³/τ³) = O(n) space and has query time O(|P₁| + |P₂| + τ log^ε n) = O(|P₁| + |P₂| + n^{2/3} log^ε n), for constant ε > 0. Given an Exists query, we answer with a Count query, terminating when the first valid occurrence is found. This concludes the proof of Theorem 1(i).
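To illustrate the counting arrays M(u, v) used above, here is a minimal Python sketch. It assumes the distances of the consecutive occurrences of str(u) and str(v) have already been computed, which the preprocessing does for every pair of boundary nodes; the function names are ours.

```python
import bisect

def build_count_array(distances, limit):
    # M[x] = number of consecutive occurrences with distance <= x,
    # for 0 <= x <= limit; M[0] = 0 since all distances are positive.
    ds = sorted(distances)
    return [bisect.bisect_right(ds, x) for x in range(limit + 1)]

def count_range(M, alpha, beta):
    # M[alpha, beta] = M[beta] - M[alpha - 1]: the number of consecutive
    # occurrences with distance in [alpha, beta], assuming beta <= limit.
    prev = M[alpha - 1] if alpha > 0 else 0
    return M[beta] - prev
```

For example, for distances [1, 3, 3, 7] and limit 8, count_range returns 2 for the range [2, 3] and 1 for the range [4, 8]. The point of precomputing the prefix counts is that each range query then costs constant time.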
4 Reporting

In this section, we describe our data structure for reporting queries. Note that in Section 3, we explicitly find all valid occurrences except for type (ii) occurrences, where we use the precomputed values. In this section, we describe how we can use a recursive scheme to report these. The main idea, inspired by fast set intersection by Cohen and Porat [15], is to build a binary structure which allows us to recursively divide into subproblems of half the size. Intuitively, the subdivision is a binary tree where every node contains the suffix tree of a substring of S. We use this structure to find type (ii) occurrences by recursing on smaller trees. We define the binary decomposition of the suffix tree next. The details of the full solution follow after that.

Figure 2: The suffix tree of NANANANABATMAN$ together with its children trees T[0, 7] and T[8, 14].

4.1 The Induced Suffix Tree Decomposition

Let T be a suffix tree of a string S of length n. For an interval [a, b] of text positions, we define T[a, b] to be the subtree of T induced by the leaves in [a, b]: That is, we consider the subtree consisting of the leaves in [a, b] together with their ancestors. We then delete each node that has only one child in the subtree and contract its ingoing and outgoing edges. See Figure 2.

The induced suffix tree decomposition of T now consists of a higher level binary tree structure, the decomposition tree, where each node corresponds to an induced subtree of the suffix tree:

• The root of the decomposition tree corresponds to T[0, n − 1] and has level 0.

• For each T[a, b] of level i in the decomposition, if b − a > 1, its two children in the decomposition tree are T[a, c] and T[c + 1, b] where c = ⌊(a + b)/2⌋; we will sometimes refer to these as "children trees" to differentiate from children in the suffix tree.

The decomposition tree is a balanced binary tree and the total size of the induced subtrees in the decomposition is O(n log n): There are at most 2^i decomposition tree nodes on level i, each of which corresponds to an induced subtree of size O(n/2^i), and thus the total size of the trees on each of the O(log n) levels is O(n).

For each node v in T[a, b], we define the successor node of v in each of the children trees of T[a, b] in the following way: If v exists in the child tree, the successor node is v. Else, it is the closest descendant of v which is present. Note that from the way the induced subtrees are constructed, v has at most one successor node in each child tree.

The induced suffix tree decomposition of S consists of:

• Each T[a, b] stored as a compact trie.

• For each T[a, b], a "cropped" suffix array SA_{[a,b]}, that is, the suffix array of S[a, b] with the original indices within S.

• For each node v in T[a, b], a pointer from v to its successor node in each child tree, if it exists, and the interval in SA_{[a,b]} that corresponds to the leaves below v.

Since we store only constant information per node in any T[a, b], the total space usage of this is O(n log n).

4.2 Data Structure

The reporting data structure consists of:

• The induced suffix tree decomposition for S,

• An orthogonal range successor data structure on the suffix array, and

• The data structure from Section 3 for each T[a, b] in the induced suffix tree decomposition with parameters n_i and τ_i, where n_i = ⌊n/2^i⌋ and τ_i = Θ(n_i^{2/3}), such that n_i/τ_i = ⌊n_i^{1/3}⌋. The only change is that we do not store an orthogonal range successor data structure for each of the induced subtrees.

Space Analysis.
We use the O(n log log n) space and O(log log n) time orthogonal range successor structure of Zhou [45]. The existence data structure for each T[a, b] of level i is linear in n_i. Thus, by the arguments of Section 4.1, the total space is O(n log n).

The main idea behind the algorithm is the following: for large distances, as in Section 3, we implicitly segment S to find all consecutive occurrences of at least a certain distance. For small distances, we use the cluster decomposition and the counting arrays to decide whether valid occurrences exist. That is, if one of the loci is in a small subtree, we use FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·) to find all consecutive occurrences. Otherwise, we perform a query as in Section 3 to decide whether any valid occurrences exist, and if so, we recurse on smaller subtrees.

The idea here is that, in the induced suffix tree decomposition, the trees are divided in half by text position. Therefore, a consecutive occurrence is either fully contained in the left child tree, fully contained in the right child tree, or has the property that the occurrence of P_1 is the maximum occurrence in the left child tree and the occurrence of P_2 is the minimum occurrence in the right child tree. We check this border case each time we recurse.

In detail, we do the following: We find the loci of P_1 and P_2 in the suffix tree. As in the previous section, we check τ segment boundaries with τ = Θ(n^{2/3}) to find all consecutive occurrences with distance within [max(α, ⌊n^{1/3}⌋), β].
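The segment-boundary step just described can be illustrated with a small self-contained simulation, in which binary searches over sorted occurrence lists stand in for the orthogonal range successor structure. The function names and the brute-force checker are ours, positions are assumed distinct, and this is a sketch of the idea rather than the actual data structure:

```python
from bisect import bisect_left, bisect_right

def consecutive_pairs(occ1, occ2):
    """Brute force: pairs (i, j) with i an occurrence of P_1, j an occurrence
    of P_2, i < j, and no occurrence of either pattern strictly between."""
    events = sorted([(p, 1) for p in occ1] + [(p, 2) for p in occ2])
    return [(p, q) for (p, a), (q, b) in zip(events, events[1:])
            if a == 1 and b == 2]

def large_distance_pairs(occ1, occ2, n, tau, alpha, beta):
    """Find all consecutive pairs with distance in [max(alpha, n//tau), beta]
    by checking tau + 1 segment boundaries: any pair at distance >= n//tau
    straddles some boundary c, where a predecessor/successor query finds it."""
    occ1, occ2 = sorted(occ1), sorted(occ2)
    seg = max(1, n // tau)
    found = set()
    for c in range(0, n + seg, seg):       # the segment boundaries
        k = bisect_right(occ1, c) - 1      # RangePredecessor: last occ of P_1 <= c
        l = bisect_left(occ2, c + 1)       # RangeSuccessor: first occ of P_2 > c
        if k < 0 or l == len(occ2):
            continue
        i, j = occ1[k], occ2[l]
        # keep the pair only if it is consecutive, i.e. no occurrence of
        # either pattern lies strictly between i and j
        if bisect_left(occ1, j) - bisect_right(occ1, i) > 0:
            continue
        if bisect_left(occ2, j) - bisect_right(occ2, i) > 0:
            continue
        if max(alpha, seg) <= j - i <= beta:
            found.add((i, j))
    return found
```

For example, with occurrence lists [2, 9, 30] and [5, 25, 44], n = 48 and τ = 6 (segment length 8), the boundary checks recover exactly the consecutive pairs (9, 25) and (30, 44) of distance at least 8, while the pair (2, 5) is too close to straddle a boundary reliably and is left to the small-distance machinery.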
Now, we only have to find consecutive occurrences of distance within [α, min(β, ⌊n^{1/3}⌋)] in T = T[0, n − 1]. Let n_i = ⌊n/2^i⌋ and β_i = min(β, ⌊n_i^{1/3}⌋), and let T[a, b] be an induced subtree of level i.

To find all consecutive occurrences with distance within [α, β_i] in T[a, b] of level i, given the loci of P_1 and P_2 in T[a, b], recursively do the following:
• If any of the loci is not on a spine of a cluster, we find all consecutive occurrences using
FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·) and check for each of them whether it is valid; we report all such occurrences, then terminate.
• Else, we use the query algorithm for small distances from Section 3 to decide whether a valid occurrence with distance within [α, β_i] exists in T[a, b]. If such a valid occurrence exists, we recurse; that is, set c = ⌊(a + b)/2⌋. We use RangePredecessor to find the last occurrence of P_1 before and including c, and RangeSuccessor to find the first occurrence of P_2 after c. Then we check if they are consecutive (again using RangePredecessor and
RangeSuccessor), and whether the pair is a valid occurrence. If yes, we add it to the output. Then, for both S[a, c] and S[c + 1, b], we implicitly partition into segments of size ⌊n_{i+1}^{1/3}⌋ and find and output all valid occurrences of distance greater than n_{i+1}^{1/3}. Then we follow pointers to the successor nodes of the current loci to find the loci of P_1 and P_2 in the children trees T[a, c] and T[c + 1, b], and recurse on those trees to find all consecutive occurrences of distance within [α, β_{i+1}].

Correctness.
At any point before we recurse from level i to level i + 1, we check all consecutive occurrences of distance greater than n_{i+1}^{1/3} by segmenting the current substring of S. By the arguments of the previous section, we find all such valid occurrences. Thus, on the subtrees of level i + 1, we only need to care about consecutive occurrences with distance in [α, β_{i+1}].

By the properties of the induced suffix tree decomposition, a consecutive occurrence of P_1 and P_2 that is present in T[a, b] is either fully contained in T[a, c] or in T[c + 1, b], or the occurrence of P_1 is the last occurrence before and including c and the occurrence of P_2 is the first occurrence after c. We check this border case each time we recurse. Thus, no consecutive occurrence is lost when we recurse. If we stop the recursion, it is either because one of the loci was in a small subtree, or because no valid occurrence with distance within [α, β_i] exists in T[a, b]. In the first case, we have found all valid occurrences with distance within [α, β_i] in T[a, b] by the same arguments as in Section 3. Thus, we find all valid occurrences of P_1 and P_2.

Time Analysis.
For finding the loci, we first spend O(|P_1| + |P_2|) time in the initial suffix tree T[0, n − 1]; after that, the loci in the children trees are found in constant time by following the successor pointers. We only recurse on subtrees that contain at least one valid occurrence, so the traversed part of the decomposition tree, which has height O(log n), has at most O(occ) leaves. Thus, we traverse at most O(occ log n) nodes.

Each time we recurse, we spend a constant number of RangeSuccessor and
RangePredecessor queries to check the border cases. Additionally, we spend O(n_i^{2/3}) such queries on each node of level i that we visit in the decomposition tree: for finding the "large" occurrences, and additionally either for reporting everything within a small subtree or for doing an existence query. For finding large occurrences, there are O(n_i^{2/3}) segments to check. The number of orthogonal range successor queries used for existence queries or for reporting within a small subtree is bounded by the number of leaves within a cluster, which is also O(n_i^{2/3}).

Now, let x be the number of decomposition tree nodes we traverse, and let l_i, i = 1, . . . , x, be the level of each such node. The goal is to bound Σ_{i=1}^{x} (n/2^{l_i})^{2/3}. By the argument above, x = O(occ log n). Note that because the decomposition tree is binary, we have Σ_{i=1}^{x} 1/2^{l_i} ≤ log n. The number of queries to the orthogonal range successor data structure is thus asymptotically bounded by

  Σ_{i=1}^{x} (n/2^{l_i})^{2/3} = n^{2/3} Σ_{i=1}^{x} (1/2^{l_i})^{2/3} · 1
    ≤ n^{2/3} (Σ_{i=1}^{x} 1/2^{l_i})^{2/3} (Σ_{i=1}^{x} 1)^{1/3}
    = n^{2/3} (Σ_{i=1}^{x} 1/2^{l_i})^{2/3} x^{1/3} = O(n^{2/3} occ^{1/3} log n).

For the inequality, we use Hölder's inequality, which holds for all (x_1, . . . , x_k) ∈ R^k, (y_1, . . . , y_k) ∈ R^k and p, q both in (1, ∞) such that 1/p + 1/q = 1:

  Σ_{i=1}^{k} |x_i y_i| ≤ (Σ_{i=1}^{k} |x_i|^p)^{1/p} (Σ_{i=1}^{k} |y_i|^q)^{1/q}.   (1)

We apply (1) with p = 3/2 and q = 3.

Since the data structure of Zhou [45] uses O(log log n) time per query, the total running time of the algorithm is O(|P_1| + |P_2| + n^{2/3} occ^{1/3} log n log log n). This concludes the proof of Theorem 1(ii).

We now prove the conditional lower bound from Theorem 2, based on set intersection. We use the framework and conjectures as stated in Goldstein et al. [20].
Throughout the section, let I = S_1, S_2, . . . , S_m be a collection of m sets of total size N from a universe U. The SetDisjointness problem is to preprocess I into a compact data structure, such that given any pair of sets S_i and S_j, we can quickly determine whether S_i ∩ S_j = ∅. We use the following conjecture.

Conjecture 1 (Strong SetDisjointness Conjecture). Any data structure that can answer SetDisjointness queries in t query time must use Ω̃(N^2/t^2) space.

We define the following weaker variant of the SetDisjointness problem: the f-FrequencySetDisjointness problem is the SetDisjointness problem where every element occurs in precisely f sets. We now show that any solution to the f-FrequencySetDisjointness problem implies a solution to SetDisjointness, matching the complexities up to polylogarithmic factors.

Lemma 4.
Assuming the Strong SetDisjointness Conjecture, every data structure that can answer f-FrequencySetDisjointness queries in time O(N^δ), for δ ∈ [0, 1/2], must use Ω̃(N^{2−2δ−o(1)}) space.

Proof. Assume there is a data structure D solving the f-FrequencySetDisjointness problem in time O(N^δ) and space O(N^{2−2δ−ε}) for a constant ε with 0 < ε <
1. Let I = S_1, . . . , S_m be a given instance of SetDisjointness, where each S_i is a set of elements from the universe U, and assume w.l.o.g. that m is a power of two.

Define the frequency f_e of an element e as the number of sets in I that contain e. We construct log m instances I_1, . . . , I_{log m} of the f-FrequencySetDisjointness problem. For each j, 1 ≤ j ≤ log m, the instance I_j contains the following sets:
• For each i ∈ [1, m], a set S_i^j containing all e ∈ S_i that satisfy 2^{j−1} ≤ f_e < 2^j;
• 2^j − 1 "dummy sets", which contain extra copies of elements to make sure that all elements have the same frequency. That is, we add every element with 2^{j−1} ≤ f_e < 2^j to the first 2^j − f_e dummy sets. These sets will not be queried in the reduction.

Instance I_j has O(m) sets, and every element in it occurs exactly 2^j times. Further, the total number of elements in all the instances is at most 2N. We now build f-FrequencySetDisjointness data structures D_j = D(I_j) for each of the log m instances.

To answer a SetDisjointness query for two sets S_{i_1} and S_{i_2}, we query D_j for the sets S_{i_1}^j and S_{i_2}^j, for each 1 ≤ j ≤ log m. If there exists a j such that S_{i_1}^j and S_{i_2}^j are not disjoint, we output that S_{i_1} and S_{i_2} are not disjoint. Else, we output that they are disjoint.

If there exists e ∈ S_{i_1} ∩ S_{i_2}, let j be such that 2^{j−1} ≤ f_e < 2^j. Then e ∈ S_{i_1}^j ∩ S_{i_2}^j, and we will correctly output that the sets are not disjoint. If S_{i_1} and S_{i_2} are disjoint, then, since S_{i_1}^j is a subset of S_{i_1} and S_{i_2}^j is a subset of S_{i_2}, the queried sets are disjoint in every instance. Thus, we also answer correctly in this case.

Figure 3: Instance of the f-FrequencySetDisjointness problem reduced to Exists. Alphabet Σ = {0, 1} and fixed frequency f = 2, resulting in block size B = 2 · log m + 2.

Let N_j denote the total number of elements in I_j. For each j, we have N_j ≤ 2N and thus N_j^{2−2δ−ε} ≤ (2N)^{2−2δ−ε}.
Thus, the space complexity is asymptotically bounded by

  Σ_{j=1}^{⌈log m⌉} N_j^{2−2δ−ε} = O(N^{2−2δ−ε} log m).

Similarly, we have N_j^δ = O(N^δ), and so the time complexity is asymptotically bounded by

  Σ_{j=1}^{⌈log m⌉} N_j^δ = O(N^δ log m).

This is a contradiction to Conjecture 1, which requires Ω̃(N^{2−2δ}) space for query time Õ(N^δ).
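The bucketing step of the reduction is easy to simulate. The sketch below is our own illustration: plain set intersection stands in for the black-box data structures D_j, and the function names are ours. It splits an instance into frequency classes, pads with dummy sets so that every element in instance I_j occurs exactly 2^j times, and answers a disjointness query from the bucketed instances:

```python
from collections import defaultdict

def build_instances(sets):
    """Split a SetDisjointness instance into instances I_j in which every
    element occurs in exactly 2^j sets (real occurrences plus dummy sets)."""
    freq = defaultdict(int)
    for s in sets:
        for e in s:
            freq[e] += 1
    instances = {}
    j = 1
    while 2 ** (j - 1) <= len(sets):  # frequencies lie in [1, m]
        lo, hi = 2 ** (j - 1), 2 ** j
        # real sets, restricted to elements with frequency in [2^(j-1), 2^j)
        real = [{e for e in s if lo <= freq[e] < hi} for s in sets]
        # 2^j - 1 dummy sets; element e joins the first 2^j - f_e of them
        dummies = [set() for _ in range(hi - 1)]
        for e, f in freq.items():
            if lo <= f < hi:
                for d in range(hi - f):
                    dummies[d].add(e)
        instances[j] = (real, dummies)
        j += 1
    return instances

def disjoint(sets, i1, i2):
    """Answer a SetDisjointness query from the bucketed instances: the sets
    are disjoint iff their restrictions are disjoint in every instance I_j."""
    return all(not (real[i1] & real[i2])
               for real, _ in build_instances(sets).values())
```

For instance, on the input [{1, 2}, {2, 3}, {4}, {3, 5}], element 2 has frequency 2 and lands in instance I_2, where the query for the first two sets detects the intersection; sets 1 and 3 remain disjoint in every instance.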
We can reduce the f-FrequencySetDisjointness problem to Exists queries of the gapped indexing problem. Assume we are given an instance of the f-FrequencySetDisjointness problem with a total of N elements, where each distinct element occurs f times. Assume again w.l.o.g. that the number of sets m is a power of two. Assign to each set S_i in the instance a unique binary string w_i of length log m. Build a string S as follows: Consider an arbitrary ordering e_1, e_2, . . . of the distinct elements present in the instance. Let $ be an extra letter not in the alphabet. The first B = f · log m + f letters of S are the concatenation of w_i$ over all sets S_i that e_1 is contained in, sorted by i. This block is followed by B copies of $. Then, we have B symbols consisting of the strings for each set that e_2 is contained in, again followed by B copies of $, and so on. See Figure 3 for an example.

For a query for two sets S_i and S_j, where i < j, we set P_1 = w_i, P_2 = w_j, α = 0, and β = B. If the sets are disjoint, then there are no occurrences of w_i and w_j that are at most B apart. Otherwise, w_i and w_j occur in the same block, and w_j comes after w_i. The length of the string S is 2N log m + 2N: in the block for each element, we have log m + 1 letters for each of its occurrences, and each block is followed by a $-block of the same length.

This means that if we can solve Exists queries in s(n) space and t(n) + O(|P_1| + |P_2|) time, where n is the length of the string, then we can solve the f-FrequencySetDisjointness problem in s(2N log m + 2N) space and t(2N log m + 2N) + O(log m) time. Together with Lemma 4, Theorem 2 follows.

[0, β] Gaps
In this section, we consider the special case where the queries are one-sided intervals of the form [0, β]. We give a data structure supporting the following tradeoffs.

Theorem 5.
Given a string of length n, we can
(i) construct an O(n) space data structure that supports Exists(P_1, P_2, 0, β) queries in O(|P_1| + |P_2| + √n log^ε n) time, for constant ε > 0, or
(ii) construct an O(n log n) space data structure that supports Count(P_1, P_2, 0, β) and Report(P_1, P_2, 0, β) queries in O(|P_1| + |P_2| + √(n · occ) log log n) time, where occ is the size of the output.

Note that since these results match (up to log factors) the best known results for set intersection, this is about as good as we can hope for. We mention that for this specific problem, a similar tradeoff follows from the strategies used by Hon et al. [25]. The results from that paper include (among others) a data structure for documents such that, given a query of two patterns P_1 and P_2 and a number k, one can output the k documents with the closest occurrences of P_1 and P_2. Thus, the problem is slightly different; however, with some adjustments, the results from Theorem 5 follow (up to a log factor). We show a simple, direct solution.

The data structure is a simpler version of the data structure considered in the previous sections. The main idea is that for each pair of boundary nodes u and v, we do not have to store an array of distances, but only one number that carries all the information: the smallest distance of a consecutive occurrence of str(u) and str(v). Thus, for existence, we can cluster with τ = √n to achieve linear space, and we do not need to check large distances separately. For the reporting solution, we store the decomposition from Section 4.1 and use the matrix M to decide where to recurse. In the following, we describe the details.

Existence data structure.
For solving Exists queries in this setting, we cluster the suffix tree with parameter τ = √n. Again, we store the linear space orthogonal range successor data structure by Nekrich and Navarro [41] on the suffix array. For each pair of boundary nodes (u, v), we store at M(u, v) the minimum distance of a consecutive occurrence of str(u) and str(v). The total space is linear.

To query, we proceed similarly as in Section 3 for the "small distances": We find the loci of P_1 and P_2. If any of the loci is not on a spine, we check all consecutive occurrences using FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·). If both loci are on a spine, denote by b_1 and b_2 the lower boundary nodes of the respective clusters, and look up M(b_1, b_2). If M(b_1, b_2) ≤ β, we can immediately return Yes: if a valid occurrence (i′, j′) of str(b_1) and str(b_2) exists, then either (i′, j′) is a consecutive occurrence of P_1 and P_2, or there exists a consecutive occurrence of smaller distance. Otherwise, that is, if M(b_1, b_2) > β, all valid occurrences (i, j) have the property that either i is in the cluster of locus(P_1) or j is in the cluster of locus(P_2), and we check all such pairs using FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·). The running time is O(|P_1| + |P_2| + √n log^ε n).

Reporting data structure.
For the reporting data structure, we store the decomposition of the suffix tree as described in Section 4.1, and the O(n log log n) space orthogonal range successor data structure by Zhou [45] on the suffix array. For each induced subtree of level i in the decomposition, we store the existence data structure we just described.

Reporting algorithm.
The algorithm follows a similar, but simpler, recursive structure as in Section 4. We begin by finding the loci of P_1 and P_2. If either of the loci is not on a spine, we find all consecutive occurrences using FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·), check if they are valid, report these, and terminate. If both loci are on a spine, we check M(b_1, b_2) for the lower boundary nodes b_1 and b_2. If M(b_1, b_2) > β, all valid occurrences (i, j) have the property that either i is in the cluster of locus(P_1) or j is in the cluster of locus(P_2). We check all such pairs using FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·), report the valid occurrences, and terminate. If M(b_1, b_2) ≤ β, we recurse on the children trees; that is, we check the border case and follow pointers to the loci in the children trees.

Analysis.
The space is O(n log n), just as in Section 4.

For the time analysis, we spend O(√(n/2^{l_i})) orthogonal range successor queries on the nodes in the decomposition tree of level l_i where we stop the recursion. For all other nodes we visit in the tree traversal, we only spend a constant number of queries. In total, we visit O(occ log(n/occ) + occ) decomposition tree nodes (by following the analysis in [15]), and we spend O(√(n/2^{l_i})) orthogonal range successor queries on O(occ) many such nodes. We use the same notation as in Section 4. By x = O(occ) we now denote the number of nodes where we stop the algorithm and output. Since each such node can be seen as a leaf in a binary tree, Σ_{i=1}^{x} 1/2^{l_i} ≤ 1. Again, we apply Hölder's inequality (1), this time with p = q = 2 (the Cauchy–Schwarz inequality). We get as an asymptotic bound for the number of orthogonal range successor queries

  Σ_{i=1}^{x} √(n/2^{l_i}) = √n Σ_{i=1}^{x} √(1/2^{l_i}) · 1 ≤ √n √(Σ_{i=1}^{x} 1/2^{l_i}) √(Σ_{i=1}^{x} 1) ≤ √n √x = O(√(n · occ)).

Note that since occ log(n/occ) = O(occ √(n/occ)) = O(√(n · occ)), this brings the total number of orthogonal range successor queries to O(occ + √(n · occ)). Using the data structure by Zhou [45], the time bound from Theorem 5 follows.

We have considered the problem of gapped indexing for consecutive occurrences. We have given a linear space data structure that can count the number of such occurrences. For the reporting problem, we have given a near-linear space data structure. The running time for both includes an O(n^{2/3}) term, which leaves a gap of O(n^{1/6}) to the conditional lower bound of Ω̃(√n).
Thus, the most obvious open question is whether we can close this gap, either by improving the data structure or by finding a stronger lower bound.

Further, we have used the property that there can only be few consecutive occurrences of large distance. Thus, our solution cannot easily be extended to finding all pairs of occurrences with distance within the query interval. An open question is whether it is possible to obtain similar results for that problem. Lastly, document versions of similar problems have concerned themselves with finding all documents that contain P_1 and P_2, or the top-k such documents of smallest distance; conditional lower bounds for these problems are also known. It would be interesting to see if any of these results can be extended to finding all documents that contain a (consecutive) occurrence of P_1 and P_2 with distance within a query interval.

References

[1] Stephen Alstrup, Jacob Holm, Kristian de Lichtenberg, and Mikkel Thorup. Minimizing diameters of dynamic trees. In
Proc. 24th ICALP, pages 270–280, 1997.
[2] Stephen Alstrup, Jacob Holm, and Mikkel Thorup. Maintaining center and median in dynamic trees. In Proc. 7th SWAT, pages 46–56, 2000.
[3] Stephen Alstrup and Theis Rauhe. Improved labeling scheme for ancestor queries. In Proc. 13th SODA, pages 947–953, 2002.
[4] Amihood Amir, Timothy M. Chan, Moshe Lewenstein, and Noa Lewenstein. On hardness of jumbled indexing. In Proc. 41st ICALP, pages 114–125, 2014.
[5] Amihood Amir, Tsvi Kopelowitz, Avivit Levy, Seth Pettie, Ely Porat, and B. Riva Shalom. Mind the gap: Essentially optimal algorithms for online dictionary matching with one gap. In Proc. 27th ISAAC, pages 12:1–12:12, 2016.
[6] Johannes Bader, Simon Gog, and Matthias Petri. Practical variable length gap pattern matching. In Proc. 15th SEA, pages 1–16, 2016.
[7] Philip Bille and Inge Li Gørtz. The tree inclusion problem: In linear space and faster. ACM Trans. Algorithms, 7(3):1–47, 2011.
[8] Philip Bille and Inge Li Gørtz. Substring range reporting. Algorithmica, 69(2):384–396, 2014.
[9] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, and Teresa Anna Steiner. String Indexing for Top-k Close Consecutive Occurrences. In Proc. 40th FSTTCS, volume 182, pages 14:1–14:17, 2020.
[10] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards.
Theory Comput. Syst., 55(1):41–60, 2014.
[11] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind. String matching with variable length gaps. Theor. Comput. Sci., 443, 2012. Announced at SPIRE 2010.
[12] Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V Thankachan. Ranked document retrieval for multiple patterns. Theor. Comput. Sci., 746:98–111, 2018.
[13] P Bucher and A Bairoch. A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Proc. 2nd ISMB, pages 53–61, 1994.
[14] Manuel Cáceres, Simon J Puglisi, and Bella Zhukova. Fast indexes for gapped pattern matching. In Proc. 46th SOFSEM, pages 493–504, 2020.
[15] Hagai Cohen and Ely Porat. Fast set intersection and two-patterns matching. Theor. Comput. Sci., 411(40-42):3795–3800, 2010.
[16] Paolo Ferragina, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Two-dimensional substring indexing. J. Comput. Syst. Sci., 66(4):763–774, 2003.
[17] Greg N Frederickson. Ambivalent data structures for dynamic 2-edge-connectivity and k smallest spanning trees. SIAM J. Comput., 26(2):484–538, 1997.
[18] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984.
[19] Kimmo Fredriksson and Szymon Grabowski. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance.
Inf. Retr., 11(4):335–357, 2008.
[20] Isaac Goldstein, Tsvi Kopelowitz, Moshe Lewenstein, and Ely Porat. Conditional Lower Bounds for Space/Time Tradeoffs. In Proc. 15th WADS, pages 421–436, 2017.
[21] Tuukka Haapasalo, Panu Silvasti, Seppo Sippu, and Eljas Soisalon-Soininen. Online dictionary matching with variable-length gaps. In Proc. 10th SEA, pages 76–87, 2011.
[22] K Hofmann, P Bucher, L Falquet, and A Bairoch. The PROSITE database, its status in 1999. Nucleic Acids Res., 27(1):215–219, 1999.
[23] Wing-Kai Hon, Manish Patil, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Indexes for document retrieval with relevance. In Space-Efficient Data Structures, Streams, and Algorithms - Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, pages 351–362, 2013.
[24] Wing-Kai Hon, Manish Patil, Rahul Shah, and Shih-Bin Wu. Efficient index for retrieving top-k most frequent documents. J. Discrete Algorithms, 8(4):402–417, 2010.
[25] Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Space-efficient frameworks for top-k string retrieval. J. ACM, 61(2):1–36, 2014. Announced at 50th FOCS.
[26] Wing-Kai Hon, Sharma V. Thankachan, Rahul Shah, and Jeffrey Scott Vitter. Faster compressed top-k document retrieval. In
Proc. 23rd DCC, pages 341–350, 2013.
[27] Costas S Iliopoulos and M Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60–70, 2009.
[28] Orgad Keller, Tsvi Kopelowitz, and Moshe Lewenstein. Range non-overlapping indexing and successive list indexing. In Proc. 11th WADS, pages 625–636, 2007.
[29] Tsvi Kopelowitz, Seth Pettie, and Ely Porat. Higher lower bounds from the 3SUM conjecture. In Proc. 27th SODA, pages 1272–1287, 2016.
[30] Kasper Green Larsen, J Ian Munro, Jesper Sindahl Nielsen, and Sharma V Thankachan. On hardness of several string indexing problems. Theor. Comput. Sci., 582:74–82, 2015.
[31] Moshe Lewenstein. Indexing with gaps. In Proc. 18th SPIRE, pages 135–143, 2011.
[32] Gerhard Mehldau and Gene Myers. A system for pattern matching applications on biosequences. Bioinformatics, 9(3):299–314, 1993.
[33] J. Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, and Sharma V. Thankachan. Top-k term-proximity in succinct space.
Algorithmica, 78(2):379–393, 2017. Announced at 25th ISAAC.
[34] J. Ian Munro, Gonzalo Navarro, Rahul Shah, and Sharma V. Thankachan. Ranked document selection. Theor. Comput. Sci., 812:149–159, 2020.
[35] Eugene W. Myers. Approximate matching of network expressions with spacers. J. Comput. Bio., 3(1):33–51, 1992.
[36] Gonzalo Navarro. Spaces, trees, and colors: The algorithmic landscape of document retrieval on sequences. ACM Comput. Surv., 46(4):1–47, 2014.
[37] Gonzalo Navarro and Yakov Nekrich. Time-optimal top-k document retrieval. SIAM J. Comput., 46(1):80–113, 2017. Announced at 23rd SODA.
[38] Gonzalo Navarro and Mathieu Raffinot. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Bio., 10(6):903–923, 2003.
[39] Gonzalo Navarro and Sharma V. Thankachan. New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci., 542:83–97, 2014. Announced at 20th SPIRE.
[40] Gonzalo Navarro and Sharma V. Thankachan. Reporting consecutive substring occurrences under bounded gap constraints.
Theor. Comput. Sci., 638:108–111, 2016. Announced at 26th CPM.
[41] Yakov Nekrich and Gonzalo Navarro. Sorted range reporting. In Proc. 13th SWAT, pages 271–282, 2012.
[42] Rahul Shah, Cheng Sheng, Sharma V. Thankachan, and Jeffrey Scott Vitter. Top-k document retrieval in external memory. In Proc. 21st ESA, pages 803–814, 2013.
[43] Dekel Tsur. Top-k document retrieval in optimal space. Inf. Process. Lett., 113(12):440–443, 2013.
[44] Peter Weiner. Linear pattern matching algorithms. In Proc. 14th FOCS, pages 1–11, 1973.
[45] Gelin Zhou. Two-dimensional range successor in optimal time and almost linear space.