Gapped Indexing for Consecutive Occurrences
Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Teresa Anna Steiner
Abstract
The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). In this paper we consider a variant of string indexing, where the goal is to compactly represent the string such that given two patterns P₁ and P₂ and a gap range [α, β] we can quickly find the consecutive occurrences of P₁ and P₂ with distance in [α, β], i.e., pairs of occurrences immediately following each other and with distance within the range. We present data structures that use Õ(n) space and query time Õ(|P₁| + |P₂| + n^{2/3}) for existence and counting and Õ(|P₁| + |P₂| + n^{2/3} occ^{1/3}) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using Õ(n) space must use Ω̃(|P₁| + |P₂| + √n) query time. To obtain our results we develop new techniques and ideas of independent interest, including a new suffix tree decomposition and hardness of a variant of the set intersection problem.

1 Introduction

The classic string indexing problem is to preprocess a string S into a compact data structure that supports efficient pattern matching queries. Typical queries include existential queries (decide if the pattern occurs in S), reporting queries (return all positions where the pattern occurs), and counting queries (return the number of occurrences of the pattern). An important variant of this problem is the gapped string indexing problem [6, 8, 10, 14, 27, 28, 31]. Here, the goal is to compactly represent the string such that given two patterns P₁ and P₂ and a gap range [α, β] we can quickly find occurrences of P₁ and P₂ with distance in [α, β].
Searching and indexing with gaps is frequently used in computational biology applications [6, 11, 13, 14, 19, 21, 22, 32, 35, 38]. Another variant is string indexing for consecutive occurrences [9, 40]. Here, the goal is to compactly represent the string such that given a pattern P and a gap range [α, β] we can quickly find consecutive occurrences of P with distance in [α, β], i.e., pairs of occurrences immediately following each other and with distance within the range.

In this paper, we consider the natural combination of these variants, which we call gapped indexing for consecutive occurrences. Here, the goal is to compactly represent the string such that given two patterns P₁ and P₂ and a gap range [α, β] we can quickly find the consecutive occurrences of P₁ and P₂ with distance in [α, β].

We can apply standard techniques to obtain several simple solutions to the problem. To state the bounds, let n be the size of S. If we store the suffix tree for S, we can answer queries by searching for both query strings, merging the results, and removing all non-consecutive occurrences. This leads to a solution using O(n) space and Õ(|P₁| + |P₂| + occ_{P₁} + occ_{P₂}) query time, where occ_{P₁} and occ_{P₂} denote the number of occurrences of P₁ and P₂, respectively. (Here Õ and Ω̃ ignore polylogarithmic factors.) However, occ_{P₁} + occ_{P₂} may be as large as Ω(n) and much larger than the size of the output.

Alternatively, we can obtain a fast query time in terms of the output at the cost of increasing the space to Ω(n²). To do so, store for each node v in the suffix tree the set of all consecutive occurrences (i, j) where i is a position below v, in a 2D range searching data structure organized by the lexicographic order of j and the distance j − i. To answer a query, we then perform a 2D range search in the structure corresponding to P₁ using the lexicographic range in the suffix tree defined by P₂ and the gap range.
This leads to a solution for reporting queries using Õ(n²) space and Õ(|P₁| + |P₂| + occ) time, where occ is the size of the output. For existence and counting, we obtain the same bound without the occ term.

In this paper, we introduce new solutions that significantly improve the above time-space trade-offs. Specifically, we present data structures that use Õ(n) space and query time Õ(|P₁| + |P₂| + n^{2/3}) for existence and counting and Õ(|P₁| + |P₂| + n^{2/3} occ^{1/3}) for reporting. We complement this with a conditional lower bound based on the set intersection problem showing that any solution using Õ(n) space must use Ω̃(|P₁| + |P₂| + √n) query time. To obtain our results we develop new techniques and ideas of independent interest, including a new suffix tree decomposition and hardness of a variant of the set intersection problem.

Throughout the paper, let S be a string of length n. Given two patterns P₁ and P₂, a consecutive occurrence in S is a pair of occurrences (i, j), 0 ≤ i < j < |S|, where i is an occurrence of P₁ and j is an occurrence of P₂, such that no other occurrence of either P₁ or P₂ occurs in between. The distance of a consecutive occurrence (i, j) is j − i. Our goal is to preprocess S into a compact data structure that given pattern strings P₁ and P₂ and a gap range [α, β] supports the following queries:

• Exists(P₁, P₂, α, β): determine if there is a consecutive occurrence of P₁ and P₂ with distance within the range [α, β].

• Count(P₁, P₂, α, β): return the number of consecutive occurrences of P₁ and P₂ with distance within the range [α, β].

• Report(P₁, P₂, α, β): report all consecutive occurrences of P₁ and P₂ with distance within the range [α, β].

We present new data structures with the following bounds:

Theorem 1.
Given a string of length n, we can

(i) construct an O(n) space data structure that supports Exists(P₁, P₂, α, β) and Count(P₁, P₂, α, β) queries in O(|P₁| + |P₂| + n^{2/3} log^ε n) time for constant ε > 0, or

(ii) construct an O(n log n) space data structure that supports Report(P₁, P₂, α, β) queries in O(|P₁| + |P₂| + n^{2/3} occ^{1/3} log n log log n) time, where occ is the size of the output.

Hence, ignoring polylogarithmic factors, Theorem 1 achieves Õ(n) space and query time Õ(|P₁| + |P₂| + n^{2/3}) for existence and counting and Õ(|P₁| + |P₂| + n^{2/3} occ^{1/3}) for reporting. Compared to the above mentioned simple suffix tree approach that finds all occurrences of the query strings and merges them, we match the Õ(n) space while reducing the dependency on n in the query time from worst-case Ω(|P₁| + |P₂| + n) to Õ(|P₁| + |P₂| + n^{2/3}) for Exists and
Count queries and Õ(|P₁| + |P₂| + n^{2/3} occ^{1/3}) for Report queries.

We complement Theorem 1 with a conditional lower bound based on the set intersection problem. Specifically, we use the Strong SetDisjointness Conjecture from [20] to obtain the following result:
Theorem 2.
Assuming the Strong SetDisjointness Conjecture, any data structure on a string S of length n that supports Exists queries in O(n^δ + |P₁| + |P₂|) time, for δ ∈ [0, 1/2], requires Ω̃(n^{2−2δ−o(1)}) space. This bound also holds if we limit the queries to only support ranges of the form [0, β], and even if the bound β is known at preprocessing time.

With δ = 1/2, Theorem 2 implies that any near linear space solution must have query time Ω̃(|P₁| + |P₂| + √n). Thus, Theorem 1 is optimal within a factor roughly n^{1/6}. On the other hand, with δ = 0, Theorem 2 implies that any solution with optimal Õ(|P₁| + |P₂|) query time must use Ω̃(n^{2−o(1)}) space. Note that this matches the trade-off achieved by the above mentioned simple solution that combines suffix trees with two-dimensional range searching data structures.

Finally, note that Theorem 2 holds even when the gap range is of the form [0, β]. As a simple extension of our techniques, we show how to improve our solution from Theorem 1 to match Theorem 2 in this special case.

To obtain our results we develop new techniques and show new interesting properties of consecutive occurrences. We first consider
Exists and Count queries. The key idea is to split gap ranges into large and small distances. For large distances there can only be a limited number of consecutive occurrences, and we show how these can be efficiently handled using a segmentation of the string. For small distances, we cluster the suffix tree and store precomputed answers for selected pairs of nodes. Since the number of distinct distances is small, we obtain an efficient bound on the space.

We extend our solution for Exists and Count queries to handle Report queries. To do so, we develop a new decomposition of suffix trees, called the induced suffix tree decomposition, that recursively divides the suffix tree in half by index in the string. Hence, the decomposition is a balanced binary tree, where every node stores the suffix tree of a substring of S. We show how to traverse this structure to efficiently recover the consecutive occurrences.

For our conditional lower bound we show a reduction based on the set intersection problem. Along the way we show that set intersection remains hard even if all elements in the instance have the same frequency.

As mentioned, string indexing for gaps and consecutive occurrences are the most closely related lines of work to this paper. Another related area is document indexing, where the goal is to preprocess a collection of strings, called documents, to report those documents that contain patterns subject to various constraints. For a comprehensive overview of this area see the survey by Navarro [36].

A well studied line of work within document indexing is document indexing for top-k queries [12, 23, 24, 25, 26, 33, 34, 37, 39, 42, 43]. The goal is to efficiently report the top-k documents of smallest weight, where the weight is a function of the query. Specifically, the weight can be the distance of a pair of occurrences of the same or two different query patterns [25, 33, 37, 42]. The techniques for top-k indexing (see e.g. Hon et al. [25]) can be adapted to efficiently solve gapped indexing for consecutive occurrences in the special case when the gap range is of the form [0, β]. However, since these techniques heavily exploit that the goal is to find the top-k closest occurrences, they do not generalize to general gap ranges.

There are several results on conditional lower bounds for pattern matching and string indexing [4, 5, 20, 29, 30]. Notably, Ferragina et al.
[16] and Cohen and Porat [15] reduce the two dimensional substring indexing problem to set intersection (though the goal was to prove an upper, not a lower, bound). In the two dimensional substring indexing problem the goal is to preprocess pairs of strings such that given two patterns we can output the pairs that contain one pattern each. Larsen et al. [30] prove a conditional lower bound for the document version of indexing for two patterns, i.e., finding all documents containing both of two query patterns. Goldstein et al. [20] show that similar lower bounds can be achieved via conjectured hardness of set intersection. Thus, there are several results linking indexing for two patterns and set intersection. Our reduction is still quite different, since we need a translation from intersection to distance.

The paper is organized as follows. In Section 2 we define notation and recall some useful results. In Section 3 we show how to answer Exists and Count queries, proving Theorem 1(i). In Section 4 we show how to answer Report queries, proving Theorem 1(ii). In Section 5 we prove the lower bound, proving Theorem 2. Finally, in Section 6 we apply our techniques to solve the variant where α = 0.

2 Preliminaries
Strings. A string S of length n is a sequence S[0]S[1]...S[n − 1] of characters from an alphabet Σ. A contiguous subsequence S[i, j] = S[i]S[i + 1]...S[j] is a substring of S. The substrings of the form S[i, n − 1] are the suffixes of S. The suffix tree [44] is a compact trie of all suffixes of S$, where $ is a symbol not in the alphabet that is lexicographically smaller than any letter in the alphabet. Each leaf is labelled with the index i of the suffix S[i, n − 1] it corresponds to. Using perfect hashing [18], the suffix tree can be stored in O(n) space and solve the string indexing problem (i.e., find and report all occurrences of a pattern P) in O(m + occ) time, where m is the length of P and occ is the number of times P occurs in S.

For any node v in the suffix tree, we define str(v) to be the string found by concatenating all labels on the path from the root to v. The locus of a string P, denoted locus(P), is the minimum depth node v such that P is a prefix of str(v). The suffix array stores the suffix indices of S$ in lexicographic order. We identify each leaf in the suffix tree with the suffix index it represents. The suffix tree has the property that the leaves below any node represent suffixes that appear in consecutive order in the suffix array. For any node v in the suffix tree, range(v) denotes the range that v spans in the suffix array. The inverse suffix array is the inverse permutation of the suffix array, that is, an array where the ith element is the index of suffix i in the suffix array.

Orthogonal range successor.
The orthogonal range successor problem is to preprocess an array A[0, ..., n − 1] into a data structure that efficiently supports the following queries:

• RangeSuccessor(a, b, x): return the successor of x in A[a, ..., b], that is, the minimum y > x such that there is an i ∈ [a, b] with A[i] = y.

• RangePredecessor(a, b, x): return the predecessor of x in A[a, ..., b], that is, the maximum y < x such that there is an i ∈ [a, b] with A[i] = y.

3 Existence and Counting

In this section we give a data structure that can answer Exists and Count queries. The main idea is to split the query interval into "large" and "small" distances. For large distances we exploit that there can only be a small number of consecutive occurrences, and we check them with a simple segmentation of S. For small distances we cluster the suffix tree and precompute answers for selected pairs of nodes.

We first show how to use orthogonal range successor queries to find consecutive occurrences. Then we define the clustering scheme used for the suffix tree and give the complete data structure.

3.1 Finding Consecutive Occurrences

Assume we have found the loci of P₁ and P₂ in the suffix tree. Then we can answer the following queries in a constant number of orthogonal range successor queries:

• FindConsecutive_{P₁}(i): given an occurrence i of P₁, return the consecutive occurrence (i, j) of P₁ and P₂, if it exists, and No otherwise.

• FindConsecutive_{P₂}(j): given an occurrence j of P₂, return the consecutive occurrence (i, j) of P₁ and P₂, if it exists, and No otherwise.

Given a query FindConsecutive_{P₁}(i), we answer as follows. Compute j = RangeSuccessor(range(locus(P₂)), i) to get the closest occurrence of P₂ after i. Compute i′ = RangePredecessor(range(locus(P₁)), j) to get the closest occurrence of P₁ before j. If i = i′ then no other occurrence of P₁ exists between i and j and they are consecutive. In that case we return (i, j). Otherwise, we return No. Similarly, we can answer FindConsecutive_{P₂}(j) by first doing a RangePredecessor and then a RangeSuccessor query. Thus, given the loci of both patterns and a specific occurrence of either P₁ or P₂, we can in a constant number of RangeSuccessor and RangePredecessor queries find the corresponding consecutive occurrence, if it exists.
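The procedure above can be sketched in a few lines. The following Python sketch is illustrative only: the naive suffix array construction and the linear-scan range queries stand in for the efficient structures from the preliminaries, and the helper names are ours, not the paper's.

```python
def suffix_array(s):
    # Naive construction: sort suffix start positions by the suffix text.
    return sorted(range(len(s)), key=lambda i: s[i:])

def range_successor(A, a, b, x):
    # Minimum y > x among A[a..b]; None if no such value exists.
    ys = [y for y in A[a:b + 1] if y > x]
    return min(ys) if ys else None

def range_predecessor(A, a, b, x):
    # Maximum y < x among A[a..b]; None if no such value exists.
    ys = [y for y in A[a:b + 1] if y < x]
    return max(ys) if ys else None

def find_consecutive_p1(A, r1, r2, i):
    # FindConsecutive_{P1}(i): given an occurrence i of P1, return the
    # consecutive occurrence (i, j) of P1 and P2, or None. Here A is the
    # suffix array, and r1, r2 are the inclusive suffix array ranges
    # range(locus(P1)) and range(locus(P2)).
    j = range_successor(A, r2[0], r2[1], i)          # closest P2 occurrence after i
    if j is None:
        return None
    i_prime = range_predecessor(A, r1[0], r1[1], j)  # closest P1 occurrence before j
    return (i, j) if i_prime == i else None
```

For example, for S = "abab" we get A = suffix_array("abab") = [2, 0, 3, 1]; with P₁ = "a" (suffix array range [0, 1]) and P₂ = "b" (range [2, 3]), find_consecutive_p1(A, (0, 1), (2, 3), 0) returns the consecutive occurrence (0, 1).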
3.2 Data Structure

To build the data structure we will use a cluster decomposition of the suffix tree.
Cluster Decomposition
A cluster decomposition of a tree T is defined as follows: For a connected subgraph C ⊆ T, a boundary node is a node v ∈ C such that either v is the root of T, or v has an edge leaving C, that is, there exists an edge (v, u) in the tree T such that u ∈ T \ C. A cluster is a connected subgraph C of T with at most two boundary nodes. A cluster with one boundary node is called a leaf cluster. A cluster with two boundary nodes is called a path cluster. For a path cluster C, the two boundary nodes are connected by a unique path. We call this path the spine of C. A cluster partition is a partition of T into clusters, i.e., a set CP of clusters such that ∪_{C ∈ CP} V(C) = V(T), ∪_{C ∈ CP} E(C) = E(T), and no two clusters in CP share any edges. Here, E(G) and V(G) denote the edge and vertex set of a (sub)graph G, respectively. We need the next lemma, which follows from well-known tree decompositions [1, 2, 3, 17] (see Bille and Gørtz [7] for a direct proof).

Lemma 3.
Given a tree T with n nodes and a parameter τ, there exists a cluster partition CP such that |CP| = O(n/τ) and every C ∈ CP has at most τ nodes. Furthermore, such a partition can be computed in O(n) time.

Data Structure
We build a clustering of the suffix tree of S as in Lemma 3, with cluster size at most τ, where τ is some parameter satisfying 0 < τ ≤ n. Then the counting data structure consists of:

• The suffix tree of S, with some additional information for each node. For each node v we store:
  – The range v spans in the suffix array, i.e., range(v).
  – A bit that indicates if v is on a spine.
  – If v is on a spine, a pointer to the lower boundary node of the spine.
  – If v is a leaf, the local rank of v, that is, the rank of v in the text order of the leaves in the cluster that contains v. Note that this is at most τ.

• The inverse suffix array of S.

• A range successor data structure on the suffix array of S.

• An array M(u, v) of length ⌊n/τ⌋ + 1 for every pair of boundary nodes (u, v). For 1 ≤ x ≤ ⌊n/τ⌋, M(u, v)[x] is the number of consecutive occurrences (i, j) of str(u) and str(v) with distance at most x. We set M(u, v)[0] = 0.

Denote M(u, v)[α, β] = M(u, v)[β] − M(u, v)[α − 1]. Then M(u, v)[α, β] is the number of consecutive occurrences of str(u) and str(v) with a distance in [α, β].

Space Analysis.
We store a constant amount per node in the suffix tree. The suffix tree and inverse suffix array occupy O(n) space. For the orthogonal range successor data structure we use the data structure of Nekrich and Navarro [41], which uses O(n) space and O(log^ε n) query time, for constant ε > 0. There are O(n²/τ²) pairs of boundary nodes and for each pair we store an array of length O(n/τ). Therefore the total space consumption is O(n + n³/τ³).

3.3 Query Algorithm

We now show how to count the consecutive occurrences (i, j) with a distance in the interval, i.e., α ≤ j − i ≤ β. We call each such pair a valid occurrence. To answer a query we split the query interval [α, β] into two: [α, ⌊n/τ⌋] and [⌊n/τ⌋ + 1, β], and handle these separately.

Distances larger than ⌊n/τ⌋. We start by finding the loci of P₁ and P₂ in the suffix tree. As shown in Section 3.1, this allows us to find the consecutive occurrence containing a given occurrence of either P₁ or P₂. We implicitly partition the string S into segments of (at most) ⌊n/τ⌋ characters by calculating τ segment boundaries. Segment i, for 0 ≤ i < τ, contains the characters S[i · ⌊n/τ⌋, (i + 1) · ⌊n/τ⌋ − 1], and segment τ (if it exists) contains the characters S[τ · ⌊n/τ⌋, n − 1]. We find the last occurrence of P₁ in each segment by performing a series of RangePredecessor queries, starting from the beginning of the last segment. Each time an occurrence i is found, we perform the next query from the segment boundary to the left of i, continuing until the start of the string is reached. For each occurrence i of P₁ found in this way, we use FindConsecutive_{P₁}(i) to find the consecutive occurrence (i, j) if it exists. We check each of them, discard any with distance ≤ ⌊n/τ⌋, and count how many are valid.

Distances at most ⌊n/τ⌋. In this part, we only count valid occurrences with distance ≤ ⌊n/τ⌋. Consider the loci of P₁ and P₂ in the suffix tree. Let C_i denote the cluster that contains locus(P_i) for i = 1, 2. There are two main cases.
At least one locus is not on a spine
If either locus is in a small subtree hanging off a spine in a cluster or in a leaf cluster, we directly find all consecutive occurrences as follows: If locus(P₁) is in a small subtree, then we use FindConsecutive_{P₁}(i) on each leaf i below locus(P₁) to find all consecutive occurrences, count the valid occurrences, and terminate. If only locus(P₂) is in a small subtree, then we use FindConsecutive_{P₂}(j) for each leaf j below locus(P₂), count the valid occurrences, and terminate.

Both loci are on the spine
If neither locus is in a small subtree, then both are on a spine. Let b₁ and b₂ denote the lower boundary nodes of the clusters C₁ and C₂, respectively. There are two types of consecutive occurrences (i, j):

(i) Occurrences where i or j is inside C₁ resp. C₂.

(ii) Occurrences below the boundary nodes, that is, i is below b₁ and j is below b₂.

See Figure 1(a). We describe how to count the different types of occurrences next.

Type (i) occurrences
To find the valid occurrences (i, j) where either i ∈ C₁ or j ∈ C₂ we do as follows. First we find all the consecutive occurrences (i, j) where i is a leaf in C₁ by computing FindConsecutive_{P₁}(i) for all leaves i below locus(P₁) in C₁. We count all valid occurrences we find in this way. Then we find all remaining consecutive occurrences (i, j) where j is a leaf in C₂ by computing FindConsecutive_{P₂}(j) for all leaves j below locus(P₂) in C₂. If FindConsecutive_{P₂}(j) returns a valid occurrence (i, j), we use the inverse suffix array to check if the leaf i is below b₁. This can be done by checking whether i's position in the suffix array is in range(b₁). If i is below b₁ we count the occurrence, otherwise we discard it.

Type (ii) occurrences
Next, we count the consecutive occurrences (i, j) where both i and j are below b₁ and b₂, respectively. We will use the precomputed table, but we have to be careful not to overcount. By its construction, M(b₁, b₂)[α, min(⌊n/τ⌋, β)] is the number of consecutive occurrences (i′, j′) of str(b₁) and str(b₂) where α ≤ j′ − i′ ≤ min(⌊n/τ⌋, β). However, not all of these occurrences (i′, j′) are necessarily consecutive occurrences of P₁ and P₂, as there could be an occurrence of P₁ in C₁ or P₂ in C₂ which is between i′ and j′. We call such a pair (i′, j′) a false occurrence. See Figure 1(b). We proceed as follows.

1. Set c = M(b₁, b₂)[α, min(⌊n/τ⌋, β)].

2. Construct the lists L_i containing the leaves in C_i that are below locus(P_i), sorted by text order, for i = 1, 2. We can obtain the lists as follows. Let [a, b] be the range of locus(P_i) and [a′, b′] = range(b_i). Sort the leaves in [a, a′ − 1] ∪ [b′ + 1, b] using their local rank.

3. Until both lists are empty, iteratively pick and remove the smallest element e from the start of either list. There are two cases.

• e is an element of L₁.
  – Compute j′ = RangeSuccessor(range(b₂), e) to get the closest occurrence of str(b₂) after e.
  – Compute i′ = RangePredecessor(range(b₁), j′) to get the closest occurrence of str(b₁) before j′.

• e is an element of L₂.
  – Compute i′ = RangePredecessor(range(b₁), e) to get the previous occurrence i′ of str(b₁).
  – Compute j′ = RangeSuccessor(range(b₂), i′) to get the following occurrence j′ of str(b₂).

If α ≤ j′ − i′ ≤ min(⌊n/τ⌋, β) and i′ < e < j′, decrement c by one. We skip any subsequent occurrences that are also inside (i′, j′). As the lists are sorted by text order, all occurrences that are within the same consecutive occurrence (i′, j′) are handled in sequence.

Finally, we add the counts of the different types of occurrences.

Figure 1: (a) Any consecutive occurrence (i, j) of P₁ and P₂ is either also a consecutive occurrence of str(b₁) and str(b₂), or i or j is within the cluster. The suffix array is shown at the bottom with the corresponding ranges marked. (b) Example of a false occurrence. Here (i′, j′) is a consecutive occurrence of str(b₁) and str(b₂), but not a consecutive occurrence of P₁ and P₂ due to i. The string S is shown at the bottom with the positions of the occurrences marked.

Correctness.
Consider a consecutive occurrence (i, j) where j − i > ⌊n/τ⌋. Such a pair must span a segment boundary, i.e., i and j cannot be in the same segment. As (i, j) is a consecutive occurrence, i is the last occurrence of P₁ in its segment and j is the first occurrence of P₂ in its segment. With the RangePredecessor queries we find exactly the occurrences of P₁ that are the last in their segment. We thus check and count all valid occurrences of large distance in the initial pass of the segments.

If either locus is in a small subtree we use FindConsecutive_{P₁}(·) or FindConsecutive_{P₂}(·) on the leaves below that locus, which by the arguments in Section 3.1 will find all consecutive occurrences.

Otherwise, both loci are on a spine. To count type (i) occurrences we use FindConsecutive_{P₁}(i) for all leaves i below locus(P₁) in C₁ and FindConsecutive_{P₂}(j) for all leaves j below locus(P₂) in C₂. However, any valid occurrence (i, j) where both i ∈ C₁ and j ∈ C₂ is found by both operations. Therefore, whenever we find a valid occurrence (i, j) via i = FindConsecutive_{P₂}(j) for j ∈ C₂, we only count the occurrence if i is below b₁. Thus we count all type (i) occurrences exactly once.

To count type (ii) occurrences we start with c = M(b₁, b₂)[α, min(⌊n/τ⌋, β)], which is the number of consecutive occurrences (i′, j′) of str(b₁) and str(b₂) where α ≤ j′ − i′ ≤ min(⌊n/τ⌋, β). Each (i′, j′) is either also a consecutive occurrence of P₁ and P₂, or there exists an occurrence of P₁ or P₂ between i′ and j′. Let (i′, j′) be a false occurrence and let w.l.o.g. i be an occurrence of P₁ with i′ < i < j′. Then i is a leaf in C₁, since (i′, j′) is a consecutive occurrence of str(b₁) and str(b₂). In step 3 we check for each leaf inside the clusters below the loci whether it is between a consecutive occurrence (i′, j′) of str(b₁) and str(b₂) and whether α ≤ j′ − i′ ≤ min(⌊n/τ⌋, β). In that case (i′, j′) is a false occurrence and we adjust the count c. As (i′, j′) can have multiple occurrences of P₁ and P₂ inside it, we skip subsequent occurrences inside (i′, j′). After adjusting for false occurrences, c is the number of type (ii) occurrences.

Time Analysis.
We find the loci in O(|P₁| + |P₂|) time. Then we perform a number of range successor and find consecutive queries. The time for a find consecutive query is bounded by the time to do a constant number of range successor queries. To count the large distances we check at most τ segment boundaries and thus perform O(τ) range successor and find consecutive queries. If either locus is not on a spine, we check the leaves below that locus. There are at most τ such leaves due to the clustering. To count type (i) occurrences we check the leaves below the loci and inside the clusters. There are at most 2τ such leaves in total. To count type (ii) occurrences we check two lists constructed from the leaves inside the clusters below the loci. There are again at most 2τ such leaves in total. For each of these O(τ) leaves we use a constant number of range successor and find consecutive queries. Thus the time for this part is bounded by the time to perform O(τ) range successor queries.

Using the data structure of Nekrich and Navarro [41], each range successor query takes O(log^ε n) time, so the total time for these queries is O(τ log^ε n). For type (ii) occurrences we sort two lists of size at most τ from a universe of size τ, which we can do in O(τ) time. Thus, the total query time is O(|P₁| + |P₂| + τ log^ε n). Setting τ = Θ(n^{2/3}) we get a data structure that uses O(n + n³/τ³) = O(n) space and has query time O(|P₁| + |P₂| + τ log^ε n) = O(|P₁| + |P₂| + n^{2/3} log^ε n), for constant ε > 0. Given an Exists query, we answer with a Count query, terminating when the first valid occurrence is found. This concludes the proof of Theorem 1(i).
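To illustrate the counting arrays M(u, v) used above, here is a minimal Python sketch. It assumes the distances of the consecutive occurrences of str(u) and str(v) have already been computed, which the preprocessing does for every pair of boundary nodes; the function names are ours.

```python
import bisect

def build_count_array(distances, limit):
    # M[x] = number of consecutive occurrences with distance <= x,
    # for 0 <= x <= limit; M[0] = 0 since all distances are positive.
    ds = sorted(distances)
    return [bisect.bisect_right(ds, x) for x in range(limit + 1)]

def count_range(M, alpha, beta):
    # M[alpha, beta] = M[beta] - M[alpha - 1]: the number of consecutive
    # occurrences with distance in [alpha, beta], assuming beta <= limit.
    prev = M[alpha - 1] if alpha > 0 else 0
    return M[beta] - prev
```

For example, for distances [1, 3, 3, 7] and limit 8, count_range returns 2 for the range [2, 3] and 1 for the range [4, 8]. The point of precomputing the prefix counts is that each range query then costs constant time.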
4 Reporting

In this section, we describe our data structure for reporting queries. Note that in Section 3, we explicitly find all valid occurrences except for type (ii) occurrences, where we use the precomputed values. In this section, we describe how we can use a recursive scheme to report these. The main idea, inspired by fast set intersection by Cohen and Porat [15], is to build a binary structure which allows us to recursively divide into subproblems of half the size. Intuitively, the subdivision is a binary tree where every node contains the suffix tree of a substring of S. We use this structure to find type (ii) occurrences by recursing on smaller trees. We define the binary decomposition of the suffix tree next. The details of the full solution follow after that.

Figure 2: The suffix tree of NANANANABATMAN$ together with its children trees T[0, 7] and T[8, 14].

4.1 The Induced Suffix Tree Decomposition

Let T be a suffix tree of a string S of length n. For an interval [a, b] of text positions, we define T[a, b] to be the subtree of T induced by the leaves in [a, b]: That is, we consider the subtree consisting of the leaves in [a, b] together with their ancestors. We then delete each node that has only one child in the subtree and contract its ingoing and outgoing edges. See Figure 2.

The induced suffix tree decomposition of T now consists of a higher level binary tree structure, the decomposition tree, where each node corresponds to an induced subtree of the suffix tree:

• The root of the decomposition tree corresponds to T[0, n − 1] and has level 0.

• For each T[a, b] of level i in the decomposition, if b − a > 1, its two children in the decomposition tree are T[a, c] and T[c + 1, b] where c = ⌊(a + b)/2⌋; we will sometimes refer to these as "children trees" to differentiate from children in the suffix tree.

The decomposition tree is a balanced binary tree and the total size of the induced subtrees in the decomposition is O(n log n): There are at most 2^i decomposition tree nodes on level i, each of which corresponds to an induced subtree of size O(n/2^i), and thus the total size of the trees on each of the O(log n) levels is O(n).

For each node v in T[a, b], we define the successor node of v in each of the children trees of T[a, b] in the following way: If v exists in the child tree, the successor node is v. Else, it is the closest descendant of v which is present. Note that from the way the induced subtrees are constructed, v has at most one successor node in each child tree.

The induced suffix tree decomposition of S consists of:

• Each T[a, b] stored as a compact trie.

• For each T[a, b], a "cropped" suffix array SA_{[a,b]}, that is, the suffix array of S[a, b] with the original indices within S.

• For each node v in T[a, b], a pointer from v to its successor node in each child tree, if it exists, and the interval in SA_{[a,b]} that corresponds to the leaves below v.

Since we store only constant information per node in any T[a, b], the total space usage of this is O(n log n).

4.2 Data Structure

The reporting data structure consists of:

• The induced suffix tree decomposition for S,

• An orthogonal range successor data structure on the suffix array, and

• The data structure from Section 3 for each T[a, b] in the induced suffix tree decomposition with parameters n_i and τ_i, where n_i = ⌊n/2^i⌋ and τ_i = Θ(n_i^{2/3}), such that n_i/τ_i = ⌊n_i^{1/3}⌋. The only change is that we do not store an orthogonal range successor data structure for each of the induced subtrees.

Space Analysis.
We use the O(n log log n) space and O(log log n) time orthogonal range successor structure of Zhou [45]. The existence data structure for each T[a, b] of level i is linear in n_i. Thus, by the arguments of Section 4.1, the total space is O(n log n).

The main idea behind the algorithm is the following: for large distances, as in Section 3, we implicitly segment S to find all consecutive occurrences of at least a certain distance. For small distances, we use the cluster decomposition and the counting arrays to decide whether valid occurrences exist. That is, if one of the loci is in a small subtree, we use FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·) to find all consecutive occurrences. Otherwise, we perform a query as in Section 3 to decide whether any valid occurrences exist, and if so, we recurse on smaller subtrees.

The idea here is that, in the induced suffix tree decomposition, the trees are divided in half by text position. Therefore, a consecutive occurrence is either fully contained in the left child tree, fully contained in the right child tree, or has the property that the occurrence of P_1 is the maximum occurrence in the left child tree and the occurrence of P_2 is the minimum occurrence in the right child tree. We check this border case each time we recurse.

In detail, we do the following: We find the loci of P_1 and P_2 in the suffix tree. As in the previous section, we check τ segment boundaries with τ = Θ(n^{2/3}) to find all consecutive occurrences with distance within [max(α, ⌊n^{1/3}⌋), β].
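The segment-boundary step just described can be illustrated with a small self-contained simulation, in which binary searches over sorted occurrence lists stand in for the orthogonal range successor structure. The function names and the brute-force checker are ours, positions are assumed distinct, and this is a sketch of the idea rather than the actual data structure:

```python
from bisect import bisect_left, bisect_right

def consecutive_pairs(occ1, occ2):
    """Brute force: pairs (i, j) with i an occurrence of P_1, j an occurrence
    of P_2, i < j, and no occurrence of either pattern strictly between."""
    events = sorted([(p, 1) for p in occ1] + [(p, 2) for p in occ2])
    return [(p, q) for (p, a), (q, b) in zip(events, events[1:])
            if a == 1 and b == 2]

def large_distance_pairs(occ1, occ2, n, tau, alpha, beta):
    """Find all consecutive pairs with distance in [max(alpha, n//tau), beta]
    by checking tau + 1 segment boundaries: any pair at distance >= n//tau
    straddles some boundary c, where a predecessor/successor query finds it."""
    occ1, occ2 = sorted(occ1), sorted(occ2)
    seg = max(1, n // tau)
    found = set()
    for c in range(0, n + seg, seg):       # the segment boundaries
        k = bisect_right(occ1, c) - 1      # RangePredecessor: last occ of P_1 <= c
        l = bisect_left(occ2, c + 1)       # RangeSuccessor: first occ of P_2 > c
        if k < 0 or l == len(occ2):
            continue
        i, j = occ1[k], occ2[l]
        # keep the pair only if it is consecutive, i.e. no occurrence of
        # either pattern lies strictly between i and j
        if bisect_left(occ1, j) - bisect_right(occ1, i) > 0:
            continue
        if bisect_left(occ2, j) - bisect_right(occ2, i) > 0:
            continue
        if max(alpha, seg) <= j - i <= beta:
            found.add((i, j))
    return found
```

For example, with occurrence lists [2, 9, 30] and [5, 25, 44], n = 48 and τ = 6 (segment length 8), the boundary checks recover exactly the consecutive pairs (9, 25) and (30, 44) of distance at least 8, while the pair (2, 5) is too close to straddle a boundary reliably and is left to the small-distance machinery.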
Now, we only have to find consecutive occurrences of distance within [α, min(β, ⌊n^{1/3}⌋)] in T = T[0, n − 1]. Let n_i = ⌊n/2^i⌋ and β_i = min(β, ⌊n_i^{1/3}⌋), and let T[a, b] be an induced subtree of level i.

To find all consecutive occurrences with distance within [α, β_i] in T[a, b] of level i, given the loci of P_1 and P_2 in T[a, b], recursively do the following:
• If any of the loci is not on a spine of a cluster, we find all consecutive occurrences using
FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·) and check for each of them whether it is valid; we report all such occurrences, then terminate.
• Else, we use the query algorithm for small distances from Section 3 to decide whether a valid occurrence with distance within [α, β_i] exists in T[a, b]. If such a valid occurrence exists, we recurse; that is, set c = ⌊(a + b)/2⌋. We use RangePredecessor to find the last occurrence of P_1 before and including c, and RangeSuccessor to find the first occurrence of P_2 after c. Then we check if they are consecutive (again using RangePredecessor and
RangeSuccessor), and whether the pair is a valid occurrence. If yes, we add it to the output. Then, for both S[a, c] and S[c + 1, b], we implicitly partition into segments of size ⌊n_{i+1}^{1/3}⌋ and find and output all valid occurrences of distance greater than n_{i+1}^{1/3}. Then we follow pointers to the successor nodes of the current loci to find the loci of P_1 and P_2 in the children trees T[a, c] and T[c + 1, b], and recurse on those trees to find all consecutive occurrences of distance within [α, β_{i+1}].

Correctness.
At any point before we recurse from level i to level i + 1, we check all consecutive occurrences of distance greater than n_{i+1}^{1/3} by segmenting the current substring of S. By the arguments of the previous section, we find all such valid occurrences. Thus, on the subtrees of level i + 1, we only need to care about consecutive occurrences with distance in [α, β_{i+1}].

By the properties of the induced suffix tree decomposition, a consecutive occurrence of P_1 and P_2 that is present in T[a, b] is either fully contained in T[a, c] or in T[c + 1, b], or the occurrence of P_1 is the last occurrence before and including c and the occurrence of P_2 is the first occurrence after c. We check this border case each time we recurse. Thus, no consecutive occurrence is lost when we recurse. If we stop the recursion, it is either because one of the loci was in a small subtree, or because no valid occurrence with distance within [α, β_i] exists in T[a, b]. In the first case, we have found all valid occurrences with distance within [α, β_i] in T[a, b] by the same arguments as in Section 3. Thus, we find all valid occurrences of P_1 and P_2.

Time Analysis.
For finding the loci, we first spend O(|P_1| + |P_2|) time in the initial suffix tree T[0, n − 1]; after that, the loci in the children trees are found in constant time by following the successor pointers. We only recurse on subtrees that contain at least one valid occurrence, so the traversed part of the decomposition tree, which has height O(log n), has at most O(occ) leaves. Thus, we traverse at most O(occ log n) nodes.

Each time we recurse, we spend a constant number of RangeSuccessor and
RangePredecessor queries to check the border cases. Additionally, we spend O(n_i^{2/3}) such queries on each node of level i that we visit in the decomposition tree: for finding the "large" occurrences, and additionally either for reporting everything within a small subtree or for doing an existence query. For finding large occurrences, there are O(n_i^{2/3}) segments to check. The number of orthogonal range successor queries used for existence queries or for reporting within a small subtree is bounded by the number of leaves within a cluster, which is also O(n_i^{2/3}).

Now, let x be the number of decomposition tree nodes we traverse, and let l_i, i = 1, . . . , x, be the level of each such node. The goal is to bound Σ_{i=1}^{x} (n/2^{l_i})^{2/3}. By the argument above, x = O(occ log n). Note that because the decomposition tree is binary, we have Σ_{i=1}^{x} 1/2^{l_i} ≤ log n. The number of queries to the orthogonal range successor data structure is thus asymptotically bounded by

  Σ_{i=1}^{x} (n/2^{l_i})^{2/3} = n^{2/3} Σ_{i=1}^{x} (1/2^{l_i})^{2/3} · 1
    ≤ n^{2/3} (Σ_{i=1}^{x} 1/2^{l_i})^{2/3} (Σ_{i=1}^{x} 1)^{1/3}
    = n^{2/3} (Σ_{i=1}^{x} 1/2^{l_i})^{2/3} x^{1/3} = O(n^{2/3} occ^{1/3} log n).

For the inequality, we use Hölder's inequality, which holds for all (x_1, . . . , x_k) ∈ R^k, (y_1, . . . , y_k) ∈ R^k and p, q both in (1, ∞) such that 1/p + 1/q = 1:

  Σ_{i=1}^{k} |x_i y_i| ≤ (Σ_{i=1}^{k} |x_i|^p)^{1/p} (Σ_{i=1}^{k} |y_i|^q)^{1/q}.   (1)

We apply (1) with p = 3/2 and q = 3.

Since the data structure of Zhou [45] uses O(log log n) time per query, the total running time of the algorithm is O(|P_1| + |P_2| + n^{2/3} occ^{1/3} log n log log n). This concludes the proof of Theorem 1(ii).

We now prove the conditional lower bound from Theorem 2, based on set intersection. We use the framework and conjectures as stated in Goldstein et al. [20].
Throughout the section, let I = S_1, S_2, . . . , S_m be a collection of m sets of total size N from a universe U. The SetDisjointness problem is to preprocess I into a compact data structure, such that given any pair of sets S_i and S_j, we can quickly determine whether S_i ∩ S_j = ∅. We use the following conjecture.

Conjecture 1 (Strong SetDisjointness Conjecture). Any data structure that can answer SetDisjointness queries in t query time must use Ω̃(N^2/t^2) space.

We define the following weaker variant of the SetDisjointness problem: the f-FrequencySetDisjointness problem is the SetDisjointness problem where every element occurs in precisely f sets. We now show that any solution to the f-FrequencySetDisjointness problem implies a solution to SetDisjointness, matching the complexities up to polylogarithmic factors.

Lemma 4.
Assuming the Strong SetDisjointness Conjecture, every data structure that can answer f-FrequencySetDisjointness queries in time O(N^δ), for δ ∈ [0, 1/2], must use Ω̃(N^{2−2δ−o(1)}) space.

Proof. Assume there is a data structure D solving the f-FrequencySetDisjointness problem in time O(N^δ) and space O(N^{2−2δ−ε}) for a constant ε with 0 < ε <
1. Let I = S_1, . . . , S_m be a given instance of SetDisjointness, where each S_i is a set of elements from the universe U, and assume w.l.o.g. that m is a power of two.

Define the frequency f_e of an element e as the number of sets in I that contain e. We construct log m instances I_1, . . . , I_{log m} of the f-FrequencySetDisjointness problem. For each j, 1 ≤ j ≤ log m, the instance I_j contains the following sets:
• For each i ∈ [1, m], a set S_i^j containing all e ∈ S_i that satisfy 2^{j−1} ≤ f_e < 2^j;
• 2^j − 1 "dummy sets", which contain extra copies of elements to make sure that all elements have the same frequency. That is, we add every element with 2^{j−1} ≤ f_e < 2^j to the first 2^j − f_e dummy sets. These sets will not be queried in the reduction.

Instance I_j has O(m) sets, and every element in it occurs exactly 2^j times. Further, the total number of elements in all the instances is at most 2N. We now build f-FrequencySetDisjointness data structures D_j = D(I_j) for each of the log m instances.

To answer a SetDisjointness query for two sets S_{i_1} and S_{i_2}, we query D_j for the sets S_{i_1}^j and S_{i_2}^j, for each 1 ≤ j ≤ log m. If there exists a j such that S_{i_1}^j and S_{i_2}^j are not disjoint, we output that S_{i_1} and S_{i_2} are not disjoint. Else, we output that they are disjoint.

If there exists e ∈ S_{i_1} ∩ S_{i_2}, let j be such that 2^{j−1} ≤ f_e < 2^j. Then e ∈ S_{i_1}^j ∩ S_{i_2}^j, and we will correctly output that the sets are not disjoint. If S_{i_1} and S_{i_2} are disjoint, then, since S_{i_1}^j is a subset of S_{i_1} and S_{i_2}^j is a subset of S_{i_2}, the queried sets are disjoint in every instance. Thus, we also answer correctly in this case.

Figure 3: Instance of the f-FrequencySetDisjointness problem reduced to Exists. Alphabet Σ = {0, 1} and fixed frequency f = 2, resulting in block size B = 2 · log m + 2.

Let N_j denote the total number of elements in I_j. For each j, we have N_j ≤ 2N and thus N_j^{2−2δ−ε} ≤ (2N)^{2−2δ−ε}.
Thus, the space complexity is asymptotically bounded by

  Σ_{j=1}^{⌈log m⌉} N_j^{2−2δ−ε} = O(N^{2−2δ−ε} log m).

Similarly, we have N_j^δ = O(N^δ), and so the time complexity is asymptotically bounded by

  Σ_{j=1}^{⌈log m⌉} N_j^δ = O(N^δ log m).

This is a contradiction to Conjecture 1, which requires Ω̃(N^{2−2δ}) space for query time Õ(N^δ).
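The bucketing step of the reduction is easy to simulate. The sketch below is our own illustration: plain set intersection stands in for the black-box data structures D_j, and the function names are ours. It splits an instance into frequency classes, pads with dummy sets so that every element in instance I_j occurs exactly 2^j times, and answers a disjointness query from the bucketed instances:

```python
from collections import defaultdict

def build_instances(sets):
    """Split a SetDisjointness instance into instances I_j in which every
    element occurs in exactly 2^j sets (real occurrences plus dummy sets)."""
    freq = defaultdict(int)
    for s in sets:
        for e in s:
            freq[e] += 1
    instances = {}
    j = 1
    while 2 ** (j - 1) <= len(sets):  # frequencies lie in [1, m]
        lo, hi = 2 ** (j - 1), 2 ** j
        # real sets, restricted to elements with frequency in [2^(j-1), 2^j)
        real = [{e for e in s if lo <= freq[e] < hi} for s in sets]
        # 2^j - 1 dummy sets; element e joins the first 2^j - f_e of them
        dummies = [set() for _ in range(hi - 1)]
        for e, f in freq.items():
            if lo <= f < hi:
                for d in range(hi - f):
                    dummies[d].add(e)
        instances[j] = (real, dummies)
        j += 1
    return instances

def disjoint(sets, i1, i2):
    """Answer a SetDisjointness query from the bucketed instances: the sets
    are disjoint iff their restrictions are disjoint in every instance I_j."""
    return all(not (real[i1] & real[i2])
               for real, _ in build_instances(sets).values())
```

For instance, on the input [{1, 2}, {2, 3}, {4}, {3, 5}], element 2 has frequency 2 and lands in instance I_2, where the query for the first two sets detects the intersection; sets 1 and 3 remain disjoint in every instance.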
We can reduce the f-FrequencySetDisjointness problem to Exists queries of the gapped indexing problem. Assume we are given an instance of the f-FrequencySetDisjointness problem with a total of N elements, where each distinct element occurs f times. Assume again w.l.o.g. that the number of sets m is a power of two. Assign to each set S_i in the instance a unique binary string w_i of length log m. Build a string S as follows: Consider an arbitrary ordering e_1, e_2, . . . of the distinct elements present in the instance. Let $ be an extra letter not in the alphabet. The first B = f · log m + f letters of S are the concatenation of w_i$ over all sets S_i that e_1 is contained in, sorted by i. This block is followed by B copies of $. Then, we have B symbols consisting of the strings for each set that e_2 is contained in, again followed by B copies of $, and so on. See Figure 3 for an example.

For a query for two sets S_i and S_j, where i < j, we set P_1 = w_i, P_2 = w_j, α = 0, and β = B. If the sets are disjoint, then there are no occurrences of w_i and w_j that are at most B apart. Otherwise, w_i and w_j occur in the same block, and w_j comes after w_i. The length of the string S is 2N log m + 2N: in the block for each element, we have log m + 1 letters for each of its occurrences, and each block is followed by a $-block of the same length.

This means that if we can solve Exists queries in s(n) space and t(n) + O(|P_1| + |P_2|) time, where n is the length of the string, then we can solve the f-FrequencySetDisjointness problem in s(2N log m + 2N) space and t(2N log m + 2N) + O(log m) time. Together with Lemma 4, Theorem 2 follows.

[0, β] Gaps
In this section, we consider the special case where the queries are one-sided intervals of the form [0, β]. We give a data structure supporting the following tradeoffs.

Theorem 5.
Given a string of length n, we can
(i) construct an O(n) space data structure that supports Exists(P_1, P_2, 0, β) queries in O(|P_1| + |P_2| + √n log^ε n) time, for constant ε > 0, or
(ii) construct an O(n log n) space data structure that supports Count(P_1, P_2, 0, β) and Report(P_1, P_2, 0, β) queries in O(|P_1| + |P_2| + √(n · occ) log log n) time, where occ is the size of the output.

Note that since these results match (up to log factors) the best known results for set intersection, this is about as good as we can hope for. We mention that for this specific problem, a similar tradeoff follows from the strategies used by Hon et al. [25]. The results from that paper include (among others) a data structure for documents such that, given a query of two patterns P_1 and P_2 and a number k, one can output the k documents with the closest occurrences of P_1 and P_2. Thus, the problem is slightly different; however, with some adjustments, the results from Theorem 5 follow (up to a log factor). We show a simple, direct solution.

The data structure is a simpler version of the data structure considered in the previous sections. The main idea is that for each pair of boundary nodes u and v, we do not have to store an array of distances, but only one number that carries all the information: the smallest distance of a consecutive occurrence of str(u) and str(v). Thus, for existence, we can cluster with τ = √n to achieve linear space, and we do not need to check large distances separately. For the reporting solution, we store the decomposition from Section 4.1 and use the matrix M to decide where to recurse. In the following, we describe the details.

Existence data structure.
For solving Exists queries in this setting, we cluster the suffix tree with parameter τ = √n. Again, we store the linear space orthogonal range successor data structure by Nekrich and Navarro [41] on the suffix array. For each pair of boundary nodes (u, v), we store at M(u, v) the minimum distance of a consecutive occurrence of str(u) and str(v). The total space is linear.

To query, we proceed similarly as in Section 3 for the "small distances": We find the loci of P_1 and P_2. If any of the loci is not on a spine, we check all consecutive occurrences using FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·). If both loci are on a spine, denote by b_1 and b_2 the lower boundary nodes of the respective clusters, and look up M(b_1, b_2). If M(b_1, b_2) ≤ β, we can immediately return Yes: if a valid occurrence (i′, j′) of str(b_1) and str(b_2) exists, then either (i′, j′) is a consecutive occurrence of P_1 and P_2, or there exists a consecutive occurrence of smaller distance. Otherwise, that is, if M(b_1, b_2) > β, all valid occurrences (i, j) have the property that either i is in the cluster of locus(P_1) or j is in the cluster of locus(P_2), and we check all such pairs using FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·). The running time is O(|P_1| + |P_2| + √n log^ε n).

Reporting data structure.
For the reporting data structure, we store the decomposition of the suffix tree as described in Section 4.1, and the O(n log log n) space orthogonal range successor data structure by Zhou [45] on the suffix array. For each induced subtree of level i in the decomposition, we store the existence data structure we just described.

Reporting algorithm.
The algorithm follows a similar, but simpler, recursive structure as in Section 4. We begin by finding the loci of P_1 and P_2. If either of the loci is not on a spine, we find all consecutive occurrences using FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·), check if they are valid, report these, and terminate. If both loci are on a spine, we check M(b_1, b_2) for the lower boundary nodes b_1 and b_2. If M(b_1, b_2) > β, all valid occurrences (i, j) have the property that either i is in the cluster of locus(P_1) or j is in the cluster of locus(P_2). We check all such pairs using FindConsecutive_{P_1}(·) resp. FindConsecutive_{P_2}(·), report the valid occurrences, and terminate. If M(b_1, b_2) ≤ β, we recurse on the children trees; that is, we check the border case and follow pointers to the loci in the children trees.

Analysis.
The space is O(n log n), just as in Section 4.

For the time analysis, we spend O(√(n/2^{l_i})) orthogonal range successor queries on the nodes in the decomposition tree of level l_i where we stop the recursion. For all other nodes we visit in the tree traversal, we only spend a constant number of queries. In total, we visit O(occ log(n/occ) + occ) decomposition tree nodes (by following the analysis in [15]), and we spend O(√(n/2^{l_i})) orthogonal range successor queries on O(occ) many such nodes. We use the same notation as in Section 4. By x = O(occ) we now denote the number of nodes where we stop the algorithm and output. Since each such node can be seen as a leaf in a binary tree, Σ_{i=1}^{x} 1/2^{l_i} ≤ 1. Again, we apply Hölder's inequality (1), this time with p = q = 2 (the Cauchy–Schwarz inequality). We get as an asymptotic bound for the number of orthogonal range successor queries

  Σ_{i=1}^{x} √(n/2^{l_i}) = √n Σ_{i=1}^{x} √(1/2^{l_i}) · 1 ≤ √n √(Σ_{i=1}^{x} 1/2^{l_i}) √(Σ_{i=1}^{x} 1) ≤ √n √x = O(√(n · occ)).

Note that since occ log(n/occ) = O(occ √(n/occ)) = O(√(n · occ)), this brings the total number of orthogonal range successor queries to O(occ + √(n · occ)). Using the data structure by Zhou [45], the time bound from Theorem 5 follows.

We have considered the problem of gapped indexing for consecutive occurrences. We have given a linear space data structure that can count the number of such occurrences. For the reporting problem, we have given a near-linear space data structure. The running time for both includes an O(n^{2/3}) term, which leaves a gap of O(n^{1/6}) to the conditional lower bound of Ω̃(√n).
Thus, the most obvious open question is whether we can close this gap, either by improving the data structure or by finding a stronger lower bound.

Further, we have used the property that there can only be few consecutive occurrences of large distance. Thus, our solution cannot easily be extended to finding all pairs of occurrences with distance within the query interval. An open question is whether it is possible to obtain similar results for that problem. Lastly, document versions of similar problems have concerned themselves with finding all documents that contain P_1 and P_2, or the top-k such documents of smallest distance; conditional lower bounds for these problems are also known. It would be interesting to see if any of these results can be extended to finding all documents that contain a (consecutive) occurrence of P_1 and P_2 with distance within a query interval.

References

[1] Stephen Alstrup, Jacob Holm, Kristian de Lichtenberg, and Mikkel Thorup. Minimizing diameters of dynamic trees. In
Proc. 24th ICALP, pages 270–280, 1997.
[2] Stephen Alstrup, Jacob Holm, and Mikkel Thorup. Maintaining center and median in dynamic trees. In Proc. 7th SWAT, pages 46–56, 2000.
[3] Stephen Alstrup and Theis Rauhe. Improved labeling scheme for ancestor queries. In Proc. 13th SODA, pages 947–953, 2002.
[4] Amihood Amir, Timothy M. Chan, Moshe Lewenstein, and Noa Lewenstein. On hardness of jumbled indexing. In Proc. 41st ICALP, pages 114–125, 2014.
[5] Amihood Amir, Tsvi Kopelowitz, Avivit Levy, Seth Pettie, Ely Porat, and B. Riva Shalom. Mind the gap: Essentially optimal algorithms for online dictionary matching with one gap. In Proc. 27th ISAAC, pages 12:1–12:12, 2016.
[6] Johannes Bader, Simon Gog, and Matthias Petri. Practical variable length gap pattern matching. In Proc. 15th SEA, pages 1–16, 2016.
[7] Philip Bille and Inge Li Gørtz. The tree inclusion problem: In linear space and faster. ACM Trans. Algorithms, 7(3):1–47, 2011.
[8] Philip Bille and Inge Li Gørtz. Substring range reporting. Algorithmica, 69(2):384–396, 2014.
[9] Philip Bille, Inge Li Gørtz, Max Rishøj Pedersen, Eva Rotenberg, and Teresa Anna Steiner. String Indexing for Top-k Close Consecutive Occurrences. In Proc. 40th FSTTCS, volume 182, pages 14:1–14:17, 2020.
[10] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and Søren Vind. String indexing for patterns with wildcards.
Theory Comput. Syst., 55(1):41–60, 2014.
[11] Philip Bille, Inge Li Gørtz, Hjalte Wedel Vildhøj, and David Kofoed Wind. String matching with variable length gaps. Theor. Comput. Sci., 443, 2012. Announced at SPIRE 2010.
[12] Sudip Biswas, Arnab Ganguly, Rahul Shah, and Sharma V Thankachan. Ranked document retrieval for multiple patterns. Theor. Comput. Sci., 746:98–111, 2018.
[13] P Bucher and A Bairoch. A generalized profile syntax for biomolecular sequence motifs and its function in automatic sequence interpretation. In Proc. 2nd ISMB, pages 53–61, 1994.
[14] Manuel Cáceres, Simon J Puglisi, and Bella Zhukova. Fast indexes for gapped pattern matching. In Proc. 46th SOFSEM, pages 493–504, 2020.
[15] Hagai Cohen and Ely Porat. Fast set intersection and two-patterns matching. Theor. Comput. Sci., 411(40-42):3795–3800, 2010.
[16] Paolo Ferragina, Nick Koudas, S. Muthukrishnan, and Divesh Srivastava. Two-dimensional substring indexing. J. Comput. Syst. Sci., 66(4):763–774, 2003.
[17] Greg N Frederickson. Ambivalent data structures for dynamic 2-edge-connectivity and k smallest spanning trees. SIAM J. Comput., 26(2):484–538, 1997.
[18] Michael L. Fredman, János Komlós, and Endre Szemerédi. Storing a sparse table with O(1) worst case access time. J. ACM, 31(3):538–544, 1984.
[19] Kimmo Fredriksson and Szymon Grabowski. Efficient algorithms for pattern matching with general gaps, character classes, and transposition invariance.
Inf. Retr., 11(4):335–357, 2008.
[20] Isaac Goldstein, Tsvi Kopelowitz, Moshe Lewenstein, and Ely Porat. Conditional Lower Bounds for Space/Time Tradeoffs. In Proc. 15th WADS, pages 421–436, 2017.
[21] Tuukka Haapasalo, Panu Silvasti, Seppo Sippu, and Eljas Soisalon-Soininen. Online dictionary matching with variable-length gaps. In Proc. 10th SEA, pages 76–87, 2011.
[22] K Hofmann, P Bucher, L Falquet, and A Bairoch. The PROSITE database, its status in 1999. Nucleic Acids Res., 27(1):215–219, 1999.
[23] Wing-Kai Hon, Manish Patil, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Indexes for document retrieval with relevance. In Space-Efficient Data Structures, Streams, and Algorithms - Papers in Honor of J. Ian Munro on the Occasion of His 66th Birthday, pages 351–362, 2013.
[24] Wing-Kai Hon, Manish Patil, Rahul Shah, and Shih-Bin Wu. Efficient index for retrieving top-k most frequent documents. J. Discrete Algorithms, 8(4):402–417, 2010.
[25] Wing-Kai Hon, Rahul Shah, Sharma V. Thankachan, and Jeffrey Scott Vitter. Space-efficient frameworks for top-k string retrieval. J. ACM, 61(2):1–36, 2014. Announced at 50th FOCS.
[26] Wing-Kai Hon, Sharma V. Thankachan, Rahul Shah, and Jeffrey Scott Vitter. Faster compressed top-k document retrieval. In
Proc. 23rd DCC, pages 341–350, 2013.
[27] Costas S Iliopoulos and M Sohel Rahman. Indexing factors with gaps. Algorithmica, 55(1):60–70, 2009.
[28] Orgad Keller, Tsvi Kopelowitz, and Moshe Lewenstein. Range non-overlapping indexing and successive list indexing. In Proc. 11th WADS, pages 625–636, 2007.
[29] Tsvi Kopelowitz, Seth Pettie, and Ely Porat. Higher lower bounds from the 3SUM conjecture. In Proc. 27th SODA, pages 1272–1287, 2016.
[30] Kasper Green Larsen, J Ian Munro, Jesper Sindahl Nielsen, and Sharma V Thankachan. On hardness of several string indexing problems. Theor. Comput. Sci., 582:74–82, 2015.
[31] Moshe Lewenstein. Indexing with gaps. In Proc. 18th SPIRE, pages 135–143, 2011.
[32] Gerhard Mehldau and Gene Myers. A system for pattern matching applications on biosequences. Bioinformatics, 9(3):299–314, 1993.
[33] J. Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, and Sharma V. Thankachan. Top-k term-proximity in succinct space.
Algorithmica, 78(2):379–393, 2017. Announced at 25th ISAAC.
[34] J. Ian Munro, Gonzalo Navarro, Rahul Shah, and Sharma V. Thankachan. Ranked document selection. Theor. Comput. Sci., 812:149–159, 2020.
[35] Eugene W. Myers. Approximate matching of network expressions with spacers. J. Comput. Bio., 3(1):33–51, 1992.
[36] Gonzalo Navarro. Spaces, trees, and colors: The algorithmic landscape of document retrieval on sequences. ACM Comput. Surv., 46(4):1–47, 2014.
[37] Gonzalo Navarro and Yakov Nekrich. Time-optimal top-k document retrieval. SIAM J. Comput., 46(1):80–113, 2017. Announced at 23rd SODA.
[38] Gonzalo Navarro and Mathieu Raffinot. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. J. Comput. Bio., 10(6):903–923, 2003.
[39] Gonzalo Navarro and Sharma V. Thankachan. New space/time tradeoffs for top-k document retrieval on sequences. Theor. Comput. Sci., 542:83–97, 2014. Announced at 20th SPIRE.
[40] Gonzalo Navarro and Sharma V. Thankachan. Reporting consecutive substring occurrences under bounded gap constraints.
Theor. Comput. Sci., 638:108–111, 2016. Announced at 26th CPM.
[41] Yakov Nekrich and Gonzalo Navarro. Sorted range reporting. In Proc. 13th SWAT, pages 271–282, 2012.
[42] Rahul Shah, Cheng Sheng, Sharma V. Thankachan, and Jeffrey Scott Vitter. Top-k document retrieval in external memory. In Proc. 21st ESA, pages 803–814, 2013.
[43] Dekel Tsur. Top-k document retrieval in optimal space. Inf. Process. Lett., 113(12):440–443, 2013.
[44] Peter Weiner. Linear pattern matching algorithms. In Proc. 14th FOCS, pages 1–11, 1973.
[45] Gelin Zhou. Two-dimensional range successor in optimal time and almost linear space.