Fine-Grained Complexity of Regular Expression Pattern Matching and Membership
Philipp Schepper
CISPA Helmholtz Center for Information Security, Saarbrücken, Germany
Saarbrücken Graduate School of Computer Science, Saarland Informatics Campus, Saarbrücken, Germany
[email protected]
Abstract
The currently fastest algorithm for regular expression pattern matching and membership improves the classical O(nm) time algorithm by a factor of about log^{3/2} n. Instead of focussing on general patterns, we analyse homogeneous patterns of bounded depth in this work. For them, a classification splitting the types into easy (strongly sub-quadratic) and hard (essentially quadratic time under SETH) is known. We take a very fine-grained look at the hard pattern types from this classification and show a dichotomy: few types allow super-poly-logarithmic improvements, while the algorithms for the other pattern types can only be improved by a constant number of log-factors, assuming the Formula-SAT Hypothesis.

Theory of computation → Pattern matching
Keywords and phrases
Fine-Grained Complexity, Regular Expression, Pattern Matching, Dichotomy
Related Version
Full version of the paper accepted at ESA 2020 [22]. All presented lower bounds and an alternative proof of the upper bounds for pattern matching using the polynomial method are contained in the author's Master's thesis.
Funding
Supported by the European Research Council (ERC) consolidator grant No. 725978 SYSTEMATICGRAPH.
Acknowledgements
I thank Karl Bringmann for the supervision during the research for my Master's thesis, which this paper is based on, and especially for the pointer to Batch-OV, which simplified the upper bounds considerably.
1 Introduction

Regular expressions with the operations alternative |, concatenation ◦, Kleene Plus +, and Kleene Star ∗ are used in many fields of computer science, for example to search in texts and files or to replace strings by other strings, as the Unix tool sed does. But they are also used to analyse XML files [17, 18], for network analysis [12, 25], human computer interaction [13], and in biology to search for proteins in DNA sequences [16, 20].

The most intuitive problem for regular expressions is the membership problem. There we ask whether a given text t can be generated by a given regular expression p, i.e. is t ∈ L(p)? We also call p a pattern in the following. A similar problem is the pattern matching problem, where we are interested in whether some substring of the given text t can be matched by p. To simplify notation we define the matching language of p as M(p) := Σ^∗ L(p) Σ^∗. Then we want to check whether t ∈ M(p). The standard algorithm for both problems runs in time O(nm), where n is the text length and m the pattern size [23].

Based on the "Four Russians" trick, Myers showed an algorithm with running time O(nm/ log n) [19]. This result was improved to an O(nm log log n / log^{3/2} n) time algorithm by Bille and Thorup [5]. Although improved sub-quadratic time algorithms have been given for several special cases of pattern matching and membership [3, 11, 14], it remained an open question whether there are truly sub-quadratic time algorithms for the general
Table 1
Hard pattern types that have to be considered.
Pattern matching: ◦∗, ◦|◦, ◦|+, ◦+◦, ◦+|, |◦|, |◦+
Membership: +|◦|, +|◦+, |+|◦

case. The first conditional lower bounds were shown by Backurs and Indyk [4]. They introduced so-called homogeneous patterns and classified their hardness into easy, i.e. strongly sub-quadratic time solvable, and hard, requiring essentially quadratic time assuming the Strong Exponential Time Hypothesis (SETH). This classification of Backurs and Indyk was completed by a dichotomy for all homogeneous pattern types by Bringmann, Grønlund, and Larsen [8]. They reduced the hardness of all hard pattern types to the hardness of few pattern types of bounded depth. By this it was sufficient to check few cases instead of infinitely many.

To understand what a homogeneous pattern is, we observe that one can see patterns as rooted and node-labeled trees where the inner nodes correspond to the operations of the pattern. Then a pattern is homogeneous if the operations on each level of the tree are equal. The type of the pattern is the sequence of operations from the root to the leaves. See Section 2 for a formal introduction.

But as SETH rules out only polynomial improvements, super-poly-logarithmic runtime improvements are still feasible. Such improvements are known for
Orthogonal Vectors (OV) [2, 9], for example, although there is a known conditional lower bound for it based on SETH. But for pattern matching and membership no faster algorithms are known. By a reduction from
Formula-SAT
Abboud and Bringmann showed that, in general, pattern matching and membership cannot be solved in time O(nm/ log^ε n) under the Formula-SAT Hypothesis [1]. For
Formula-SAT one is given a De Morgan formula F over n inputs and of size s, i.e. the formula is a tree where each inner gate computes the AND or OR of two other gates and each of the s leaves is labeled with one of the n variables or their negation. The task is to find a satisfying assignment for F. While the naive approach takes time O(2^n s) to evaluate F on all possible assignments, there are polynomial improvements for formulas of size s = o(n^3) [10, 15, 21]. But despite intense research there is currently no faster algorithm known for s = n^3. Thus it seems reasonable to assume the following hypothesis:

Hypothesis 1.1 (Formula-SAT Hypothesis (FSH) [1]). There is no algorithm that can solve
Formula-SAT on De Morgan formulas of size s = n^3 in O(2^n / n^ε) time, for some ε > 0, in the Word-RAM model.

Although the new lower bound of O(nm/ log^ε n) is quite astonishing, since before only polynomial improvements had been ruled out, the bound is for the general case. It remained an open question whether it also holds for homogeneous patterns of bounded depth. Using the results by Bringmann, Grønlund, and Larsen [8] relating the hardness of different pattern types to each other, it suffices to check the pattern types in Table 1 for the corresponding problem.

We answer this last question and give a dichotomy for these hard pattern types: For few pattern types we give the currently fastest algorithm for pattern matching and membership. For the remaining patterns we show improved lower bounds of the form Ω(nm/ log^c n), where c is a "small" constant only depending on the type of the pattern that arises from our reductions.

Figure 1
The classification of the patterns for pattern matching. The red bounds are shown in this paper, while the blue ones follow as corollaries.
Theorem 1.2.
For texts of length n and patterns of size m we have the following time bounds for the stated problems:
- nm/2^{Ω(√(log min(n,m)))} for |◦|- and |◦+-pattern matching, and +|◦|- and +|◦+-membership,
- Θ(nm/ poly log n) for pattern matching and membership with types ◦+|, ◦|+, ◦+◦, ◦|◦, and ◦∗, and for |+|◦-membership, unless FSH is false.

This dichotomy result gives us a simple classification for the hard pattern types. Depending on the pattern type, one can decide if there is a super-poly-logarithmic algorithm, or if even the classical algorithm is optimal up to a constant number of log-factors. See Figure 1 for an overview of the results for pattern matching. The corresponding figures for membership are shown in Appendix C. Further, the dichotomy shows that the type of a pattern has a larger impact on the hardness than the depth. The alternative as outer operation of the "easier" patterns allows us to split the pattern into independent sub-patterns. This is crucial for the speed-up, since pattern matching for ◦+ and ◦| is near-linear time solvable [4, 11]. Contrary to this, almost all hard pattern types have a concatenation as outer operation, which does not allow this decomposition into independent problems. Further, the length of the matched texts can vary largely. The pattern (a|aba)(b|bca)(a|ab), for example, can match strings of length 3 to 8. We exploit both properties in our reductions, especially to encode a boolean OR.
In Section 2 we give a formal definition of homogeneous patterns and state the problems we start reducing from and the ones we reduce to. We show the algorithms for the upper bounds in Section 3. In Section 4 we give the improved lower bounds for pattern matching, while the ones for membership are given in Section 5.
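Before the formal definitions of the next section, the parse-tree view of patterns sketched in the introduction can be made concrete. A minimal illustration (not from the paper; 'o' stands in for ◦ and '*' for the Kleene Star):

```python
def levels(pattern, depth=0, acc=None):
    """Collect the operation labels per level of a pattern's parse tree.
    A pattern is either a symbol (leaf) or a tuple (op, children)
    with op in {'|', 'o', '+', '*'}."""
    if acc is None:
        acc = {}
    if isinstance(pattern, tuple):
        op, children = pattern
        acc.setdefault(depth, set()).add(op)
        for child in children:
            levels(child, depth + 1, acc)
    return acc

def homogeneous_type(pattern):
    """Return the type (root-to-leaves operation sequence) if the
    pattern is homogeneous, otherwise None."""
    acc = levels(pattern)
    if any(len(ops) != 1 for ops in acc.values()):
        return None
    return "".join(acc[d].pop() for d in sorted(acc))

# The pattern [(ab|c)d]^+ has type '+o|o' in this notation.
example = ('+', [('o', [('|', [('o', ['a', 'b']), 'c']), 'd'])])
```

In this encoding, mixing operations on one level (e.g. a ◦-node and a +-node as siblings) makes the pattern inhomogeneous and the function returns None.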
Regular Expressions.
Recall that patterns over a finite alphabet Σ are built recursively from other patterns using the operations |, ◦, +, and ∗. We construct the patterns and the language of each pattern (i.e. the set of words matched by the pattern) as follows. Each symbol σ ∈ Σ is a pattern representing the language L(σ) = {σ}. Let in the following p_1 and p_2 be two patterns. For the alternative operation we define L(p_1 | p_2) = L(p_1) ∪ L(p_2). For the concatenation we define L(p_1 ◦ p_2) = {w_1 w_2 | w_1 ∈ L(p_1) ∧ w_2 ∈ L(p_2)}. For the Kleene Plus we set L(p^+) = {w | ∃k ≥ 1 ∃w_1, ..., w_k ∈ L(p) : w = w_1 ··· w_k}. With ε as the empty word we have L(p^∗) = L(p^+) ∪ {ε} for the Kleene Star.

Based on this construction it is easy to see patterns as rooted and node-labeled trees where each inner node is labeled by an operation and the leaves are labeled by symbols. We call this tree the parse tree of a pattern in the following. Then each node is connected to the node representing the sub-pattern p_1, and also to the one for p_2 in the case of the binary operations ◦ and |. We define the size of a pattern to be the number of inner nodes plus the number of leaves in the parse tree. We extend the definition of the alternative and the concatenation in the natural way to more than two sub-patterns. To simplify notation we omit the symbol ◦ from the patterns in the following.

We call a pattern homogeneous if, for each level of the parse tree, all inner nodes are labeled with the same operation. We define the type of a homogeneous pattern p to be the sequence of operations from the root of the parse tree of p to the deepest leaf. The depth of a pattern is the depth of the tree, which is equal to the number of operations in the type. For example, the pattern [(abc|c)(a|dc)c(db|c|bd)]^+ is of type +◦|◦ and has depth 4.

Relations between Pattern Types.
Backurs and Indyk showed in [4] the first quadratic-time lower bounds for several homogeneous patterns based on SETH. This classification was completed by the dichotomy result of Bringmann, Grønlund, and Larsen in [8]. As there are infinitely many homogeneous pattern types, they showed linear-time reductions between different pattern types. By these reductions, lower bounds also transfer to other (more complicated) pattern types, and faster algorithms also give improvements for other (equivalent) patterns.
Lemma 2.1 (Lemma 1 and 8 in the full version of [8]). For any type T, applying any of the following rules yields a type T′ such that both are equivalent for pattern matching and membership under linear-time reductions, respectively:
- For pattern matching: remove a prefix + and replace a prefix |+ by |.
- For membership: replace any substring +|+ by +| and replace a prefix r∗ by r+ for any r ∈ {+, |}^∗.
- For both problems: replace any substring pp, for any p ∈ {◦, |, ∗, +}, by p.
We say that T simplifies if one of these rules applies. Applying these rules in any order will eventually lead to an unsimplifiable type.

Lemma 2.2 (Lemma 6 and 9 in the full version of [8]). For types T_1 and T_2, there is a linear-time reduction from T_1-pattern matching/membership to T_2-pattern matching/membership if one of the following sufficient conditions holds:
- T_1 is a prefix of T_2,
- we may obtain T_2 from T_1 by replacing a ∗ by +∗,
- we may obtain T_2 from T_1 by inserting a | at any position,
- only for membership: T_1 starts with ◦ and we may obtain T_2 from T_1 by prepending a + to T_1.

Together with the already known sub-quadratic time algorithms for various pattern types [3, 4, 8, 11, 14], it suffices to check the remaining cases in Table 1 to get a fine-grained dichotomy for the hard pattern types (i.e. the ones requiring essentially quadratic time under SETH).
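As an illustration of how such rules operate on type strings, the following sketch applies a subset of them (types as strings over '|', 'o', '+', '*', with 'o' standing for concatenation; the membership rule for prefixes r∗ is omitted for brevity — an illustration, not the paper's code):

```python
def simplify_type(t, problem="matching"):
    """Repeatedly apply (a subset of) the simplification rules of
    Lemma 2.1 to a type string until none applies."""
    changed = True
    while changed:
        changed = False
        # Both problems: replace any substring pp by p.
        for op in "o|*+":
            if op + op in t:
                t = t.replace(op + op, op)
                changed = True
        if problem == "matching":
            if t.startswith("+"):      # remove a prefix +
                t = t[1:]
                changed = True
            if t.startswith("|+"):     # replace a prefix |+ by |
                t = "|" + t[2:]
                changed = True
        elif problem == "membership":
            if "+|+" in t:             # replace any substring +|+ by +|
                t = t.replace("+|+", "+|")
                changed = True
    return t
```

For instance, the type ++◦| simplifies to ◦| for pattern matching, and +|+◦ simplifies to +|◦ for membership.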
Hypothesis.
As mentioned in the introduction, we follow the ideas of Abboud and Bringmann in [1] and show reductions from Formula-SAT to pattern matching to prove lower bounds. Likewise as in their result, we also start from the intermediate problem
Formula-Pair: Given a monotone De Morgan formula F of size s, that is, a De Morgan formula where each leaf is labeled with a variable, i.e. no negation allowed, and each variable is used only once. Further, one is given two sets A, B of half-assignments to s/2 of the variables of F with |A| = n and |B| = m. The task is to find a pair a ∈ A, b ∈ B such that F(a, b) = true.

There is an intuitive reduction from Formula-SAT to Formula-Pair as shown in [1]. Thus, FSH implies the following hypothesis, which we prove in Appendix A:
Hypothesis 2.3 (Formula-Pair Hypothesis (FPH)). For all k ≥ 1, there is no algorithm that can solve Formula-Pair for a monotone De Morgan formula F of size s and sets A, B ⊆ {0,1}^{s/2} of size n and m, respectively, in time O(nm s^k / log^{k+2} n) in the Word-RAM model.

Batch-OV.
For the upper bounds we transform texts and patterns into bit-vectors such that they are orthogonal if and only if the text is matched by the pattern. This gives us a reduction from pattern matching to Orthogonal Vectors (OV) [9, 24]. But to improve the runtime we process many texts simultaneously using the following lemmas.
Lemma 2.4 (Batch-OV (cf. [9])). Let A, B ⊆ {0,1}^d with |A| = |B| = n and d ≤ 2^{c^{-1}√(log n)} for some constant c > 1. We can decide for all vectors a ∈ A whether there is a vector b ∈ B such that ⟨a, b⟩ = 0 in time n^2 / 2^{εc√(log n)} for sufficiently small ε > 0.

We generalise this balanced case to the unbalanced case, which we use later:
Lemma 2.5 (Unbalanced Batch-OV). Let A, B ⊆ {0,1}^d with |A| = n and |B| = m, and d ≤ 2^{c^{-1}√(log min(n,m))} for some constant c > 1. We can decide for all vectors a ∈ A whether there is a vector b ∈ B such that ⟨a, b⟩ = 0 in time nm / 2^{εc√(log min(n,m))} for sufficiently small ε > 0.

Proof. If n ≤ m, partition B into ⌈m/n⌉ sets of size n and run the algorithm from Lemma 2.4 on every instance in time ⌈m/n⌉ · n^2 / 2^{εc√(log n)} ≈ nm / 2^{εc√(log n)}. Analogously for n > m. ◀

3 Upper Bounds

For patterns p of type |◦| and |◦+ let p = (p_1 | p_2 | ... | p_k) be the pattern of size m; likewise for the patterns with a Kleene Plus as additional outer operation. Let further t = t_1 ··· t_n be the text of length n. The main idea of the fast algorithm is to compute a set of matched substrings: M = {(i, j) | ∃ℓ ∈ [k] : t_i ··· t_j ∈ L(p_ℓ)} ⊆ [n] × [n]. From M we construct a graph where the nodes correspond to different prefixes that can be matched. The tuples in M represent edges between these nodes. Then it remains to check whether the node corresponding to t is reachable.

Theorem 3.1 (Upper Bounds). We can solve in time nm / 2^{Ω(√(log min(n,m)))}:
- |◦|-pattern matching and +|◦|-membership,
- |◦+-pattern matching and +|◦+-membership.

To compute M we split the patterns into large and small ones. For the large patterns we compute the corresponding values of M sequentially, while for the small patterns we reduce to unbalanced Batch-OV and use the fast algorithm for this problem shown in Lemma 2.5.

+|◦| and |◦|

As mentioned in the beginning of this section, we compute the set M of matched substrings by partitioning the sub-patterns into large and small ones.

Lemma 3.2.
Given a text t of length n and patterns {p_i}_i of type ◦| such that Σ_i |p_i| = m. We can compute M in time nm / 2^{Ω(√(log min(n,m)))}.

Lemma 3.3 (Large Sub-Patterns). Given a text t of length n and patterns p_1, ..., p_ℓ of type ◦| such that Σ_{i=1}^{ℓ} |p_i| ≤ m. We can compute M in time O(ℓ n log min(n,m) + m).

Proof.
From a result by Cole and Hariharan [11] we know that there is an O(n log m̂ + m̂) time algorithm for ◦|-pattern matching with patterns of size m̂. We run this algorithm sequentially for every pattern. We can ignore all p_i with |p_i| > |Σ| n, since they match more than n symbols. We get |p_i| ≤ min(|Σ| n, m) ≤ min(n^2, m) ≤ min(n, m)^2. Since log min(n,m)^2 = 2 log min(n,m), each iteration takes time O(n log min(n,m) + |p_i|) and the claim follows. ◀

Lemma 3.4 (Small Sub-Patterns). Given a text t of length n and patterns p_1, ..., p_m of type ◦|. There is an f ∈ 2^{Ω(√(log min(n,m)))} such that the following holds: If |p_i| ≤ f for all i ∈ [m], then we can compute M in time nm / 2^{Ω(√(log min(n,m)))} with small error probability.

We postpone the proof of this lemma and first combine the results for small and large patterns to prove the main theorem.
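For concreteness, the set M that both lemmas compute can be produced naively in quadratic time. The sketch below simplifies ◦|-patterns to concatenations of sets of single symbols (an assumption for illustration — the paper allows alternatives of whole words):

```python
def compute_M_naive(text, patterns):
    """M = {(i, j) : text[i..j] (1-indexed, inclusive) is matched by some
    pattern}; a pattern of type o| is given here as a list of sets of
    symbols, one set of alternatives per position."""
    M = set()
    n = len(text)
    for p in patterns:
        k = len(p)  # such a pattern matches substrings of length exactly k
        for i in range(n - k + 1):
            if all(text[i + d] in p[d] for d in range(k)):
                M.add((i + 1, i + k))
    return M
```

This runs in time O(nm) overall and serves only as a baseline for the faster constructions of Lemmas 3.3 and 3.4.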
Proof of Lemma 3.2.
Choose f ∈ 2^{Ω(√(log min(n,m)))} as in Lemma 3.4 and split the patterns into large patterns of size > f and small patterns of size ≤ f.

For the at most m/f large patterns, compute M_> by Lemma 3.3 in time O(m/f · n log min(n,m) + m) ⊆ nm / 2^{Ω(√(log min(n,m)))}. Duplicate the ℓ small patterns m/ℓ times and compute M_≤ for the m small patterns by Lemma 3.4 in the claimed running time. ◀

Proof of Theorem 3.1 Item 1.
Construct M by Lemma 3.2. For |◦|-pattern matching, check whether M ≠ ∅, since any matched substring is sufficient.

For +|◦|-membership we construct a graph G with nodes v_0, ..., v_n, where we put an edge from v_{i−1} to v_j if (i, j) ∈ M. Then v_n is reachable from v_0 iff there is a decomposition of t into substrings which can be matched by the p_i's. This reachability check can be performed in time O(n + |M|) by a depth-first search starting from v_0. ◀

For the proof of Lemma 3.4 we proceed as follows. For the construction of M for small sub-patterns we define some threshold f and check for every substring of t of length at most f whether there is a pattern that matches this substring. This check is reduced to Batch-OV by encoding the substrings and patterns as bit-vectors.

For small alphabets with |Σ| < f this encoding is rather simple, since we can use a one-hot encoding of the alphabet. But for larger alphabets this does not work, as the dimension of the vectors would increase too much and the fast algorithm for Batch-OV could not be used anymore. Therefore, we define a randomised encoding χ to ensure that the final bit-vectors are not too large. For simplicity we can assume |Σ| = Θ(min(n,m)) by padding Σ with fresh symbols. The construction in the following lemma is based on the idea of Bloom filters [6].

Lemma 3.5 (Randomised Characteristic Vector). For a finite universe Σ and a threshold f ≤ 2^{O(√(log |Σ|))} there is a randomised χ : P(Σ) → {0,1}^d with d ∈ O(f log |Σ|) such that for all σ ∈ Σ and S ⊆ Σ with |S| ≤ f the following holds:
- If σ ∈ S, then χ(σ) := χ({σ}) ⊆ χ(S), i.e. ∀i ∈ [d] : χ({σ})[i] = 1 ⟹ χ(S)[i] = 1.
- If χ(σ) ⊆ χ(S), then σ ∈ S with high probability, i.e. ≥ 1 − 1/poly(|Σ|).

Proof.
We define χ element-wise and set for S ⊆ Σ: χ(S)[i] := ⋁_{s ∈ S} χ(s)[i], i.e. the bitwise OR over χ(s) for s ∈ S. Hence, the first claim already holds by definition. For each σ ∈ Σ we define χ(σ) independently by setting χ(σ)[i] = 1 with probability 1/f for all i ∈ [d]. Let S ⊆ Σ with |S| ≤ f and σ ∈ Σ \ S. For all i ∈ [d]:

Pr[χ(σ)[i] ⊄ χ(S)[i]] = Pr[χ(σ)[i] = 1 ∧ χ(S)[i] = 0] = (1/f)(1 − 1/f)^{|S|} ≥ (1/f)(1 − 1/f)^f ≥ e^{−2}/f

Pr[χ(σ) ⊆ χ(S)] = ∏_{i=1}^{d} (1 − Pr[χ(σ)[i] ⊄ χ(S)[i]]) ≤ (1 − e^{−2}/f)^d

Setting d = f · c · ln |Σ| for some arbitrary c > e^2, we get:

(1 − e^{−2}/f)^{f · c ln |Σ|} ≤ e^{−e^{−2} · c ln |Σ|} = |Σ|^{−c/e^2} = 1/poly(|Σ|) ◀

Proof of Lemma 3.4.
Define f = 2^{(√ε/2) · √(log min(n,m))} with ε as in Lemma 2.5, and let a be some fresh symbol we add to Σ. Let χ : P(Σ) → {0,1}^d be as in Lemma 3.5. For simplicity one can think of χ as the one-hot encoding of the alphabet Σ.

We define T_j := {t_i ··· t_{i+j−1} | 1 ≤ i ≤ n − j + 1} and P_j := {p_i | L(p_i) ⊆ Σ^j} for all j ∈ [f]. Then replace all symbols and sub-patterns of type | by bit-vectors by applying χ. Finally, pad every vector in T_j and P_j by f − j repetitions of χ(a) and flip all values of the vectors in P_j bit-wise, such that 1s become 0s and vice versa. Let T be the set of all ≤ nf modified texts and P be the set of all m transformed patterns.

We observe that a text vector in T is orthogonal to a pattern vector in P iff the original text was matched by the original pattern. Since the dimension is at most f · f ≤ 2^{√ε · √(log min(n,m))} ≤ 2^{√ε · √(log min(nf,m))}, we can apply Lemma 2.5 for T and P:

nf · m / 2^{√ε · √(log min(nf,m))} ≤ nm / 2^{(√ε − √ε/2) · √(log min(n,m))} ∈ nm / 2^{Ω(√(log min(n,m)))} ◀

+|◦+ and |◦+

First observe that even for small patterns M can be too large to be computed explicitly. For t = 0^n 1^n and p = 0^+ 1^+ we have M = [1, n] × [n+1, 2n] and thus cannot write down M explicitly in time o(nm).
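The orthogonality trick from the proof above, in its simplest form (a toy alphabet with a one-hot encoding in place of χ, and single-symbol alternatives — assumptions for illustration):

```python
SIGMA = "ab"  # toy alphabet; the proof pads Sigma and uses chi instead

def one_hot(symbol):
    return [1 if s == symbol else 0 for s in SIGMA]

def encode_text(substring):
    # concatenate one-hot blocks, one per position
    vec = []
    for ch in substring:
        vec += one_hot(ch)
    return vec

def encode_pattern(alternatives):
    # flipped bits: a 1 marks a symbol the position must NOT be
    vec = []
    for alt in alternatives:
        vec += [0 if s in alt else 1 for s in SIGMA]
    return vec

def matched(substring, alternatives):
    u, v = encode_text(substring), encode_pattern(alternatives)
    return sum(x * y for x, y in zip(u, v)) == 0  # orthogonal <=> matched
```

A collision (inner product > 0) occurs exactly at a position whose text symbol is forbidden by the pattern, so the text is matched iff the vectors are orthogonal.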
To get around this problem, we first define the run-length encoding r(u) of a text u as in [4]: We have r(ε) = ε. For a non-empty string starting with σ, let ℓ be the largest integer such that the first ℓ symbols of u are σ. Append the tuple (σ, ℓ) to the run-length encoding and recurse on u after removing the first ℓ symbols. We use the same approach for patterns of type ◦+. But if there occurs a σ^+ during these ℓ positions, we add (σ, ≥ℓ) to the encoding, otherwise (σ, =ℓ). For example, r(aaa^+b^+bc) = (a, ≥3)(b, ≥2)(c, =1).

The idea is to compute a subset of M which only contains those (i, j) such that there is no distinct (i′, j′) in the subset with i′ ≤ i and j′ ≥ j and both substrings of t are matched by the same pattern p_ℓ. We augment each tuple with two boolean flags, indicating whether the first and last run of the pattern p_ℓ contains a Kleene Plus. From this set M′ ⊆ {0,1} × [n] × [n] × {0,1} we can fully recover M. For our above example we get M′ = {(1, n, n+1, 1)}.

Lemma 3.6.
Given a text t of length n and patterns {p_i}_i of type ◦+ such that Σ_i |p_i| = m. We can compute M′ in time nm / 2^{Ω(√(log min(n,m)))}.

Lemma 3.7 (Large Sub-Patterns). Given a text t of length n and patterns p_1, ..., p_ℓ of type ◦+ such that Σ_{i=1}^{ℓ} |p_i| ≤ m. We can compute M′ in time O(ℓ n log min(n,m) + m).

Proof.
We modify all patterns such that their first and last run is of the form (σ, =ℓ), i.e. we remove every Kleene Plus from these two runs. There is an O(n log m̂ + m̂) time algorithm for ◦+-pattern matching with patterns of size m̂ shown in [4]. We run this algorithm sequentially for each altered pattern. For every tuple (i, j) the algorithm outputs, we add (f, i, j, e) to M′, where f and e are set to 1 iff the first and last run of the pattern contain a Kleene Plus, respectively.

We can ignore all p_i with |p_i| > |Σ| n, because they match more than n symbols. We get |p_i| ≤ min(|Σ| n, m) ≤ min(n^2, m) ≤ min(n, m)^2. Since log min(n,m)^2 = 2 log min(n,m), each iteration takes time O(n log min(n,m) + |p_i|) and the claim follows. ◀

Lemma 3.8 (Small Sub-Patterns). For a text t of length n and patterns p_1, ..., p_m of type ◦+, there is an f ∈ 2^{Ω(√(log min(n,m)))} such that the following holds: If |p_i| ≤ f for all i ∈ [m], then we can compute M′ in time nm / 2^{Ω(√(log min(n,m)))}.

We postpone the proof of this lemma and first show the final upper bound, as the proof of Lemma 3.2 also works for Lemma 3.6.
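The two run-length encodings defined earlier in this section can be transcribed directly. The pattern input format below — a plain string where '+' marks a Kleene Plus on the preceding symbol — is an assumption for illustration:

```python
def rle_text(u):
    """Run-length encoding r(u) of a text: list of (symbol, length)."""
    runs = []
    for ch in u:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs

def rle_pattern(p):
    """Run-length encoding of a o+ pattern given as a string such as
    'aaa+b+bc'. Each run becomes (symbol, '>=', r) if it contains a
    Kleene Plus, and (symbol, '=', r) otherwise."""
    runs = []
    i = 0
    while i < len(p):
        ch = p[i]
        count, plus = 0, False
        while i < len(p) and (p[i] == ch or (p[i] == '+' and p[i - 1] == ch)):
            if p[i] == '+':
                plus = True
            else:
                count += 1
            i += 1
        runs.append((ch, '>=' if plus else '=', count))
    return runs
```

For example, rle_pattern("aaa+b+bc") reproduces the encoding (a, ≥3)(b, ≥2)(c, =1) from the text above.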
Proof of Theorem 3.1 Item 2.
Use Lemma 3.6 to construct M′ and check for |◦+-pattern matching whether M′ ≠ ∅.

For +|◦+-membership we define a graph G = (V, E). Instead of having nodes v_0, ..., v_n as for +|◦|-membership, we have three versions of each node v_i, V := {v_i, v_i^0, v_i^1 | 0 ≤ i ≤ n}. The versions correspond to the different ways a suffix or prefix of a run can be matched. For node v_i we need that all symbols up to position i are explicitly matched by a pattern. For v_i^1 we need that the suffix of the run containing t_i has to be matched by a pattern starting with t_i^+. For v_i^0 we say that the prefix has to be matched by a pattern ending with t_{i−1}^+. Hence, we add edges for the runs simulating the σ^+ of a pattern: For each run (σ, ℓ) from position i to j in t with ℓ > 1, we add edges (v_{k−1}^1, v_k^1) and (v_k^0, v_{k+1}^0) to the graph for i ≤ k < j. Further, we add edges (v_i, v_i^1) and (v_i^0, v_i) to change between the states for all 0 ≤ i ≤ n. While this construction solely depends on the text, we add for each (f, i, j, e) ∈ M′ the edge (v_{i−1}^f, v_j^e) to the graph. We claim that there is a path from v_0 to v_n if and only if t ∈ L((p_1 | ··· | p_k)^+). We prove this claim in Appendix B. See Figure 2 for an example of the construction.

Figure 2
Graph for the pattern (a^+ | a^+b | bc^+ | cba | b^+a)^+ and text aaaabccba.

The time for the construction is linear in the output size. The graph has Θ(n) nodes and |M′| + O(n) edges. As the DFS runs in linear time, the overall runtime follows. ◀

It remains to show how the set M′ is constructed for small patterns.

Proof of Lemma 3.8.
Set f := 2^{(√ε/2) · √(log min(n,m))} with ε as in Lemma 2.5 and consider all ≤ n/f many long runs of length ≥ f in t. Check for each long run by an exhaustive search whether there is a p_i such that the following holds: The run in the text is matched by one of the ≤ |p_i| runs in p_i, and the remaining runs of p_i can match the contiguous parts of the text. This check can be performed in the following time for all long runs:

(n/f) · Σ_{i=1}^{m} |p_i| ≤ (n/f) · Σ_{i=1}^{m} f ≤ nm

Since a pattern can have at most f runs and each run now matches at most f symbols, it remains to check substrings of t of length at most f^2. Hence, define T = {t_i ··· t_{i+j−1} | ∀j ∈ [f^2], i ∈ [n − j + 1]} and ignore all substrings with more than f runs or runs longer than f. Convert these substrings and the patterns into bit-vectors by replacing the runs by the following bit-vectors of length 2 log |Σ| + 2f:

(c, r)
↦ ⟨c⟩ ⟨c̄⟩ 0^r 1^{f−r} 1^r 0^{f−r}        (c, =r)
↦ ⟨c̄⟩ ⟨c⟩ 1^r 0^{f−r} 0^r 1^{f−r}        (c, ≥r)
↦ ⟨c̄⟩ ⟨c⟩ 1^r 0^{f−r} 0^f

Here ⟨c⟩ denotes the unique binary representation of symbol c and ⟨c̄⟩ its bit-wise negation. One can easily see that two such vectors are orthogonal if and only if the runs match each other. Thus, a text vector and a pattern vector resulting from this transformation are orthogonal iff the text is matched by the pattern. By padding the vectors with 1s we normalise their length but still preserve orthogonality between text and pattern vectors with the same number of runs. Let T and P be the resulting sets with ≤ nf^2 and m elements, respectively.

From log |Σ| ≤ log min(n,m) ≤ f we get f(2 log |Σ| + 2f) ≤ 4f^2 ≤ 2^{√ε · √(log min(nf^2,m))} and hence can apply Lemma 2.5 for T and P. Actually, we have to partition P depending on whether a pattern has a Kleene Plus in its first and last run. Thus we need four iterations, but we can always duplicate patterns such that there are m patterns in each group.

nf^2 · m / 2^{(3√ε/2) · √(log min(nf^2,m))} ≤ nm / 2^{(√ε − √ε/2) · √(log min(n,m))} ∈ nm / 2^{Ω(√(log min(n,m)))} ◀

4 Lower Bounds for Pattern Matching

Abboud and Bringmann showed in [1] a lower bound for pattern matching (and membership) in general of O(nm/ log^ε n), unless FSH is false. We use this result and the corresponding reduction as a basis to show similar lower bounds for the remaining hard pattern types. But we also do not start our reductions directly from Formula-SAT, but from
Formula-Pair as defined in Section 2 and use the corresponding
Formula-Pair Hypothesis from Hypothesis 2.3.
Theorem 4.1.
There are constants c_{◦∗} = 76, c_{◦+◦} = c_{◦|+} = 72, c_{◦|◦} = 81, and c_{◦+|} = 27 such that pattern matching with patterns of type T ∈ {◦∗, ◦+◦, ◦|◦, ◦+|, ◦|+} cannot be solved in time O(nm/ log^{c_T} n), even for constant sized alphabets, unless FPH is false.

We show the lower bounds by a reduction from
Formula-Pair to pattern matching:
Lemma 4.2.
Given a
Formula-Pair instance with a formula of size s, depth d, and sets A and B with n and m ≤ n assignments. (If m > n, swap A and B.) We can reduce this to pattern matching with a text t and a pattern p of type T ∈ {◦∗, ◦+◦, ◦|◦, ◦+|, ◦|+} over a constant sized alphabet in time linear in the output size. We have |t| ∈ O(n · c^d · s log s) for a constant c (with a smaller constant for ◦+|). Further, |p| ∈ O(m · b_T^d · s log s) with b_{◦∗} = 6, b_{◦+◦} = b_{◦|+} = 5, b_{◦|◦} = 8, and b_{◦+|} = 1.

Proof of Theorem 4.1.
We show the result only for patterns of type ◦+◦; the proof for the other types is analogous.

Let F be a formula of size s with two sets of n half-assignments each, and let d be the depth of F. Applying the depth-reduction technique of Bonet and Buss [7] gives us an equivalent formula F′ with size s′ ≤ s^2 and depth d′ ≤ 6 ln s. By Lemma 4.2 we get a pattern matching instance with a text t and pattern p, both of size O(n · 5^{d′} · s^2 log s) = O(n · s^{6 ln 5} · s^2 log s) = O(n s^{6 ln 5 + 2} log s). Now assume there is an algorithm for pattern matching with the stated running time and run it on t and p:

O( n s^{6 ln 5 + 2} log s · n s^{6 ln 5 + 2} log s / log^{72}(n s^{6 ln 5 + 2} log s) ) ⊆ O( n^2 s^
{12 ln 5 + 4} log^2 s / log^{72} n ) ⊆ O( n^2 s^{24} / log^{26} n ). But this contradicts FPH (applied with k = 24), which we assumed to be true. ◀

4.1 ◦+◦

As the details of the reductions heavily depend on the pattern types, we give each reduction in a separate section. But we use the reduction for ◦+◦ as a basis for the other proofs. For all reductions we first encode the evaluation of a formula on two half-assignments, then the encoding for finding such a pair. We define the actual text t_g and the actual pattern p_g. The universal text u_g and the universal pattern q_g are needed for technical purposes and do not depend on the assignments. A formula of size s (i.e. with s leaves) has s − 1 inner gates, i.e. 2s − 1 gates in total. We assign each gate g a unique integer in [2s − 1] and write ⟨g⟩ for the binary encoding of the ID of gate g. We can always see ⟨g⟩ as a sequence of ⌊log(2s − 1)⌋ + 1 ≤ ⌊log s⌋ + 2 = Θ(log s) bits, padded with zeros if necessary. For a fixed gate g we define a separator gadget G_g := 2⟨g⟩.

INPUT Gate
The text and the pattern depend on the variable that is read:
For F_g(a, b) = a_i define t_g := 0 0^{a_i} 1 1 as the text and p_g := 0 0+ 1 1+ as the pattern.
For F_g(a, b) = b_i define t_g := 0 1 1 as the text and p_g := 0^{1−b_i} 0+ 1 1+ as the pattern.
Define u_g := 0011 as the universal text and q_g := 0+ 1+ as the universal pattern.

AND Gates
We define: t_g := t_1 G t_2, p_g := p_1 G p_2, u_g := u_1 G u_2, and q_g := q_1 G q_2, where the indices 1 and 2 refer to the gadgets of the two sub-gates of g.

OR Gates
The texts and the patterns for gate g are defined as follows, where the parentheses are just for grouping and are not part of the text or pattern:

t_g := (u_1 G G u_2) G (u_1 G G u_2) G (t_1 G G t_2) G (u_1 G G u_2) G (u_1 G G u_2)
u_g := (u_1 G G u_2) G (u_1 G G u_2) G (u_1 G G u_2) G (u_1 G G u_2) G (u_1 G G u_2)
q_g := (u_1 G G u_2) G (u_1 G G u_2) G (q_1 G G q_2) G (u_1 G G u_2) G (u_1 G G u_2)
p_g := (u_1 G G u_2 G)+ (q_1 G G p_2) G (p_1 G G q_2) (G u_1 G G u_2)+

▶ Lemma 4.3 (Correctness of the Construction). For all assignments a, b and gates g:
1. F_g(a, b) = true ⟺ t_g(a) ∈ L(p_g(b)),
2. t_g(a) ∈ L(q_g),
3. u_g ∈ L(q_g) ∩ L(p_g(b)).

Proof.
The proofs of the second and third claim follow inductively from the encoding of the gates, and especially from the encoding of the INPUT gate. For the first claim we do a structural induction on the output gate of the formula.
INPUT Gate “ ⇒ ” Follows directly from the definition.
INPUT Gate “⇐” If the gate is not satisfied, then the text contains fewer 0s or 1s than the pattern has to match.
AND Gate “ ⇒ ” Follows directly from the definition.
AND Gate “⇐” By the uniqueness of the binary encoding, the G in the middle of the text and the G in the middle of the pattern have to match. Since the whole text is matched, we get t_1 ∈ L(p_1) and t_2 ∈ L(p_2), and F_g(a, b) is satisfied by the induction hypothesis.

OR Gate “⇒” F_g(a, b) = F_{g_1}(a, b) ∨ F_{g_2}(a, b) = true. Assume w.l.o.g. that F_{g_1}(a, b) = true; the other case is symmetric. Repeat (u_1 G G u_2 G)+ only once and transform q_1 G G p_2 into the second u_1 G G u_2 by our third claim of the lemma. Now p_1 G G q_2 matches t_1 G G t_2 by the second claim and the assumption t_1 ∈ L(p_1). Finally, we match G u_1 G G u_2 G u_1 G G u_2 by two repetitions of (G u_1 G G u_2)+.

OR Gate “⇐” By the uniqueness of the binary encoding there are exactly 14 Gs in the text, while the pattern matches 11 Gs when taking both repetitions once. Since each additional repetition increases this number by 3, exactly one repetition is taken twice. If the first repetition is taken once, the following q_1 G G p_2 has to match the second u_1 G G u_2 in the text. But then p_1 is transformed into t_1, showing that F_{g_1} is satisfied by the inductive hypothesis. The case for the second repetition is symmetric. ◀

Length of the Text and the Pattern.
All texts and patterns for a specific gate only depend on the texts and patterns for the two sub-gates. Thus, we can compute the texts and patterns in a bottom-up manner, and the encoding can be done in time linear in the size of the output. It remains to analyse the length of the texts and the size of the patterns:
▶ Lemma 4.4. |u_r|, |t_r|, |p_r|, |q_r| ∈ O(5^d s log s).

Proof. p_g is obviously smaller than u_g. Since the sizes of u_g, t_g, and q_g are asymptotically equal, it suffices to analyse the length of u_g: |u_g| ≤ 5|u_1| + 5|u_2| + O(log s). Inductively over the d(F_g) levels of F_g, i.e. the depth of F_g, this yields |u_g| ∈ O(5^{d(F_g)} s log s). The factor of s log s is due to the O(s) inner gates, each introducing O(log s) additional symbols. ◀

In the first part of the reduction we have seen how to evaluate a formula on one specific pair of half-assignments. It remains to design a text and a pattern such that such a pair of half-assignments can be chosen. For this let A = {a^(1), ..., a^(n)} be the first set and B = {b^(1), ..., b^(m)} be the second set of half-assignments. Inspired by the reduction in Section 3.4 in the full version of [4] we define the final text and pattern as follows:

t := ⊙_{i=1}^{3n} ( 3u_r 3u_r 3u_r 3t(a^(i)) 3u_r 3u_r 3u_r 3u_r )
p := 3u_r 3u_r 3u_r 3u_r ⊙_{j=1}^{m} ( 3+ (u_r 3)+ u_r 3 q_r 3 p(b^(j)) 3 (u_r 3)+ q_r ) 3u_r 3u_r 3u_r 3u_r

where we set a^(j) = a^(j mod n) for j ∈ [n + 1, 3n]. We call the concatenations in t and p for each i and j the i-th text group and the j-th pattern group, respectively.

▶ Lemma 4.5.
If there are a^(k) and b^(l) such that F(a^(k), b^(l)) = true, then t ∈ M(p).

Proof.
Assume w.l.o.g. that a^(k) and b^(k) satisfy F; otherwise we have to shift the indices for the text and the pattern accordingly in the proof. We match the prefix of p to the suffix of the n-th text group. Then we match the (n+i)-th text group by the i-th pattern group for i = 1, ..., k − 1: both (u_r 3)+ are repeated twice, and the remaining parts are matched in a straightforward way by transforming the q_r's into t(a^(i)) and u_r, and p(b^(i)) into u_r. Then, we match the k-th and (k+1)-th pattern group to the (n+k)-th text group and a part of the (n+k+1)-th text group.

For the last step we shift the groups in the remaining text t' such that it becomes easier to prove which part of the text the remaining pattern matches:

t' = 3u_r 3u_r 3u_r 3u_r ⊙_{i=n+k+2}^{3n} ( 3u_r 3u_r 3u_r 3t(a^(i)) 3u_r 3u_r 3u_r 3u_r )
   = ⊙_{i=n+k+2}^{3n} ( 3u_r 3u_r 3u_r 3u_r 3u_r 3u_r 3u_r 3t(a^(i)) ) 3u_r 3u_r 3u_r 3u_r

For each of the remaining pattern groups the first repetition is taken three times. With this the (n+i)-th group of t and the i-th pattern group are matched in a straightforward way for i = k + 2, ..., m. The suffix of the pattern is matched to the start of the (n+m+1)-th text group in the obvious way. ◀

▶ Lemma 4.6. If t ∈ M(p), then there are a^(k) and b^(l) such that F(a^(k), b^(l)) = true.

Proof.
By the design of the pattern and the text, there must be a j ≤ n such that the prefix of the pattern is matched to the suffix of the j-th text group. Furthermore, the suffix of the pattern has to match the same sequence in some other text group, because nowhere else could the four 3u_r be matched. Thus, not all text groups and pattern groups match each other precisely, and there is a text group k and a pattern group l such that the pattern group does not match the whole text group or matches more than this group. Choose the first of these groups, i.e. the pair with smallest k and l.

Since all prior groups have been matched precisely, the first repetition can be taken at most twice, as otherwise the following u_r could not be transformed into a part of the text. Now assume it is repeated exactly once. Then the following u_r matches the second u_r of the text group. Since 3 is a fresh symbol, q_r has to match the third u_r. But then p(b^(l)) has to be transformed into t(a^(k)), and Lemma 4.3 gives us a satisfying assignment.

It remains to check the case where (u_r 3)+ is repeated twice. Then q_r is transformed into t(a^(k)) and p(b^(l)) is transformed into the fourth u_r. The second repetition has to be taken exactly twice in this case, because otherwise the 33 from the beginning of the next text group could not be matched. But if the pattern (u_r 3)+ is repeated twice, this pattern group is completely matched to a text group, contradicting our assumption. ◀

▶ Lemma 4.7.
The final text has length O(n 5^d s log s) and the pattern has size O(m 5^d s log s).

By this we conclude the proof of Lemma 4.2 for this pattern type.

◦|◦
When taking a closer look at the reduction for ◦+◦, one can see that all subpatterns α+ were only repeated a constant number of times, in fact at most three times. Thus, we can replace every α+ by (α | αα | ααα). This modification changes the size of the patterns p_g, which also dominates the size change for the outer OR.

▶ Lemma 4.8. |u_r|, |t_r|, |q_r| ∈ O(5^d s log s) and |p_r| ∈ O(8^d s log s).

Proof.
Since the size of p_g increased, we get |u_g|, |t_g|, |q_g| ∈ O(|p_g|). We have |p_g| ≤ 6|u_1| + 6|u_2| + |q_1| + |q_2| + |p_1| + |p_2| + O(log s) ≤ 8|p_1| + 8|p_2| + O(log s), and with the same argument as in the proof of Lemma 4.4: |p_g| ∈ O(8^{d(F_g)} s log s). ◀

The correctness follows from the reduction for ◦+◦ and concludes the proof of Lemma 4.2 for ◦|◦.

◦?

To reuse the construction from ◦+◦ for this pattern type, we first observe that the pattern σ+ can be seen as short-hand for σσ*. Hence, the definitions of the INPUT and AND gate can be reused. We also use this idea for the OR gate, simulating (u_1 G G u_2 G)+ by a pattern of type ◦?. For this we introduce the starred version ★v of a text v = v_1 ... v_{|v|}, where we put a Kleene Star on every symbol: ★v := v_1* v_2* ... v_{|v|}*. By this we can reuse t_g, u_g, and q_g, and define for p_g:

p_g := (★u_1 ★G ★G ★u_2) ★G (u_1 G G u_2) G (q_1 G G p_2) G (p_1 G G q_2) G (u_1 G G u_2) ★G (★u_1 ★G ★G ★u_2)

▶ Lemma 4.9 (Correctness of the Construction). For all assignments a, b and gates g:
1. F_g(a, b) = true ⟺ t_g(a) ∈ L(p_g(b)),
2. t_g(a) ∈ L(q_g),
3. u_g ∈ L(q_g) ∩ L(p_g(b)).

Proof.
Again the proofs of the last two claims follow directly from the encoding of the gates. Since the definition of the INPUT and AND gate is the same as for ◦+◦, we only show the inductive step for the OR gate. Recall that the text is defined as

t_g := (u_1 G G u_2) G (u_1 G G u_2) G (t_1 G G t_2) G (u_1 G G u_2) G (u_1 G G u_2).

“⇒” F_g(a, b) = F_{g_1}(a, b) ∨ F_{g_2}(a, b) = true. Assume w.l.o.g. that F_{g_1}(a, b) = true; the other case is symmetric. We match the first sequence of starred symbols to the empty string ε. Then we match u_1 G G u_2 G to each other. By the third claim above we can match u_1 G G u_2 to q_1 G G p_2. By the inductive hypothesis and the second claim we match t_1 G G t_2 to p_1 G G q_2. The remaining part of the text is matched in the canonical way to the pattern, while the starred sequence matches its original text.

“⇐” Observe that the Gs in the pattern have to match Gs in the text and that the text is matched completely. Since the first and last non-starred GG in the pattern have to be matched to a GG in the text, one can easily see that both starred sequences either produce the empty string or u_1 G G u_2 G and G u_1 G G u_2, respectively. Thus it remains to check three different cases:

Exactly one sequence produces the empty string; let it w.l.o.g. be the first one. Then we get that (u_1 G G u_2) G (t_1 G G t_2) has to be matched by (q_1 G G p_2) G (p_1 G G q_2). Since the Gs in the pattern match Gs in the text, we get t_1 ∈ L(p_1) and thus a satisfying assignment.

Both starred sequences produce a non-empty string, i.e. their non-starred version. Then the text contains 5 GG but the pattern has to match 6 GG; a contradiction.

Both starred sequences produce the empty string. Then the remaining text t' has to be matched by the remaining pattern p':

t' = (u_1 G G u_2) G (t_1 G G t_2) G (u_1 G G u_2)
p' = (q_1 G G p_2) G (p_1 G G q_2).

Since the definitions of q_1 and u_1 only differ at the definition of the INPUT gates, we cannot match q_1 to something different than u_1 here.
Hence, u_2 G t_1 G G t_2 G u_1 ∈ L(p_2 G p_1). Since the number of symbol changes for every word in L(p_2 G p_1) is bounded by A(p_2) + A(p_1) + ℓ + 2 with ℓ := A(G), while the text has 2A(u_1) + 2A(u_2) + 4ℓ + 7 symbol changes, we get a contradiction by Claim 4.11. ◀

▶ Definition 4.10 (Symbol Changes). We define A(t) to be the number of symbol changes in the text t: Define A(σ) := 0 for any symbol σ and A(t_1 ... t_{n−1} t_n) := A(t_1 ... t_{n−1}) + ⟦t_{n−1} ≠ t_n⟧. For patterns p we define A(p) := max_{t ∈ L(p)} A(t).

▷ Claim 4.11. A(u_g) = A(t_g) = A(q_g) and 2A(u_g) > A(p_g).

Proof.
We first observe A(u_g) = A(t_g) = A(q_g), since their definitions only differ for the INPUT gate, for which the claim holds. We show the main claim by a structural induction on gate g.

For the INPUT gate we have A(u_g) = A(p_g) = 1 and thus the claim holds. For the AND gate the claim follows directly from the induction hypothesis, since all texts and patterns start with 0 and end with 1. For the OR gate we get A(u_g) = 5A(u_1) + 5A(u_2) + 14A(G) + 18 and thus:

2A(u_g) = 10A(u_1) + 10A(u_2) + 28A(G) + 36
        = 5A(u_1) + 5A(u_2) + 5A(u_1) + 5A(u_2) + 28A(G) + 36
        >_IH 5A(u_1) + 5A(u_2) + 2.5A(p_1) + 2.5A(p_2) + 28A(G) + 36
        > 5A(u_1) + 5A(u_2) + A(p_1) + A(p_2) + 17A(G) + 22 = A(p_g) ◀

With the same arguments as before, we get the following size bounds:
▶ Lemma 4.12. |u_r|, |t_r|, |q_r| ∈ O(5^d s log s) and |p_r| ∈ O(6^d s log s).

For the final construction we define a generalised version of the outer OR that makes use of a helper gadget H that is specific for every type.

▶ Theorem 4.13.
Given t_r(·), u_r, p_r(·), and q_r as above. Let H be a helper gadget with the following properties:
1. L(H) ⊆ L(4+ (3|4)* (0|1|2|4)* (3|4)* 4+),
2. for ℓ := |u_r| + 4: 4^ℓ ∈ L(H) and 4^ℓ u_r 4^ℓ ∈ L(H),
3. |H| ∈ O(|u_r|).
Then we can construct a text t and a pattern p such that t ∈ M(p) if and only if there are a ∈ A, b ∈ B such that F(a, b) = true. Furthermore, |t| = O(n(|u_r| + |t_r|)), |p| = O(m(|u_r| + |p_r| + |q_r|)), and t and p are concatenations of gadgets.

For ◦? we define H := 4 4* ★u_r 4 4*. The proof of the theorem is given in Section 4.6.

◦+|

Again we only change the encoding of the OR gate and reuse the other parts from ◦+◦:

t_g := 0 G (t_1 G G u_2) G (u_1 G G t_2) G 1
u_g := 0 G (u_1 G G u_2) G (u_1 G G u_2) G 1
q_g := 0 G (q_1 G G u_2) G (u_1 G G q_2) G 1
p_g := (0|1|2)+ G (p_1 G G p_2) G (0|1|2)+

▶ Lemma 4.14 (Correctness of the construction). For all assignments a, b and gates g:
1. F_g(a, b) = true ⟺ t_g(a) ∈ L(p_g(b)),
2. t_g(a) ∈ L(q_g),
3. u_g ∈ L(q_g) ∩ L(p_g(b)).

Proof.
Again we only show the inductive step for the OR case of the first claim.

“⇒” F_g(a, b) = F_{g_1}(a, b) ∨ F_{g_2}(a, b) = true. Assume w.l.o.g. that F_{g_1}(a, b) = true; the other case is symmetric. The first repetition is transformed into the initial 0. Then we match p_1 G G p_2 to t_1 G G u_2 by the third claim and the assumption that t_1 ∈ L(p_1). Since the text only consists of symbols from {0, 1, 2}, the suffix u_1 G G t_2 G 1 can be matched by the last repetition.

“⇐” Since the GG in the pattern has to match one of the two GGs in the text, there are only two possible ways how the text can be matched by the pattern. Assume w.l.o.g. that the GG of the pattern matches the first GG of the text. Then the first G of the text and the first G of the pattern match each other. Hence, t_1 ∈ L(p_1) and the induction hypothesis guarantees a satisfying assignment. ◀

▶ Lemma 4.15. |t_r|, |u_r|, |q_r| ∈ O(2^d s log s) and |p_r| ∈ O(s log s).

Proof.
Again we have O(|u_g|) = O(|t_g|) = O(|q_g|). For u_g we get |u_g| ≤ 2|u_1| + 2|u_2| + O(log s), which yields |u_g| ∈ O(2^{d(F_g)} s log s) with the same argument as for the previous size bounds. For p_g we have |p_g| ≤ |p_1| + |p_2| + O(log s) ∈ O(s log s). ◀

We define H := 4+ (3|4)+ (0|1|2|4)+ (3|4)+ 4+ and use Theorem 4.13 to conclude the proof of Lemma 4.2 for this pattern type.

◦|+

To reuse the definitions from the previous sections for the last time, we have to allow unary alternatives. By this we can see a pattern σ+ as a pattern of type |+. This is reasonable since we can replace σ+ by (σ | σ+), which represents exactly the same language as just σ+. One could also use a fresh symbol α which never appears in the text and replace σ+ by (α | σ+).

We introduce the barred version of a text to match the resulting pattern to the original text but also to the repetition of a single symbol.

▶ Definition 4.16 (Barred Version of a Text). Let τ be a symbol and t = t_1 ··· t_n be a text of length n. Define the barred version of t as a pattern of type ◦| as t^τ := (t_1 | τ) ··· (t_n | τ).

We change the encoding of the OR gate to the following:

t_g := 0^{|u_1GGu_2G|+1} (u_1 G G u_2) G (t_1 G G t_2) G (u_1 G G u_2) 1^{|Gu_1GGu_2|+1}
u_g := 0^{|u_1GGu_2G|+1} (u_1 G G u_2) G (u_1 G G u_2) G (u_1 G G u_2) 1^{|Gu_1GGu_2|+1}
q_g := 0^{|u_1GGu_2G|+1} (u_1 G G u_2) G (q_1 G G q_2) G (u_1 G G u_2) 1^{|Gu_1GGu_2|+1}
p_g := 0+ (u_1GGu_2G)^0 (q_1 G G p_2) G (p_1 G G q_2) (Gu_1GGu_2)^1 1+

▶ Lemma 4.17 (Correctness of the construction). For all assignments a, b and gates g:
1. F_g(a, b) = true ⟺ t_g(a) ∈ L(p_g(b)),
2. t_g(a) ∈ L(q_g),
3. u_g ∈ L(q_g) ∩ L(p_g(b)).

Proof.
Again we only show the proof for the OR gate in the first claim.

“⇒” F_g(a, b) = F_{g_1}(a, b) ∨ F_{g_2}(a, b) = true. Assume w.l.o.g. that F_{g_1}(a, b) = true; the other case is symmetric. We match the first barred pattern to a repetition of 0s. Then q_1 G G p_2 matches u_1 G G u_2 by the third claim of the lemma, and p_1 G G q_2 matches t_1 G G t_2 by the induction hypothesis and the second claim of the lemma. The second barred pattern matches its original text, while the repetition of 1s is matched by 1+.

“⇐” Since the whole text has to be matched and the Gs in the pattern have to match Gs in the text, there are three possibilities how the GGs of the pattern can be matched to the GGs in the text:

1. The first GG of the pattern matches the first GG of the text, and the second GG of the text is matched by the second GG of the pattern. This implies u_2 G t_1 ∈ L(p_2 G p_1), and since the G can only match itself, t_1 ∈ L(p_1) and we get a satisfying assignment by the induction hypothesis.

2. The first GG of the pattern matches the first GG of the text, and the second GG of the pattern matches the third GG of the text. We get u_2 G t_1 G G t_2 G u_1 ∈ L(p_2 G p_1). Using the same argument as for ◦?, the number of symbol changes for every word in L(p_2 G p_1) is at most A(p_2) + A(p_1) + A(G) + 2, while the text has 2A(u_1) + 2A(u_2) + 4A(G) + 6 symbol changes. Analogous to Claim 4.11 we can show that this case cannot occur, since 2A(u_g) > A(p_g).

3. The first GG of the pattern matches the second GG of the text, and the third GG of the text is matched by the second GG of the pattern. This case is symmetric to the first case and implies t_2 ∈ L(p_2). ◀

▶ Lemma 4.18. |u_r|, |t_r|, |q_r|, |p_r| ∈ O(5^d s log s).

By defining H := 4+ (3+|4+) (u_r)^4 (3+|4+) 4+ for the outer OR, we finish the proof of Lemma 4.2.

Let A = {a^(1), ..., a^(n)} be the first set and B = {b^(1), ..., b^(m)} be the second set of half-assignments.
Inspired by the reduction in Section 3.6 in the full version of [4] we define the final text and pattern as follows:

t := ⊙_{i=1}^{3n} ( u_r 4^ℓ u_r 4^ℓ t_r(a^(i)) 3 4^ℓ u_r 4^ℓ )
p := 3 ⊙_{j=1}^{m} ( 3+ q_r H 3+ p_r(b^(j)) 3 H 3+ q_r 4^ℓ u_r 4^ℓ ) q_r 4^ℓ u_r 4^ℓ

where ℓ := |u_r| + 4 and a^(j) = a^(j mod n) for j ∈ [n + 1, 3n]. Again we call the concatenations in t and p for each i and j the i-th text group and the j-th pattern group, respectively. Recall that we have the following assumptions for H:
1. L(H) ⊆ L(4+ (3|4)* (0|1|2|4)* (3|4)* 4+),
2. 4^ℓ ∈ L(H) and 4^ℓ u_r 4^ℓ ∈ L(H),
3. |H| ∈ O(|u_r|).

▶ Lemma 4.19.
If there are a^(k) and b^(l) such that F(a^(k), b^(l)) = true, then t ∈ M(p).

Proof.
Assume w.l.o.g. that a^(k) and b^(k) satisfy F; otherwise we have to shift the indices for the text and the pattern in the proof accordingly. We match the i-th pattern group to the i-th text group for i = 1, ..., k − 1, together with the leading 3 of p. The match is performed straightforwardly by matching 4^ℓ to H; p_r(b^(i)) matches u_r and the second q_r matches t_r(a^(i)). Then we match the k-th pattern group to the k-th text group and a part of the (k+1)-th text group, which is again possible by the assumptions on H. We shift the remaining part t' of the text such that it becomes easier to show which part the remaining pattern matches:

t' = 33 t_r(a^(k+1)) 3 4^ℓ u_r 4^ℓ ⊙_{i=k+2}^{3n} ( u_r 4^ℓ u_r 4^ℓ t_r(a^(i)) 3 4^ℓ u_r 4^ℓ )
   = ⊙_{i=k+1}^{3n−1} ( t_r(a^(i)) 3 4^ℓ u_r 4^ℓ u_r 4^ℓ u_r 4^ℓ ) t_r(a^(3n)) 3 4^ℓ u_r 4^ℓ

After this shift we match the i-th pattern group to the i-th text group of t' for i = k + 2, ..., m by matching H to 4^ℓ and the other parts in a straightforward way. Finally, the suffix of the pattern is matched to the prefix of the (m+1)-th text group of t' in the canonical way. ◀

▶ Lemma 4.20. If t ∈ M(p), then there are a^(k) and b^(l) such that F(a^(k), b^(l)) = true.

Proof.
By the design of the pattern and the text, there must be a j ≤ n such that the initial “3” of the pattern is matched to the first “3” of the j-th text group. Furthermore, we know that the suffix of the pattern has to match the suffix of some text group and the following 333. Hence, not all pattern groups match exactly one text group, but some pattern group matches only a prefix of, or more than, one text group. We choose the first of these groups (i.e. the pair with smallest k and l).

Since all prior groups have been matched precisely, q_r has to be transformed into u_r, because “3” is a fresh symbol. Since the 3s have to be aligned, H can only match 4^ℓ u_r 4^ℓ or just 4^ℓ by assumption. In the first case p_r(b^(l)) matches t_r(a^(k)) and we get a satisfying assignment.

Assume for contradiction's sake that H matches 4^ℓ. Then p_r(b^(l)) matches u_r. If the second H matches 4^ℓ, the text group is matched precisely by the pattern group and we have a contradiction. Thus, we can assume that H matches 4^ℓ t_r(a^(k)) 3 4^ℓ. But then the following 33+ in the pattern has to match a single “3”; again a contradiction. ◀

▶ Lemma 4.21. |t| ∈ O(n(|u_r| + |t_r|)) and |p| ∈ O(m(|u_r| + |p_r| + |q_r|)).

This finishes the proof of Theorem 4.13.
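The low-level building blocks used throughout this section are easy to sanity-check with an off-the-shelf regex engine. The sketch below (Python, with the re module as the engine) implements the starred version ★v, the barred version t^τ, and the symbol-change count A(t) from Definition 4.10, plus one illustrative instantiation of the INPUT-gate gadgets; the concrete gadget constants here are a hypothetical choice shaped after the construction, not necessarily the paper's exact encoding, but the assertions mirror the three claims of Lemma 4.3.

```python
import re

def starred(v: str) -> str:
    """Starred version of v: a Kleene Star on every symbol (Section 4.3)."""
    return "".join(c + "*" for c in v)

def barred(t: str, tau: str) -> str:
    """Barred version of t w.r.t. tau: (t_1|tau)...(t_n|tau) (Definition 4.16)."""
    return "".join(f"({c}|{tau})" for c in t)

def symbol_changes(t: str) -> int:
    """A(t): number of adjacent positions with different symbols (Definition 4.10)."""
    return sum(1 for i in range(1, len(t)) if t[i] != t[i - 1])

def member(pattern: str, text: str) -> bool:
    return re.fullmatch(pattern, text) is not None

# Starred version: each symbol of v is repeated zero or more times, in order.
assert member(starred("011"), "") and member(starred("011"), "000111")
assert not member(starred("011"), "101")

# Barred version: matches the original text, the all-tau word, and any mixture,
# always of length exactly |t|.
assert member(barred("011", "x"), "011") and member(barred("011", "x"), "xxx")
assert not member(barred("011", "x"), "01")

# Symbol changes: additive up to one extra change per concatenation seam --
# the counting argument behind Claim 4.11.
assert symbol_changes("0011") == 1 and symbol_changes("0101") == 3

# Illustrative INPUT-gate gadgets (hypothetical constants): the text encodes
# the value of a_i, while the pattern '00+11+' is fixed and demands two 0s.
def t_gate_a(a_i): return "0" + "0" * a_i + "11"
P_A, U, Q = "00+11+", "0011", "0+1+"
assert member(P_A, t_gate_a(1)) and not member(P_A, t_gate_a(0))  # claim 1
assert all(member(Q, t_gate_a(a)) for a in (0, 1))                # claim 2
assert member(Q, U) and member(P_A, U)                            # claim 3
```

Note that re.fullmatch plays the role of membership t ∈ L(p), while re.search would correspond to the matching language M(p).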
Instead of giving all reductions from scratch, we reduce pattern matching to membership and make use of the results in Lemma 4.2. By this we get the same bounds as for pattern matching given in Theorem 4.1. For the remaining pattern type |+|◦ we give a new reduction from scratch, which is necessary due to the missing concatenation as outer operation.

▶ Lemma 5.1 (Reducing Pattern Matching to Membership). Given a text t and a pattern p with type in {◦?, ◦+◦, ◦|◦, ◦+|, ◦|+} over a constant sized alphabet, we can construct a text t' and a pattern p' of the same type as p in linear time such that t ∈ M(p) ⟺ t' ∈ L(p'). Further, |t'| ∈ O(|t|) and |p'| ∈ O(|t| + |p|), except for ◦+|, where we even have |p'| ∈ O(|p|).

Proof for Patterns of Type ◦?. We define t' := t and p' := ★t p ★t, where ★t is the starred text as defined in Section 4.3. Then the claim follows directly. ◀

Proof for Patterns of Type ◦+◦. Let Σ = {1, ..., s} be the alphabet. We first encode every symbol such that we can simulate a universal pattern (i.e. one matching any symbol) by some gadget U of type ◦+. Let f: Σ → Σ^{s+1} be this encoding with f(x) = 1 ··· (x−1) x x (x+1) ··· s. Since we can extend f in the natural way to texts by applying it to every symbol, we can also modify patterns of type ◦+◦ by applying f to every symbol without changing the type. After applying f we still have t ∈ M(p) ⟺ f(t) ∈ M(f(p)).

For the step from pattern matching to membership we set U := 1+ 2+ ··· s+ and R := 1 2 ··· s. Obviously R ∈ L(U) and f(σ) ∈ L(U) for all σ ∈ Σ. But we also get
R ∉ L(f(σ)), since R does not contain a repetition of σ. Finally, we define t' := R^{|t|+1} f(t) R^{|t|+1} and p' := R+ U^{|t|} f(p) U^{|t|} R+. We claim t ∈ M(p) ⟺ t' ∈ L(p').

“⇒” If t ∈ M(p), then there is a substring t̂ of t matched by p. By the above observations, f(p) matches f(t̂), which is a substring of f(t). Then we use U^{|t|} to match the unmatched suffix and prefix of f(t) and a part of R^{|t|+1}. The remaining repetitions of R are matched by the R+ at the beginning and the end.

“⇐” If t' ∈ L(p'), then f(p) has to match some substring of f(t), because R cannot be matched by the above observation. ◀

Proof for Patterns of Type ◦|◦. Let L := 2^{⌈log |t|⌉} ∈ O(|t|). For a set S of symbols, we also write S for the pattern representing the alternative of all symbols in S. Let a be a new symbol:

t' := a^{3L−1} t a^{3L−1}
p' := ⊙_{i=0}^{log L} (a^{2^i} | a^{2^{i+1}}) (Σ ∪ {a})^L p (Σ ∪ {a})^L ⊙_{i=0}^{log L} (a^{2^i} | a^{2^{i+1}})

This increases the size of the pattern by an additive term of

O(|Σ| · L) + O( ∑_{i=0}^{log L} (2^i + 2^{i+1}) ) = O(|Σ| L + L) = O(|Σ| L).

“⇒” If t ∈ M(p), then there is a substring t_i ··· t_j of t that is matched by p. Thus we can match (Σ ∪ {a})^L p (Σ ∪ {a})^L such that the first L − i + 1 and the last L − n + j repetitions of Σ ∪ {a} are matched to a's. Hence there remain at least 2L − 1 a's as prefix and suffix. We match them to ⊙_{i=0}^{log L} (a^{2^i} | a^{2^{i+1}}) as follows. Allowing empty strings in our pattern, we can rewrite the concatenation as

⊙_{i=0}^{log L} (a^{2^i} | a^{2^{i+1}}) ≡ ⊙_{i=0}^{log L} a^{2^i} (ε | a^{2^i}) ≡ a^{2L−1} ⊙_{i=0}^{log L} (ε | a^{2^i}).

Thus, we can ignore the first part of the pattern, since it always matches the first and last 2L − 1 a's. It remains to show that the concatenation can match a^z for all z ∈ [0, L − 1]. This holds since the i-th bit of z contributes 2^i to the sum: we choose ε in the pattern above if and only if the i-th bit of z is zero.

“⇐” If t' ∈ L(p'), we know that p matched some substring of t, since p cannot match a's. ◀

Proof for Patterns of Type ◦+|. Define t' := 1 t 1 and p' := Σ+ p Σ+. The claim follows directly, since a Kleene Plus matches at least one symbol. ◀

Proof for Patterns of Type ◦|+. We define t' := 1^{|t|+1} t 1^{|t|+1} and p' := 1+ Σ^{|t|} p Σ^{|t|} 1+.

“⇒” If t ∈ M(p), then p matches the corresponding part in t'. The unmatched prefix of t is matched by the sequence of alternatives, while the remaining 1s in t' are matched by 1+; symmetrically for the suffix.

“⇐” If t' ∈ L(p'), then p has to match some part of t, because the prefix and suffix of p' match at least |t| + 1 symbols each. ◀

|+|◦

Even though the remaining hard pattern type |+|◦ does not have a concatenation as outer operation, we can still show a similar lower bound as for the other types.

▶ Theorem 5.2. |+|◦-membership cannot be solved in time O(nm / log^c n) for a sufficiently large constant c, even for constant sized alphabets, unless FPH is false.

To prove the theorem it suffices to show that
Formula-Pair can be reduced to membership with a text of length n · s^{1+o(1)} and a pattern of size m · s^{1+o(1)} (after depth reduction). Then the claim directly follows from the definition of FPH as for the other types.

Idea of the Reduction.
As for the other lower bounds, we first encode the evaluation of the formula on two fixed half-assignments. We define for each gate g a text t_g and two dictionaries D^M_g and D^S_g of words. The final dictionary for a gate g is defined as D_g := ⋃_{g' ∈ F_g} (D^S_{g'} ∪ D^M_{g'}). The final pattern is D_r^+, where r is the root of F.

D^M_g corresponds to p_g and allows us to match the whole text t_g if the formula is satisfied. The texts of the sub-gates are then matched by the corresponding dictionaries. But for the OR gate we have to be able to ignore the evaluation of one sub-formula. For this we define the set D^S_g, which corresponds to q_g and allows us to match the text independently of the assignments. As the main idea, we include the path from the root of the formula to the current gate in the encoding. This trace is appended to the text as a prefix and, in reverse, as a suffix. The words in D^M_g for OR gates g allow us to jump to a gate in such a trace of exactly one sub-formula. Then we use corresponding words from D^S to propagate this jump to the sub-formulas. Because the included trace started at the root, we can proceed to the INPUT gates. There we add words to accept all evaluations of the gate. For the way back up we add the corresponding words in reverse to the dictionaries.

We make sure that these words are used only at one specific position by embedding the encoding of the corresponding gate in the trace. Since the gate number can be made unique, these words can only be used at one specific position. This procedure allows us to write down the words as a set and not as a concatenation as for the other reductions.

Encoding the Formula.
We identify each gate g with its ID, i.e. an integer in [2s]. Let ⟨g⟩ be the binary encoding of the gate ID with ⌊log s⌋ + 2 = Θ(log s) bits, padded with zeros if necessary. Further, let h_0, h_1, ..., h_d be the path from the root r = h_0 of F to the gate g = h_d of depth d ≥ 0. To simplify notation we define ⟨g⟩_i := 2⟨h_i⟩⟨g⟩2, i.e. the encoding of the gate on the path and the gate where the path ends.
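This encoding is easy to picture in code. A small illustrative sketch (the helper names and the exact bit-width handling are our own, consistent with the Θ(log s) padded encoding just defined):

```python
from math import floor, log2

def enc(gate_id: int, s: int) -> str:
    """Padded binary encoding <g> of a gate ID, using floor(log s) + 2 bits."""
    width = floor(log2(s)) + 2
    return format(gate_id, "b").zfill(width)

def trace_token(h_i: int, g: int, s: int) -> str:
    """<g>_i := 2 <h_i> <g> 2 -- the i-th token of the root-to-g trace."""
    return "2" + enc(h_i, s) + enc(g, s) + "2"

s = 8                          # formula size; gate IDs live in [2s]
assert enc(5, s) == "00101"    # floor(log2(8)) + 2 = 5 bits
tok = trace_token(1, 5, s)     # root h_0 = 1, target gate g = 5
assert tok == "2" + "00001" + "00101" + "2"
# All tokens have the same length, so traces align position by position,
# and the embedded gate IDs make every token unique to its gate and level.
assert len(trace_token(7, 13, s)) == len(tok)
```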
INPUT Gates
We set D^S_g := { ⟨g⟩_i ··· ⟨g⟩_d 0 ⟨g⟩_d ··· ⟨g⟩_i, ⟨g⟩_i ··· ⟨g⟩_d 1 ⟨g⟩_d ··· ⟨g⟩_i | i ∈ [d] }.
For F_g(a, b) = a_i, we set t_g := ⟨g⟩_0 ··· ⟨g⟩_d a_i ⟨g⟩_d ··· ⟨g⟩_0 and D^M_g := { ⟨g⟩_0 ··· ⟨g⟩_d 1 ⟨g⟩_d ··· ⟨g⟩_0 }.
For F_g(a, b) = b_i, we set t_g := ⟨g⟩_0 ··· ⟨g⟩_d 1 ⟨g⟩_d ··· ⟨g⟩_0 and D^M_g := { ⟨g⟩_0 ··· ⟨g⟩_d b_i ⟨g⟩_d ··· ⟨g⟩_0 }.

AND Gate
We define the text and the corresponding dictionaries as follows:

t_g := ⟨g⟩_0 ··· ⟨g⟩_d t_1 t_2 ⟨g⟩_d ··· ⟨g⟩_0
D^M_g := { ⟨g⟩_0 ··· ⟨g⟩_d, ⟨g⟩_d ··· ⟨g⟩_0 }
D^S_g := { ⟨g⟩_i ··· ⟨g⟩_d ⟨g_1⟩_0 ··· ⟨g_1⟩_{i−1}, ⟨g_1⟩_{i−1} ··· ⟨g_1⟩_0 ⟨g_2⟩_0 ··· ⟨g_2⟩_{i−1}, ⟨g_2⟩_{i−1} ··· ⟨g_2⟩_0 ⟨g⟩_d ··· ⟨g⟩_i | i ∈ [d] }

OR Gate
We define the text and the additional dictionaries for g as:

t_g := ⟨g⟩_0 ··· ⟨g⟩_d t_1 ⟨g⟩_d t_2 ⟨g⟩_d ··· ⟨g⟩_0
D^M_g := { ⟨g⟩_0 ··· ⟨g⟩_d, ⟨g⟩_d ⟨g_2⟩_0 ··· ⟨g_2⟩_{d+1}, ⟨g_2⟩_{d+1} ··· ⟨g_2⟩_0 ⟨g⟩_d ··· ⟨g⟩_0 }
      ∪ { ⟨g⟩_0 ··· ⟨g⟩_d ⟨g_1⟩_0 ··· ⟨g_1⟩_{d+1}, ⟨g_1⟩_{d+1} ··· ⟨g_1⟩_0 ⟨g⟩_d, ⟨g⟩_d ··· ⟨g⟩_0 }
D^S_g := { ⟨g⟩_i ··· ⟨g⟩_d ⟨g_1⟩_0 ··· ⟨g_1⟩_{i−1}, ⟨g_1⟩_{i−1} ··· ⟨g_1⟩_0 ⟨g⟩_d ⟨g_2⟩_0 ··· ⟨g_2⟩_{i−1}, ⟨g_2⟩_{i−1} ··· ⟨g_2⟩_0 ⟨g⟩_d ··· ⟨g⟩_i | i ∈ [d] }

▶ Lemma 5.3.
For all assignments a, b and gates g:
1. t_g(a) ∈ L( ⟨g⟩_0 ··· ⟨g⟩_{i−1} (D_g(b))^+ ⟨g⟩_{i−1} ··· ⟨g⟩_0 ) for all i ∈ [0, d],
2. t_g(a) ∉ L( ⟨g⟩_0 ··· ⟨g⟩_{i−1} (D_g(b))^+ ⟨g⟩_{j−1} ··· ⟨g⟩_0 ) for all i ≠ j ∈ [0, d],
where ⟨g⟩_0 ··· ⟨g⟩_{−1} and ⟨g⟩_{−1} ··· ⟨g⟩_0 denote the empty string.

Proof.
The first claim follows by a structural induction on the output gate, using only words from D^S_g for the current gate g. Likewise, we show the second claim by a structural induction on the output gate.

INPUT Gate
The statement holds by the definition of the dictionary.
AND Gate
Assume the claim is false for g. We can only match the “prefix” ⟨g⟩_i ··· ⟨g⟩_d with the word ⟨g⟩_i ··· ⟨g⟩_d ⟨g_1⟩_0 ··· ⟨g_1⟩_{i−1}, and analogously for the “suffix”. The joining part of t_1 t_2 has to be matched by some ⟨g_1⟩_{k−1} ··· ⟨g_1⟩_0 ⟨g_2⟩_0 ··· ⟨g_2⟩_{k−1} for k ∈ [0, d] (possibly the empty string). Hence, t_1 ∈ L( ⟨g_1⟩_0 ··· ⟨g_1⟩_{i−1} (D_{g_1}(b))^+ ⟨g_1⟩_{k−1} ··· ⟨g_1⟩_0 ) and t_2 ∈ L( ⟨g_2⟩_0 ··· ⟨g_2⟩_{k−1} (D_{g_2}(b))^+ ⟨g_2⟩_{j−1} ··· ⟨g_2⟩_0 ). But from i ≠ j it follows that k ≠ i or k ≠ j, and we have a contradiction to the induction hypothesis for g_1 or g_2.

OR Gate
The “prefix” ⟨g⟩_i ··· ⟨g⟩_d has to be matched by ⟨g⟩_i ··· ⟨g⟩_d ⟨g_1⟩_0 ··· ⟨g_1⟩_{i−1}, and analogously for the “suffix”. If the joining part of t_1 ⟨g⟩_d t_2 was matched by ⟨g_1⟩_{k−1} ··· ⟨g_1⟩_0 ⟨g⟩_d ⟨g_2⟩_0 ··· ⟨g_2⟩_{k−1} for some k ∈ [d], the same proof as for the AND gate applies. Otherwise, either ⟨g_1⟩_{d+1} ··· ⟨g_1⟩_0 ⟨g⟩_d or ⟨g⟩_d ⟨g_2⟩_0 ··· ⟨g_2⟩_{d+1} was used. Let it w.l.o.g. be the first one. Since i ∈ [0, d], we have i ≠ d + 1, and hence a contradiction to the inductive hypothesis for g_1. ◀

▶ Lemma 5.4 (Correctness of the Construction). For all assignments a, b and gates g: F_g(a, b) = true ⟺ t_g(a) ∈ L((D_g(b))^+).

Proof.
We prove the claim by an induction on the output gate.
INPUT Gate
Follows directly from the construction of the text and the dictionary.
AND Gate “ ⇒ ” We can use D +1 and D +2 to match t and t by the induction hypothesis,respectively. The remaining parts are matched by the words in D Mg . AND Gate “ ⇐ ” The initial and last h g of the text have to be matched. Since the gate g ispart of the encoding, we can only use words from D Mg for this. It follows directly that t is matched by words from D because the initial h g has to be matched too and thewords in D Sg are not eligible for this. The same argument shows that t is matched bywords from D . Hence, the claim follows by the induction hypothesis. OR Gate “ ⇒ ” Assume w.l.o.g. that F g ( a, b ) = true , the other case is symmetric. Wematch the prefix of t g in the obvious way by the corresponding word from D Mg . Byassumption we match t with words from D . The prefix h g . . . h g d of t is matchedby the corresponding word in D Mg . By the first claim of the previous lemma, we have t ∈ L ( h g . . . h g d ( D g ) + h g d . . . h g ) and the remaining suffix can be matched by thecorresponding word from D Mg . OR Gate “ ⇐ ” By Lemma 5.3 the joining part of t h gd t has to be matched by either h gd h g . . . h g d or h g d . . . h g h gd . Let it w.l.o.g. be the first one. Then t has to be matchedby words from D again by the lemma. The inductive hypothesis gives us a satisfyingassignment. (cid:74) (cid:73) Lemma 5.5.
We have the following size bounds:

|t_r| ∈ O(s d log s) ⊆ O(s log² s),  |D_r| ∈ O(s d) ⊆ O(s log s),  and |x| ∈ O(d log s) ⊆ O(log² s) for all x ∈ D_r.

Proof. The lemma follows directly from the definitions and the observations that |t_g| ≤ |t_{g_1}| + |t_{g_2}| + O(d log s), |D^M_g| ∈ O(1), and |D^S_g| ∈ O(d). ◀

Outer OR.
Let A = { a^{(1)}, …, a^{(n)} } be the first set and B = { b^{(1)}, …, b^{(m)} } be the second set of half-assignments. Again we encode A by the text and B by the pattern. For this we observe that the first step of the reduction produced a pattern of type +|◦. Thus, we can use the outer alternative to encode the outer OR that selects a specific b^{(j)}. To match the whole text, we blow up the text and the pattern and pad each symbol with three new symbols such that we can distinguish between the following three matching states: (1) ignore the padding and match a part of the original text to the original pattern, i.e. we evaluate the formula on two half-assignments; (2) match an arbitrary prefix, i.e. the symbols before the actual match in state (1); (3) match some arbitrary suffix, i.e. the symbols after the actual match from state (1). We allow a change between these states only at the end of a text group and require that we go through all three states if and only if the text can be matched by the pattern.

▶ Definition 5.6 (Blow-Up of a Text). Let t = t_1 ⋯ t_n be a text of length n and let u be some arbitrary string. We define t ⇑ u := u t_1 u t_2 ⋯ u t_n and extend it in the natural way to sets of strings.

Using this we define the final text and pattern as follows:

t := 563 ⨀_{i=1}^{n} ( (t_r(a^{(i)}) 3) ⇑ 456 )    p := p_1^+ | p_2^+ | ⋯ | p_m^+    p_j := 5604 | ⋯ | ( D_r(b^{(j)}) ⇑ 456 ) | ⋯

▶ Lemma 5.7.
If there are a^{(k)} and b^{(l)} such that F(a^{(k)}, b^{(l)}) = true, then t ∈ L(p).

Proof. It suffices to show that we can match t to p_l^+. The prefix of t and the first k − 1 text groups are matched by repetitions of the words 56x for the appropriate symbols x, while the last three symbols of the (k − 1)-st group are matched by 563, switching to the first state. Then t_r(a^{(k)}) ⇑ 456 ∈ L( (D_r(b^{(l)}) ⇑ 456)^+ ) by Lemma 5.4. The following 456345 is matched by the corresponding pattern, while the remaining symbols of the text are matched in a straightforward way by repetitions of 6x. ◀

▶ Lemma 5.8. If t ∈ L(p), then there are a^{(k)} and b^{(l)} such that F(a^{(k)}, b^{(l)}) = true.

Proof.
By the structure of the pattern we can already fix l. As there is no way to match the text using only words of the form 56x and 6x45, the word 563 must have been used at the end of some group to switch to the first state. Hence, let the k-th text group be the first group not matched by words of the form 56x4. Observe that we cannot directly switch to an application of 6x45, and thus get t_r(a^{(k)}) ⇑ 456 ∈ L( (D_r(b^{(l)}) ⇑ 456)^+ ). Since the blow-up 456 always matches itself, we can ignore it and get t_r(a^{(k)}) ∈ L( D_r(b^{(l)})^+ ), proving the claim by Lemma 5.4. ◀

▶ Corollary 5.9.
The final text has length O(n s d log s) ⊆ O(n s log² s) and the pattern has size O(m s d log s) ⊆ O(m s log² s). This finishes the proof of Theorem 5.2. ◀
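The core step in Lemmas 5.4 and 5.8 is testing membership in a language of the form L(D^+) for a dictionary D, and Lemma 5.8 additionally peels the blow-up of Definition 5.6 off again. A minimal sketch of both operations (the function names and the tiny example dictionary are illustrative and not from the paper):

```python
def blow_up(t, u):
    # Definition 5.6: t ⇑ u pads every symbol of t with the string u.
    return "".join(u + c for c in t)

def in_dict_plus(t, D):
    # Is t in L(D^+), i.e. a concatenation of one or more dictionary words?
    # Standard word-break dynamic programming over text positions 0..|t|.
    reachable = [True] + [False] * len(t)
    for j in range(1, len(t) + 1):
        reachable[j] = any(
            reachable[j - len(w)] and t[j - len(w):j] == w
            for w in D if len(w) <= j
        )
    return len(t) > 0 and reachable[len(t)]

D = {"ab", "b"}
assert in_dict_plus("abbab", D)      # decomposes as ab | b | ab
assert not in_dict_plus("aab", D)
# Blowing up text and dictionary with the same padding preserves matching,
# mirroring the "ignore the blow-up" step in the proof of Lemma 5.8:
assert in_dict_plus(blow_up("abbab", "456"), {blow_up(w, "456") for w in D})
```

The quadratic-looking double loop is exactly the object whose log-factor improvements the lower bounds of this section address.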
References

1. Amir Abboud and Karl Bringmann. Tighter connections between formula-SAT and shaving logs. In Ioannis Chatzigiannakis, Christos Kaklamanis, Dániel Marx, and Donald Sannella, editors, 45th International Colloquium on Automata, Languages, and Programming, ICALP 2018, volume 107 of LIPIcs, pages 8:1–8:18. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2018. Full version: arXiv:1804.08978. doi:10.4230/LIPIcs.ICALP.2018.8.
2. Amir Abboud, Richard Ryan Williams, and Huacheng Yu. More applications of the polynomial method to algorithm design. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, San Diego, CA, USA, January 4-6, 2015, pages 218–230. SIAM, 2015. doi:10.1137/1.9781611973730.17.
3. Alfred V. Aho and Margaret J. Corasick. Efficient string matching: An aid to bibliographic search. Commun. ACM, 18(6):333–340, 1975. doi:10.1145/360825.360855.
4. Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? In Irit Dinur, editor, IEEE 57th Annual Symposium on Foundations of Computer Science, FOCS 2016, 9-11 October 2016, Hyatt Regency, New Brunswick, New Jersey, USA, pages 457–466. IEEE Computer Society, 2016. Full version: arXiv:1511.07070. doi:10.1109/FOCS.2016.56.
5. Philip Bille and Mikkel Thorup. Faster regular expression matching. In Susanne Albers, Alberto Marchetti-Spaccamela, Yossi Matias, Sotiris E. Nikoletseas, and Wolfgang Thomas, editors, Automata, Languages and Programming, 36th International Colloquium, ICALP 2009, Rhodes, Greece, July 5-12, 2009, Proceedings, Part I, volume 5555 of Lecture Notes in Computer Science, pages 171–182. Springer, 2009. doi:10.1007/978-3-642-02927-1_16.
6. Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun. ACM, 13(7):422–426, 1970. doi:10.1145/362686.362692.
7. Maria Luisa Bonet and Samuel R. Buss. Size-depth tradeoffs for Boolean formulae. Inf. Process. Lett., 49(3):151–155, 1994. doi:10.1016/0020-0190(94)90093-0.
8. Karl Bringmann, Allan Grønlund, and Kasper Green Larsen. A dichotomy for regular expression membership testing. In Chris Umans, editor, 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 307–318. IEEE Computer Society, 2017. Full version: arXiv:1611.00918. doi:10.1109/FOCS.2017.36.
9. Timothy M. Chan and Ryan Williams. Deterministic APSP, orthogonal vectors, and more: Quickly derandomizing Razborov-Smolensky. In Robert Krauthgamer, editor, Proceedings of the Twenty-Seventh Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2016, Arlington, VA, USA, January 10-12, 2016, pages 1246–1255. SIAM, 2016. doi:10.1137/1.9781611974331.ch87.
10. Ruiwen Chen, Valentine Kabanets, and Nitin Saurabh. An improved deterministic #SAT algorithm for small De Morgan formulas. In Mathematical Foundations of Computer Science 2014 - 39th International Symposium, MFCS 2014, Budapest, Hungary, August 25-29, 2014. Proceedings, Part II, volume 8635 of Lecture Notes in Computer Science, pages 165–176. Springer, 2014. doi:10.1007/978-3-662-44465-8_15.
11. Richard Cole and Ramesh Hariharan. Verifying candidate matches in sparse and wildcard matching. In John H. Reif, editor, Proceedings on 34th Annual ACM Symposium on Theory of Computing, May 19-21, 2002, Montréal, Québec, Canada, pages 592–601. ACM, 2002. doi:10.1145/509907.509992.
12. Theodore Johnson, S. Muthukrishnan, and Irina Rozenbaum. Monitoring regular expressions on out-of-order streams. In Rada Chirkova, Asuman Dogac, M. Tamer Özsu, and Timos K. Sellis, editors, Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15-20, 2007, pages 1315–1319. IEEE Computer Society, 2007. doi:10.1109/ICDE.2007.369001.
13. Kenrick Kin, Björn Hartmann, Tony DeRose, and Maneesh Agrawala. Proton: multitouch gestures as regular expressions. In Joseph A. Konstan, Ed H. Chi, and Kristina Höök, editors, CHI Conference on Human Factors in Computing Systems, CHI '12, Austin, TX, USA, May 05-10, 2012, pages 2885–2894. ACM, 2012. doi:10.1145/2207676.2208694.
14. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM J. Comput., 6(2):323–350, 1977. doi:10.1137/0206024.
15. Ilan Komargodski, Ran Raz, and Avishay Tal. Improved average-case lower bounds for De Morgan formula size. In 54th Annual IEEE Symposium on Foundations of Computer Science, FOCS 2013, pages 588–597. IEEE Computer Society, 2013. doi:10.1109/FOCS.2013.69.
16. David Landsman. RNP-1, an RNA-binding motif is conserved in the DNA-binding cold shock domain. Nucleic Acids Research, 20(11):2861–2864, 1992. doi:10.1093/nar/20.11.2861.
17. Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path expressions. In Peter M. G. Apers, Paolo Atzeni, Stefano Ceri, Stefano Paraboschi, Kotagiri Ramamohanarao, and Richard T. Snodgrass, editors, VLDB 2001, Proceedings of 27th International Conference on Very Large Data Bases, September 11-14, 2001, Roma, Italy, pages 361–370. Morgan Kaufmann, 2001.
18. Makoto Murata. Extended path expressions for XML. In Peter Buneman, editor, Proceedings of the Twentieth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, May 21-23, 2001, Santa Barbara, California, USA. ACM, 2001. doi:10.1145/375551.375569.
19. Eugene W. Myers. A four Russians algorithm for regular expression pattern matching. J. ACM, 39(2):430–448, 1992. doi:10.1145/128749.128755.
20. Gonzalo Navarro and Mathieu Raffinot. Fast and simple character classes and bounded gaps pattern matching, with applications to protein searching. Journal of Computational Biology, 10(6):903–923, 2003. PMID: 14980017. doi:10.1089/106652703322756140.
21. Rahul Santhanam. Fighting perebor: New and improved algorithms for formula and QBF satisfiability. In 51st Annual IEEE Symposium on Foundations of Computer Science, FOCS 2010, pages 183–192. IEEE Computer Society, 2010. doi:10.1109/FOCS.2010.25.
22. Philipp Schepper. Fine-grained complexity of regular expression pattern matching and membership. In Fabrizio Grandoni, Grzegorz Herman, and Peter Sanders, editors, 28th Annual European Symposium on Algorithms, ESA 2020, volume 173 of LIPIcs, pages 80:1–80:20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2020. doi:10.4230/LIPIcs.ESA.2020.80.
23. Ken Thompson. Regular expression search algorithm. Commun. ACM, 11(6):419–422, 1968. doi:10.1145/363347.363387.
24. Richard Ryan Williams. The polynomial method in circuit complexity applied to algorithm design (invited talk). In Venkatesh Raman and S. P. Suresh, editors, 34th International Conference on Foundation of Software Technology and Theoretical Computer Science, FSTTCS 2014, volume 29 of LIPIcs, pages 47–60. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2014. doi:10.4230/LIPIcs.FSTTCS.2014.47.
25. Fang Yu, Zhifeng Chen, Yanlei Diao, T. V. Lakshman, and Randy H. Katz. Fast and memory-efficient regular expression matching for deep packet inspection. In Laxmi N. Bhuyan, Michel Dubois, and Will Eatherton, editors, Proceedings of the 2006 ACM/IEEE Symposium on Architecture for Networking and Communications Systems, ANCS 2006, San Jose, California, USA, December 3-5, 2006, pages 93–102. ACM, 2006. doi:10.1145/1185347.1185360.

A FSH implies FPH
We use the following relation between Formula-SAT and Formula-Pair to show that FSH implies FPH:
▶ Lemma A.1 (Weak version of Lemma B.2 in the full version of [1]). An instance of Formula-SAT on a De Morgan formula of size s over n variables can be reduced to an instance of Formula-Pair with a monotone De Morgan formula of size O(s) and two sets of size O(2^{n/2}) in linear time.

Proof Idea. Let F be the formula for Formula-SAT on n variables and of size s. We define F' to be the same formula as F, but each leaf is labeled with a different variable and we remove the negations from the leaves.

For all half-assignments x to the first half of the variables of F we construct a new half-assignment a_x for F' as follows: Let l be a leaf in F with a variable from the first half of the inputs and let l' be the corresponding variable/leaf in F'. We set a_x[l'] = true if and only if l evaluates to true under x. We construct the set B analogously for the second half of the inputs of F. Since F has n inputs, this results in 2^{n/2} assignments each for A and B. ◀

▶ Lemma A.2.
FSH implies FPH.
Proof.
Assume FSH holds and FPH is false for some fixed k ≥ 1. Let F be a formula for Formula-SAT on N inputs of size s = N^{1+1/(4k)}, where we assume s ∈ ℕ. By Lemma A.1 we transform F into a monotone De Morgan formula F' of size O(s) and two sets with n, m ∈ O(2^{N/2}) assignments. We run the algorithm for Formula-Pair on this instance to contradict FSH:

O( (n · m · s^k / log^{k+2} n) · N^{o(1)} )
⊆ O( 2^{N/2} · 2^{N/2} · s^k · N^{o(1)} / log^{k+2}(2^{N/2}) )
= O( 2^N · N^{k+0.25+o(1)} / (N^{k+2} · (1/2)^{k+2}) )
= O( 2^N · N^{k+1.25} / N^{k+2} )
= O( 2^N / N^{0.75} )

See the following paragraph for the additional factor of N^{o(1)}. ◀

Following Abboud and Bringmann [1], we use the
Word-RAM model as our computational model. The word size of the machine is fixed to Θ(log N) bits for input size N. Likewise we assume that several operations (e.g. AND, OR, NOT, addition, multiplication, …) can be performed in time O(1).

While this is sufficient for our reductions, we also need that the operations are robust to a change of the word size to state FPH. As in [1] we require that we can simulate the operations on words of size Θ(log N) on a machine with word size Θ(log log N) in time (log N)^{o(1)}.

In the above proof the input size increased from N to n = 2^{Θ(N)}. Hence, we have to simulate the algorithm for Formula-Pair with word size Θ(log n) = Θ(N) on a machine with word size Θ(log N) to get an algorithm for Formula-SAT. Thus, the running time slows down by a factor of (log n)^{o(1)} = N^{o(1)}.

B Correctness of the Graph Construction for +|◦+-Membership

We show the correctness of the graph construction given in the proof of Theorem 3.1, Item 2.
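The two claims below argue about reachability from v_0 to v_n in a graph whose nodes are the positions of the text. For the simplified case where every alternative p_k is a plain string (so no run carries a Kleene Plus and the flags f, e of Lemma 3.6 are always 0), the construction can be sketched as follows (illustrative names, not the paper's implementation):

```python
from collections import deque

def reaches_v_n(t, alternatives):
    # Nodes 0..n are the positions between symbols of t. For every position i
    # and every alternative p with t[i:i+len(p)] == p we add the edge
    # (v_i, v_{i+len(p)}); then t ∈ L((p_1|...|p_k)^+) if and only if v_n is
    # reachable from v_0 (for non-empty t).
    n = len(t)
    edges = {i: [] for i in range(n + 1)}
    for i in range(n + 1):
        for p in alternatives:
            j = i + len(p)
            if j <= n and t[i:j] == p:
                edges[i].append(j)
    seen, queue = {0}, deque([0])
    while queue:              # breadth-first search from v_0
        for j in edges[queue.popleft()]:
            if j not in seen:
                seen.add(j)
                queue.append(j)
    return n > 0 and n in seen

assert reaches_v_n("abcab", ["ab", "c"])      # ab | c | ab
assert not reaches_v_n("abca", ["ab", "c"])
```

The claims extend this picture by the extra node copies v^1_i and the additional edges that handle runs matched by a Kleene Plus.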
▷ Claim B.1. If t ∈ L(p), then there is a path from v_0 to v_n.

Proof. Assume p = (p_1 | ⋯ | p_k)^+. Since t ∈ L(p), we can decompose t into t = τ_1 ⋯ τ_ℓ such that for all l ∈ [ℓ] we have τ_l ∈ L(p_{k_l}) for some k_l ∈ [k]. Define λ_l = |τ_1 ⋯ τ_l| as the length of the first l parts of t, for all l ∈ [ℓ]. We claim that if τ_1 ⋯ τ_l ∈ L(p), then there is a path from v_0 to v_{λ_l}.

For l = 0 the claim is vacuously true, as ε ∉ L(p). Now assume the claim holds for an arbitrary but fixed l. We define i = λ_l + 1 and j = λ_{l+1} to simplify notation and get τ_{l+1} = t_i ⋯ t_j. From τ_{l+1} ∈ L(p_{k_{l+1}}) and Lemma 3.6 we know (f, i', j', e) ∈ M for some i ≤ i' ≤ j' ≤ j. Further, f and e are set to 1 if and only if the first and last run of p_{k_{l+1}} contains a Kleene Plus, respectively. Hence, v^e_{j'} is reachable from v^f_{i'-1}. Now it suffices to show that (1) v^f_{i'-1} is reachable from v_{i-1} and (2) v_j is reachable from v^e_{j'}. Then the claim follows inductively, as v_{i-1} is reachable from v_0.

We first show (1). If f = 0, we must have i = i' and the claim holds. Thus assume f = 1. We know τ_{l+1} = t_i ⋯ t_j ∈ L(p_{k_{l+1}}) and t_{i'} ⋯ t_{j'} ∈ L(p_{k_{l+1}}). As the first run of p_{k_{l+1}} contains a Kleene Plus, the symbols t_i, t_{i+1}, …, t_{i'} are all equal. That is, they form a run from i to i'. By the construction of the graph, there are edges (v_{i-1}, v_i), …, (v_{i'-2}, v_{i'-1}). But there is also the additional edge (v_{i'-1}, v^1_{i'-1}), proving (1). By a symmetric argument one can show claim (2). ◁

▷ Claim B.2.
If there is a path from v_0 to v_n, then t ∈ L(p).

Proof. First observe that it is not possible to reach v_n from v_0 without using edges introduced by tuples in M. Now fix some path P from v_0 to v_n and let P_1, …, P_ℓ be the edges on the path that are introduced by tuples in M. Let P_l = (v^{f_l}_{i'_l - 1}, v^{e_l}_{j'_l}), i.e. (f_l, i'_l, j'_l, e_l) ∈ M. Assume j'_0 = 0 and i'_{ℓ+1} = n + 1 in the following to simplify notation. For each tuple there are two indices i_l and j_l such that j_{l-1} ≤ i_l − 1 ≤ i'_l − 1 ≤ j'_l ≤ j_l ≤ i_{l+1} − 1 and P goes through v_{i_l − 1} and v_{j_l}. These nodes exist, as every path from v^{e_{l-1}}_{j'_{l-1}} to v^{f_l}_{i'_l − 1} has to go through some node v_r. We have j_l + 1 = i_{l+1} for all l ∈ [0, ℓ], with j_0 = 0 and i_{ℓ+1} = n + 1, and hence t = t_{i_1} ⋯ t_{j_1} t_{i_2} ⋯ t_{j_2} ⋯ t_{i_ℓ} ⋯ t_{j_ℓ}. Thus, it suffices to show that for every l ∈ [ℓ] there is a k ∈ [k] such that t_{i_l} ⋯ t_{j_l} ∈ L(p_k).

We fix l in the following and omit it as an index to simplify notation. By the construction of the graph we have (f, i', j', e) ∈ M and hence by Lemma 3.6 t_{i'} ⋯ t_{j'} ∈ L(p_k) for some k ∈ [k]. We extend this result and claim t_i ⋯ t_{j'} ∈ L(p_k). Recall that there is a path from v_{i-1} to v^f_{i'-1} in P. If f = 0, then i = i' and the claim follows. Otherwise, we know that the first run of p_k contains a Kleene Plus for some symbol α. As no edge resulting from a tuple in M can be chosen, the edge (v_{i'-1}, v^1_{i'-1}) is contained in the path P. By the construction of the graph, the sequence t_i ⋯ t_{i'} is contained in some run β^c. But α = β, and we get t_i ⋯ t_{i'-1} t_{i'} ⋯ t_{j'} ∈ L(p_k). We can apply the symmetric argument to show that t_i ⋯ t_{j'} t_{j'+1} ⋯ t_j ∈ L(p_k), proving the claim. ◁

C Graphical Representation of the Results for Membership

Figure 3: The classification of the patterns starting with | for membership. The red bounds are shown in this paper while the blue ones follow as corollaries. [Diagram omitted.]

Figure 4: The classification of the patterns starting with +, ?, or ◦ for membership. [Diagram omitted.]