A Trichotomy for Regular Trail Queries
Wim Martens
University of Bayreuth, Germany
Matthias Niewerth
University of Bayreuth, Germany
Tina Trautner
University of Bayreuth, Germany
Abstract
Regular path queries (RPQs) are an essential component of graph query languages. Such queries consider a regular expression $r$ and a directed edge-labeled graph $G$ and search for paths in $G$ for which the sequence of labels is in the language of $r$. In order to avoid having to consider infinitely many paths, some database engines restrict such paths to be trails, that is, they only consider paths without repeated edges. In this paper we consider the evaluation problem for RPQs under trail semantics, in the case where the expression is fixed. We show that, in this setting, there exists a trichotomy. More precisely, the complexity of RPQ evaluation divides the regular languages into the finite languages, the class $T_{\text{tract}}$ (for which the problem is tractable), and the rest. Interestingly, the tractable class in the trichotomy is larger than for the trichotomy for simple paths, discovered by Bagan et al. [5]. In addition to this trichotomy result, we also study characterizations of the tractable class, its expressivity, the recognition problem, closure properties, and show how the decision problem can be extended to the enumeration problem, which is relevant to practice.

Information systems → Query languages for non-relational engines; Information systems → Information retrieval query processing; Theory of computation → Problems, reductions and completeness; Theory of computation → Regular languages
Keywords and phrases
Regular languages, query languages, path queries, graph databases, databases, complexity, trails, simple paths
Graph databases are a popular tool to model, store, and analyze data [25, 34, 27, 36, 12]. They are engineered to make the connectedness of data easier to analyze. This is indeed a desirable feat, since some of today's largest companies have become so successful because they understood how to use the connectedness of the data in their specific domain (e.g., Web search and social media). One aspect of graph databases is to bring tools for analyzing connectedness to the masses.

Regular path queries (RPQs) are a crucial component of graph databases, because they allow reasoning about arbitrarily long paths and, in particular, paths that are longer than the size of the query. A regular path query essentially consists of a regular expression $r$ and is evaluated on a graph database which, for the purpose of this paper, we view as an edge-labeled directed graph $G$. When evaluated, the RPQ $r$ searches for paths in $G$ for which the sequence of labels is in the language of $r$. The return type of the query varies: whereas most academic research on RPQs [23, 6, 7, 21, 3] and SPARQL [35] focus on the first and last node of matching paths, Cypher [26] returns the entire paths. G-Core, a recent proposal by partners from industry and academia, sees paths as "first-class citizens" in graph databases [2].

In addition, there is a large variation on which types of paths are considered. Popular options are all paths, simple paths, trails, and shortest paths. In this paper, we focus on trails, which are paths in which every edge can appear at most once. This variant is the default in Cypher 9 [26, 14], but did not receive much attention from the research community yet.
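To make trail semantics concrete, the following is a minimal brute-force sketch. It is not the algorithm developed in this paper: it simply explores edge sequences depth-first, tracks the state of a DFA for the query, and never reuses an edge, so it may take exponential time. The names `find_trail` and `Edge`, and the dictionary encoding of the DFA, are assumptions made for this illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, label, target node)


def find_trail(edges: Set[Edge],
               dfa: Dict[Tuple[int, str], int],
               start_state: int,
               accepting: Set[int],
               s: str,
               t: str) -> Optional[List[Edge]]:
    """Exhaustively search for a trail from s to t whose label word the DFA accepts.

    The DFA is a partial transition table: a missing entry means "reject".
    """
    # Group outgoing edges by source node (sorted for a deterministic search order).
    outgoing: Dict[str, List[Edge]] = {}
    for e in sorted(edges):
        outgoing.setdefault(e[0], []).append(e)

    def dfs(node, state, used, trail):
        if node == t and state in accepting:
            return list(trail)
        for e in outgoing.get(node, []):
            nxt = dfa.get((state, e[1]))
            if e in used or nxt is None:
                continue  # a trail must not repeat an edge; the DFA must not reject
            used.add(e)
            trail.append(e)
            found = dfs(e[2], nxt, used, trail)
            if found is not None:
                return found
            trail.pop()
            used.remove(e)
        return None

    return dfs(s, start_state, set(), [])
```

As a usage example, take the query $abab$ on the graph with edges s -a-> v, v -b-> w, w -a-> v, v -b-> t. The only matching path visits node v twice but uses every edge once, so it is a trail and not a simple path.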
In this paper, we study the data complexity of RPQ evaluation under trail semantics. That is, we study variants of RPQ evaluation in which the RPQ $r$ is considered to be fixed. As such, the input of the problem only consists of an edge-labeled graph $G$ and a pair $(s, t)$ of nodes and we are asked if there exists a trail from $s$ to $t$ on which the sequence of labels matches $r$. One of our main results is a trichotomy on the RPQs for which this problem is in AC$^0$, NL-complete, or NP-complete, respectively. By $T_{\text{tract}}$, we refer to the class of tractable languages (assuming NP $\neq$ NL).

In order to increase our understanding of $T_{\text{tract}}$, we study several important aspects of this class of languages. A first set of results is on characterizations of $T_{\text{tract}}$ in terms of closure properties and syntactic and semantic conditions on their finite automata. In a second set of results, we compare the expressiveness of $T_{\text{tract}}$ with yardstick languages such as FO$^2$[<], FO$^2$[<,+], FO[<] (or aperiodic languages), and $SP_{\text{tract}}$. The latter class, $SP_{\text{tract}}$, is the closely related class of languages for which the data complexity of RPQ evaluation under simple path semantics is tractable. Interestingly, $T_{\text{tract}}$ is strictly larger than $SP_{\text{tract}}$ and includes languages outside $SP_{\text{tract}}$ such as $a^*bc^*$ and $(ab)^*$ that are relevant in application scenarios in network problems, genomic datasets, and tracking provenance information of food products [29] and were recently discovered to appear in public query logs [10, 9]. Furthermore, every single-occurrence regular expression [8] defines a language in $T_{\text{tract}}$, which can be a convenient guideline for users of graph databases, since single-occurrence (every alphabet symbol occurs at most once) is a very simple syntactical property. It is also popular in practice: we analyzed the 50 million RPQs found in the logs of [11] and discovered that over 99.8% of the RPQs are single-occurrence.

We then study the recognition problem for $T_{\text{tract}}$, that is: given an automaton, does its language belong to $T_{\text{tract}}$? This problem is NL-complete (resp., PSPACE-complete) if the input automaton is a DFA (resp., NFA). We also treat closure under common operations such as union, intersection, reversal, quotients and morphisms.

We conclude by showing that the enumeration problem is also tractable for $T_{\text{tract}}$. By tractable, we mean that the paths that match the RPQ can be enumerated with only polynomial delay between answers. Technically, this means that it is not enough to solve a decision variant of the RPQ evaluation problem: we also need to find witnessing paths. We prove that the algorithms for the decision problems can be extended to return shortest paths. This insight can be combined with Yen's Algorithm [37] to give a polynomial delay enumeration algorithm.

Related Work.
RPQs on graph databases have been studied since the end of the 80's and are now finding their way into commercial products. The literature usually considers the variant of RPQ evaluation where one is given a graph database $G$, nodes $s, t$, and an RPQ $r$, and then needs to decide if $G$ has a path from $s$ to $t$ (possibly with loops) that matches $r$. For arbitrary and shortest paths, this problem is well-known to be tractable, since it boils down to testing intersection emptiness of two NFAs.

Mendelzon and Wood [23] studied the problem for simple paths, which are paths without node repetitions. They observed that the problem is already NP-complete for regular expressions $a^*ba^*$ and $(aa)^*$. These two results rely heavily on the work of Fortune et al. [13] and LaPaugh and Papadimitriou [19].

Our work is most closely related to the work of Bagan et al. [5] who, like us, studied the complexity of RPQ evaluation where the RPQ is fixed. They proved a trichotomy for the case where the RPQ should only match simple paths. In this paper we will refer to the tractable class of their trichotomy as $SP_{\text{tract}}$, since it contains the languages for which the simple path problem is tractable, whereas we are interested in a class for trails. Martens and Trautner [22] refined this trichotomy of Bagan et al. [5] for simple transitive expressions, by analyzing the complexity where the input consists of both the expression and the graph.

Footnote: Bagan et al. [5] called the class $C_{\text{tract}}$, which stands for "tractable class". We distinguish between $SP_{\text{tract}}$ and $T_{\text{tract}}$ here to avoid confusion between simple paths and trails.

Figure 1: Directed, edge-labeled graphs that have a trail from $s$ to $t$.

Trails versus Simple Paths.
We conclude with a note on the relationship between simple paths and trails. For many computational problems, the complexities of dealing with simple paths or trails are the same due to two simple reductions, namely: (1) constructing the line graph or (2) splitting each node into two, see for example Perl and Shiloach [28, Theorem 2.1 and 2.2]. As soon as we consider labeled graphs, the line graph technique still works, but not the nodes-splitting technique, because the labels on paths change. As a consequence, we know that finding trails is at most as hard as finding simple paths, but we do not know if it has the same complexity when we require that they match a certain RPQ $r$.

In this paper we show that the relationship is strict, assuming NL $\neq$ NP. An easy example is the language $(ab)^*$, which is NP-hard for simple paths [19, 23], but (assuming that $a$-edges and $b$-edges are different) in NL for trails. This is because every path from $s$ to $t$ that matches $(ab)^*$ can be reduced to a trail from $s$ to $t$ that matches $(ab)^*$ by removing loops (in the path, not in the graph) that match $(ab)^*$ or $(ba)^*$. In Figure 1 we depict four small graphs, all of which have trails from $s$ to $t$. (In the two rightmost graphs, there is exactly one path labeled $(ab)^*$, which is also a trail.)

We use $[n]$ to denote the set of integers $\{1, \ldots, n\}$. By $\Sigma$ we always denote a finite alphabet, i.e., a finite set of symbols. We always denote symbols by $a$, $b$, $c$, $d$ and their variants, like $a'$, $a_1$, $b_1$, etc. A word is a finite sequence $w = a_1 \cdots a_n$ of symbols.

We consider edge-labeled directed graphs $G = (V, E)$, where $V$ is a finite set of nodes and $E \subseteq V \times \Sigma \times V$ is a set of (labeled) edges. A path $p$ from node $s$ to $t$ is a sequence $(v_1, a_1, v_2)(v_2, a_2, v_3) \cdots (v_m, a_m, v_{m+1})$ with $v_1 = s$ and $v_{m+1} = t$ and such that $(v_i, a_i, v_{i+1}) \in E$ for each $i \in [m]$. By $|p|$ we denote the number of edges of a path.
A path is a trail if all the edges $(v_i, a_i, v_{i+1})$ are different and a simple path if all the nodes $v_i$ are different. (Notice that each simple path is a trail but not vice versa.) We denote $a_1 \cdots a_m$ by $\mathrm{lab}(p)$. Given a language $L \subseteq \Sigma^*$, path $p$ matches $L$ if $\mathrm{lab}(p) \in L$. For a subset $E' \subseteq E$, path $p$ is $E'$-restricted if every edge of $p$ is in $E'$. Given a trail $p$ and two edges $e_1$ and $e_2$ in $p$, we denote the subpath of $p$ from $e_1$ to $e_2$ by $p[e_1, e_2]$.

Let $L$ be a regular language. We denote by $A_L = (Q_L, \Sigma, i_L, F_L, \delta_L)$ the (complete) minimal DFA for $L$ and by $N$ the number $|Q_L|$ of states. Strongly connected components of (the graph of) $A_L$ are simply called components. By $\delta_L(q, w)$ we denote the state reachable from state $q$ by reading $w$. We denote by $q_1 \rightsquigarrow q_2$ that state $q_2$ is reachable from $q_1$. Finally, $L_q$ denotes the set of all words accepted from $q$. For every state $q$, we denote by $\mathrm{Loop}(q)$ the set $\{w \in \Sigma^+ \mid \delta_L(q, w) = q\}$ of all non-empty words that allow to loop on $q$. For a word $w$ and a language $L$, we define $wL = \{ww' \mid w' \in L\}$ and $w^{-1}L = \{w' \mid ww' \in L\}$. A language $L$ is aperiodic if and only if $\delta_L(q, w^{N+1}) = \delta_L(q, w^N)$ for every state $q$ and word $w$. (There are many characterizations of aperiodic languages [31].)

We study the regular trail query (RTQ) problem for a regular language $L$.

  RTQ($L$)
  Given: A graph $G = (V, E)$ and $(s, t) \in V \times V$.
  Question: Is there a trail from $s$ to $t$ that matches $L$?

A similar problem, which was studied by Bagan et al. [5], is the RSPQ problem. The RSPQ($L$) problem asks if there exists a simple path from $s$ to $t$ that matches $L$.

In this section, we define and characterize a class of languages of which we will prove that it is exactly the class of regular languages $L$ for which RTQ($L$) is tractable (if NL $\neq$ NP).

It is instructive to first discuss the case of downward closed languages. A language $L$ is downward closed (DC) if it is closed under taking subsequences. That is, for every word $w = a_1 \cdots a_n \in L$ and every sequence $0 < i_1 < \cdots < i_k < n + 1$ of integers, we have that $a_{i_1} \cdots a_{i_k} \in L$. Perhaps surprisingly, downward closed languages are always regular [16]. Furthermore, they can be defined by a clean class of regular expressions (which was shown by Jullien [18] and later rediscovered by Abdulla et al. [1]), which is defined as follows.

◮ Definition 3.1. An atomic expression over $\Sigma$ is an expression of the form $(a + \varepsilon)$ or of the form $(a_1 + \cdots + a_n)^*$, where $a, a_1, \ldots, a_n \in \Sigma$. A product is a (possibly empty) concatenation $e_1 \cdots e_n$ of atomic expressions $e_1, \ldots, e_n$. A simple regular expression is of the form $p_1 + \cdots + p_n$, where $p_1, \ldots, p_n$ are products.

Another characterization is by Mendelzon and Wood [23], who show that a regular language $L$ is downward closed if and only if its minimal DFA $A_L = (Q_L, \Sigma, i_L, F_L, \delta_L)$ exhibits the suffix language containment property, which says that if state $\delta_L(q_1, a) = q_2$ for some symbol $a \in \Sigma$, then we have $L_{q_2} \subseteq L_{q_1}$. Since this property is transitive, it is equivalent to require that $L_{q_2} \subseteq L_{q_1}$ for every state $q_2$ that is reachable from $q_1$.

◮ Theorem 3.2 [1, 16, 18, 23].
The following are equivalent:
(1) $L$ is a downward closed language.
(2) $L$ is definable by a simple regular expression.
(3) The minimal DFA of $L$ exhibits the suffix language containment property.

Obviously, RTQ($L$) is tractable for every downward closed language $L$, since it is equivalent to deciding if there exists a path from $s$ to $t$ that matches $L$. For the same reason, deciding if there is a simple path from $s$ to $t$ that matches $L$ is also tractable for downward closed languages. However, there are languages that are not downward closed for which we show RTQ($L$) to be tractable, such as $a^*bc^*$ and $(ab)^*$. For these two languages, the simple path variant of the problem is intractable.

Footnote (on the suffix language containment property): Mendelzon and Wood [23] restrict $q_1, q_2$ to be on paths from $s$ to some state in $F_L$, but the property trivially holds for $q_2$ being a sink-state.

The following definitions are the basis of the class of languages for which RTQ($L$) is tractable.

◮ Definition 3.3.
An NFA $A$ satisfies the left-synchronized containment property if there exists an $n \in \mathbb{N}$ such that the following implication holds: If $q_1, q_2 \in Q_A$ such that $q_1 \rightsquigarrow q_2$ and if $w_1 \in \mathrm{Loop}(q_1)$, $w_2 \in \mathrm{Loop}(q_2)$ with $w_1 = a w_1'$ and $w_2 = a w_2'$, then we have $w_2^n L_{q_2} \subseteq L_{q_1}$. Similarly, $A$ satisfies the right-synchronized containment property if the same condition holds with $w_1 = w_1' a$ and $w_2 = w_2' a$.

◮ Definition 3.4.
A regular language $L$ is closed under left-synchronized power abbreviations (resp., closed under right-synchronized power abbreviations) if there exists an $n \in \mathbb{N}$ such that for all words $w_\ell, w_m, w_r \in \Sigma^*$ and all words $w_1 = a w_1'$ and $w_2 = a w_2'$ (resp., $w_1 = w_1' a$ and $w_2 = w_2' a$) we have that $w_\ell w_1^n w_m w_2^n w_r \in L$ implies $w_\ell w_1^n w_2^n w_r \in L$.

We note that Definition 3.4 is equivalent to requiring that there exists an $n \in \mathbb{N}$ such that the implication holds for all $i \geq n$. The reason is that, given $i > n$ and a word of the form $w_\ell w_1^i w_m w_2^i w_r$, we can write it as $w_\ell' w_1^n w_m w_2^n w_r'$ with $w_\ell' = w_\ell w_1^{i-n}$ and $w_r' = w_2^{i-n} w_r$, for which the implication holds by Definition 3.4. The resulting word $w_\ell' w_1^n w_2^n w_r'$ is exactly $w_\ell w_1^i w_2^i w_r$.

Next, we show that all conditions defined in Definitions 3.3 and 3.4 are equivalent for DFAs.

◮ Theorem 3.5.
For a regular language $L$ with minimal DFA $A_L$, the following are equivalent:
(1) $A_L$ satisfies the left-synchronized containment property.
(2) $A_L$ satisfies the right-synchronized containment property.
(3) $L$ is closed under left-synchronized power abbreviations.
(4) $L$ is closed under right-synchronized power abbreviations.

In Theorem 4.1 we will show that, if NL $\neq$ NP, the languages $L$ that satisfy the above properties are precisely those for which RTQ($L$) is tractable. To simplify terminology, we will henceforth refer to this class as $T_{\text{tract}}$.

◮ Definition 3.6.
A regular language $L$ belongs to $T_{\text{tract}}$ if $L$ satisfies one of the equivalent conditions in Theorem 3.5.

For example, $(ab)^*$ and $(abc)^*$ are in $T_{\text{tract}}$, whereas $a^*ba^*$, $(aa)^*$ and $(aba)^*$ are not. The following property immediately follows from the definition of $T_{\text{tract}}$.

◮ Observation 3.7.
Every regular expression for which each alphabet symbol under a Kleene star occurs at most once in the expression defines a language in $T_{\text{tract}}$.

A special case of these expressions are those in which every alphabet symbol occurs at most once. These are known as single-occurrence regular expressions (SORE) [8]. SOREs were studied in the context of learning schema languages for XML [8], since they occur very often in practical schema languages.
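As a rough illustration, both syntactic conditions above can be checked with a few lines of Python. The sketch assumes a simplified expression syntax that is not taken from the paper: symbols are single lowercase ASCII letters, concatenation is juxtaposition, `+` is union, `*` is the Kleene star, and parentheses group. All function names are hypothetical.

```python
import string
from collections import Counter
from typing import List, Set, Tuple

SYMBOLS = set(string.ascii_lowercase)  # assumed alphabet: single lowercase letters


def _starred_groups(expr: str) -> List[Tuple[int, int]]:
    """Return (open, close) index pairs of parenthesised groups followed by '*'."""
    stack: List[int] = []
    groups: List[Tuple[int, int]] = []
    for i, ch in enumerate(expr):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            start = stack.pop()
            if expr[i + 1:i + 2] == "*":
                groups.append((start, i))
    return groups


def symbols_under_star(expr: str) -> Set[str]:
    """Symbols that occur in the scope of at least one Kleene star."""
    expr = expr.replace(" ", "")
    groups = _starred_groups(expr)
    under: Set[str] = set()
    for i, ch in enumerate(expr):
        if ch in SYMBOLS:
            directly_starred = expr[i + 1:i + 2] == "*"
            if directly_starred or any(lo < i < hi for lo, hi in groups):
                under.add(ch)
    return under


def is_single_occurrence(expr: str) -> bool:
    """True if every alphabet symbol occurs at most once (a SORE)."""
    counts = Counter(ch for ch in expr if ch in SYMBOLS)
    return all(n == 1 for n in counts.values())


def satisfies_observation_3_7(expr: str) -> bool:
    """True if every symbol under a Kleene star occurs only once in the expression."""
    counts = Counter(ch for ch in expr if ch in SYMBOLS)
    return all(counts[a] == 1 for a in symbols_under_star(expr))
```

The observation only gives a sufficient condition, so a negative answer from this check does not by itself place the language outside $T_{\text{tract}}$. For instance, the expression `(a+a)*` fails the check although it denotes the downward closed language $a^*$.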
As we have seen before, regular expressions in which every symbol occurs at most once define languages in $T_{\text{tract}}$. We will define a similar notion on automata.

◮ Definition 3.8.
A component $C$ of some NFA $A$ is called memoryless, if for each symbol $a \in \Sigma$, there is at most one state $q$ in $C$, such that there is a transition $(p, a, q)$ with $p$ in $C$.

The following theorem provides (in a non-trivial proof that requires several steps) a syntactic condition for languages in $T_{\text{tract}}$. The syntactic condition is item (4) of the theorem, which we define after its statement. Condition (5) imposes an additional restriction on condition (4), and we later use it to prove that $T_{\text{tract}} \subseteq$ FO$^2$[<,+].

◮ Theorem 3.9.
For a regular language $L$, the following properties are equivalent:
(1) $L \in T_{\text{tract}}$.
(2) There exists an NFA $A$ for $L$ that satisfies the left-synchronized containment property.
(3) There exists an NFA $A$ for $L$ that satisfies the left-synchronized containment property and only has memoryless components.
(4) There exists a detainment automaton for $L$ with consistent shortcuts.
(5) There exists a detainment automaton for $L$ with consistent shortcuts and only memoryless components.

We use finite automata with counters or CNFAs from Gelade et al. [15], that we slightly adapt to make the construction easier. For convenience, we provide a full definition in Appendix A. Let $A$ be a CNFA with one counter $c$. Initially, the counter has value 0. The automaton has transitions of the form $(q_1, a, P; q_2, U)$ where $P$ is a precondition on $c$ and $U$ an update operation on $c$. For instance, the transition $(q_1, a, c = 5; q_2, c := c - 1)$ means: if $A$ is in state $q_1$, reads $a$, and the value of $c$ is five, then it can move to $q_2$ and decrease $c$ by one. If we decrease a counter with value zero, its value remains zero. We denote the precondition that is always fulfilled by $\text{true}$.

We say that $A$ is a detainment automaton if, for every component $C$ of $A$:
- every transition inside $C$ is of the form $(q_1, a, \text{true}; q_2, c := c - 1)$;
- every transition that leaves $C$ is of the form $(q_1, a, c = 0; q_2, c := k)$ for some $k \in \mathbb{N}$.

Intuitively, if a detainment automaton enters a non-trivial component $C$, then it must stay there for at least some number of steps, depending on the value of the counter $c$. The counter $c$ is decreased for every transition inside $C$ and the automaton can only leave $C$ once $c = 0$.

We say that $A$ has consistent jumps (the property called consistent shortcuts in Theorem 3.9) if, for every pair of components $C_1$ and $C_2$, if $C_1 \rightsquigarrow C_2$ and there are transitions $(p_i, a, \text{true}; q_i, c := c - 1)$ inside $C_i$ for both $i \in \{1, 2\}$, then there is also a transition $(p_1, a, P; q_2, U)$ for some $P \in \{\text{true}, c = 0\}$ and some update $U$. We note that $C_1$ and $C_2$ may be the same component. The consistent jump property is the syntactical counterpart of the left-synchronized containment property. The memoryless condition carries over naturally to CNFAs, ignoring the counter.

Proof sketch of Theorem 3.9.
The implications (3) ⇒ (2) and (5) ⇒ (4) are trivial. We sketch the proofs of (1) ⇒ (5) ⇒ (3) and (4) ⇒ (2) ⇒ (1) below, establishing the theorem.

(1) ⇒ (5) uses a very technical construction that essentially exploits that, if the automaton stays in the same component for a long time, the reached state only depends on the last $N^2$ symbols read in the component. This is formalized in Lemma 4.3 and allows us to merge any pair of two states $p, q$ which contradict that some component is memoryless. To preserve the language, words that stay in some component $C$ for less than $N^2$ symbols have to be dealt with separately, essentially avoiding the component altogether. Finally, the left-synchronized containment property allows us to simply add transitions required to satisfy the consistent jumps property without changing the language.

(5) ⇒ (3) and (4) ⇒ (2): We convert a given CNFA to an NFA by simulating the counter (which is bounded) in the set of states. The consistent jump property implies the left-synchronized containment property on the resulting NFA. The property that all components are memoryless is preserved by the construction.

(2) ⇒ (1): One can show that the left-synchronized containment property is invariant under the powerset construction. ◭

Footnote (on the adapted CNFAs): The adaptation is that we let counters decrease instead of increase. Furthermore, it only needs zero-tests.
Footnote (on transitions of a detainment automaton that set $c := k$): If $q_2$ is in a trivial component, then $k$ should be 0 for the transition to be useful.
Footnote (on consistent jumps): The values of $P$ and $U$ depend on whether $C_1$ is the same as $C_2$ or not.

Figure 2: Expressiveness of subclasses of the aperiodic languages. (Diagram relating DC, $SP_{\text{tract}}$, $T_{\text{tract}}$, FO$^2$[<], FO$^2$[<,+], and the aperiodic languages (= FO[<]), with example languages.)

We compare $T_{\text{tract}}$ to some closely related and yardstick languages to get an idea of its expressiveness. From Theorem 3.5 we can conclude that every language in $T_{\text{tract}}$ is aperiodic. Furthermore, every downward closed (DC) language is in $T_{\text{tract}}$, since the definition of $T_{\text{tract}}$ relaxes the suffix language containment property.

Bagan et al. [5] introduced the class $SP_{\text{tract}}$, which characterizes the class of regular languages $L$ for which the regular simple path query (RSPQ) problem is tractable.

◮ Theorem 3.10 (Theorem 2 in Bagan et al. [5]).
Let $L$ be a regular language.
(1) If $L$ is finite, then RSPQ($L$) $\in$ AC$^0$.
(2) If $L \in SP_{\text{tract}}$ and $L$ is infinite, then RSPQ($L$) is NL-complete.
(3) If $L \notin SP_{\text{tract}}$, then RSPQ($L$) is NP-complete.

One characterization of $SP_{\text{tract}}$ is the following (Theorem 4 in [5]):

◮ Definition 3.11. $SP_{\text{tract}}$ is the set of regular languages $L$ such that there exists an $i \in \mathbb{N}$ for which the following holds: for all $w_\ell, w, w_r \in \Sigma^*$ and $w_1, w_2 \in \Sigma^+$ we have that, if $w_\ell w_1^i w w_2^i w_r \in L$, then $w_\ell w_1^i w_2^i w_r \in L$.

From this definition it is easy to see that every language in $SP_{\text{tract}}$ is also in $T_{\text{tract}}$, since our definition imposes an extra "synchronizing" condition on $w_1$ and $w_2$, namely that they share the same first (or last) symbol (Definition 3.4). We now fully classify the expressiveness of $T_{\text{tract}}$ and $SP_{\text{tract}}$ compared to yardsticks as DC, FO$^2$[<], and FO$^2$[<,+] (see also Figure 2).

◮ Theorem 3.12.
(a) DC $\subsetneq SP_{\text{tract}} \subsetneq$ (FO$^2$[<] $\cap\, T_{\text{tract}}$)
(b) $T_{\text{tract}} \subsetneq$ FO$^2$[<,+]
(c) $T_{\text{tract}}$ and FO$^2$[<] are incomparable

Since FO$^2$[<,+] $\subsetneq$ FO[<], we also have $T_{\text{tract}} \subsetneq$ FO[<].

Footnote (on $SP_{\text{tract}}$): Bagan et al. [5] called the class $C_{\text{tract}}$, which stands for "tractable class". We distinguish between $SP_{\text{tract}}$ and $T_{\text{tract}}$ here to avoid confusion between simple paths and trails.

This section is devoted to the proof of the following theorem.

◮ Theorem 4.1.
Let $L$ be a regular language.
(1) If $L$ is finite then RTQ($L$) $\in$ AC$^0$.
(2) If $L \in T_{\text{tract}}$ and $L$ is infinite, then RTQ($L$) is NL-complete.
(3) If $L \notin T_{\text{tract}}$, then RTQ($L$) is NP-complete.

We will start with (1). Clearly, we can express every finite language $L$ as an FO-formula. Since we can also test in FO that no edge is used more than once, the graphs for which RTQ($L$) holds are FO-definable. This implies that RTQ($L$) is in AC$^0$.

We now sketch the proof of Theorem 4.1(2). We note that we define several concepts (trail summary, local edge domains, admissible trails) that have a natural counterpart for simple paths in Bagan et al.'s proof of the trichotomy for simple paths [5, Theorem 2]. However, the underlying proofs of the technical lemmas are quite different. For instance, strongly connected components of languages in $SP_{\text{tract}}$ behave similarly to $A^*$ for some $A \subseteq \Sigma$, while components of languages in $T_{\text{tract}}$ are significantly more complex. Indeed, the trichotomy for trails leads to a strictly larger class of tractable languages.

For the remainder of this section, we fix the constant $K = N^2$. We first describe the NL algorithm. Then we observe that, if the algorithm answers "yes", we can also output a shortest trail. We will show that in the case where $L$ belongs to $T_{\text{tract}}$, we can identify a number of edges that suffice to check if the path is (or can be transformed into) a trail that matches $L$. This number of edges only depends on $L$ and is therefore constant for the RTQ($L$) problem. These edges will be stored in a path summary. We will define path summaries formally and explain how to use them to check whether a trail between the input nodes that matches $L$ exists.

To this end, we need a few definitions. Let $A = (Q, \Sigma, I, F, \delta)$ be an NFA. We extend $\delta$ to paths, in the sense that we denote by $\delta(q, p)$ the set of states that $A$ can reach from $q$ after reading $\mathrm{lab}(p)$.
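The extension of $\delta$ to paths can be sketched as follows. The sketch assumes that the transition relation of the NFA is stored as a dictionary from (state, symbol) pairs to sets of states. The names `label`, `delta_path`, and this encoding are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]                # (source node, label, target node)
Delta = Dict[Tuple[str, str], Set[str]]    # (state, symbol) -> successor states


def label(path: List[Edge]) -> List[str]:
    """lab(p): the sequence of edge labels along the path."""
    return [a for (_, a, _) in path]


def delta_path(delta: Delta, q: str, path: List[Edge]) -> Set[str]:
    """delta(q, p): all states the NFA can reach from q after reading lab(p)."""
    current = {q}
    for a in label(path):
        # One step of the usual subset simulation of an NFA.
        current = {r for p in current for r in delta.get((p, a), set())}
    return current
```

Only the labels of the path matter here. The nodes of the graph play no role in the computation of $\delta(q, p)$.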
For $q_0 \in Q$, we say that a run from $q_0$ of $A$ over a path $p = (v_1, a_1, v_2)(v_2, a_2, v_3) \cdots (v_m, a_m, v_{m+1})$ is a sequence $q_0 \cdots q_m$ of states such that $q_i \in \delta(q_{i-1}, a_i)$, for every $i \in [m]$. When $q_0 \in I$, we also simply call it a run of $A$ over $p$.

◮ Definition 4.2.
Let $p = e_1 \cdots e_m$ be a path and $r = q_0 \cdots q_m$ the run of $A_L$ over $p$. For a set $C$ of states of $A_L$, we denote by $\mathit{left}_C$ the first edge $e_i$ with $q_{i-1} \in C$ and by $\mathit{right}_C$ the last edge $e_j$ with $q_j \in C$. A component $C$ of $A_L$ is a long run component of $p$ if $|p[\mathit{left}_C, \mathit{right}_C]| > K$.

Next, we want to reduce the amount of information that we require for trails. To this end, we use the following synchronization property for $A_L$.

◮ Lemma 4.3.
Let $L \in T_{\text{tract}}$, let $C$ be a component of $A_L$, let $q_1, q_2 \in C$, and let $w$ be a word of length $N^2$. If $\delta_L(q_1, w) \in C$ and $\delta_L(q_2, w) \in C$, then $\delta_L(q_1, w) = \delta_L(q_2, w)$.

The lemma encourages the use of summaries.

◮ Definition 4.4.
Let Cuts denote the set of components of $A_L$ and $\mathrm{Abbrv} = \mathrm{Cuts} \times (V \times Q) \times E^K$. A component abbreviation $(C, (v, q), e_K \cdots e_1) \in \mathrm{Abbrv}$ consists of a component $C$, a node $v$ of $G$ and state $q \in C$ to start from, and $K$ edges $e_K \cdots e_1$. A trail $\pi$ matches a component abbreviation, denoted $\pi \models (C, (v, q), e_K \cdots e_1)$, if $\delta_L(q, \pi) \in C$, it starts at $v$, and its suffix is $e_K \cdots e_1$. We write $\pi \models_{E'} (C, (v, q), e_K \cdots e_1)$ if $\pi \models (C, (v, q), e_K \cdots e_1)$ and all edges of $\pi$ are from $E' \cup \{e_1, \ldots, e_K\}$. For convenience, we write $e \models_{\emptyset} e$.

If $p$ is a trail, then the summary $S_p$ of $p$ is the sequence obtained from $p$ by replacing, for each long run component $C$, the subsequence $p[\mathit{left}_C, \mathit{right}_C]$ by the abbreviation $(C, (v, q), p_{\text{suff}})$, where $v$ is the source node of the edge $\mathit{left}_C$, $q$ is the state $A_L$ is in immediately before reading $\mathit{left}_C$, and $p_{\text{suff}}$ is the suffix of length $K$ of $p[\mathit{left}_C, \mathit{right}_C]$.

We note that the length of a summary is bounded by $O(N^3)$, i.e., a constant that depends on $L$. Indeed, $A_L$ has at most $N$ components and, for each of them, we store at most $K + 3$ many things (namely, $C$, $v$, $q$, and $K$ edges). If we were able to guess a candidate for a summary $S$ and replace all abbreviations with matching pairwise edge-disjoint trails that are also disjoint with $S$, we would be able to obtain a trail that matches $L$.

◮ Definition 4.5. A candidate summary $S$ is a sequence of the form $S = \alpha_1 \cdots \alpha_m$ with $m \leq N$ where each $\alpha_i$ is either (1) an edge $e \in E$ or (2) an abbreviation $(C, (v, q), e_K \cdots e_1) \in \mathrm{Abbrv}$. Furthermore, all components and all edges appearing in $S$ are distinct. A path $p$ that is derived from $S$ by replacing each $\alpha_i \in \mathrm{Abbrv}$ by a trail $p_i$ such that $p_i \models \alpha_i$ is called a completion of the candidate summary $S$.
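The shape of a candidate summary can be illustrated with a small data structure. This is only a sketch of the syntactic conditions in Definition 4.5, under the assumption that components and states are identified by strings. The class `Abbreviation`, the function `is_candidate_summary`, and the parameters `max_len` (the length bound on the sequence) and `k` (the constant $K$) are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Edge = Tuple[str, str, str]  # (source node, label, target node)


@dataclass(frozen=True)
class Abbreviation:
    component: str             # identifier of a component C of A_L (assumed encoding)
    start: Tuple[str, str]     # (node v, state q) to start from
    suffix: Tuple[Edge, ...]   # the K edges e_K ... e_1


Item = Union[Edge, Abbreviation]


def is_candidate_summary(items: List[Item], max_len: int, k: int) -> bool:
    """Check the syntactic conditions: bounded length, K suffix edges per
    abbreviation, and pairwise distinct components and edges."""
    if len(items) > max_len:
        return False
    components: List[str] = []
    edges: List[Edge] = []
    for item in items:
        if isinstance(item, Abbreviation):
            if len(item.suffix) != k:
                return False
            components.append(item.component)
            edges.extend(item.suffix)
        else:
            edges.append(item)
    return (len(set(components)) == len(components)
            and len(set(edges)) == len(edges))
```

The check deliberately says nothing about whether a completion exists. Deciding that requires the automaton $A_L$ and the graph, which this sketch does not model.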
The following corollary is immediate from the definitions and Lemma 4.3, as the lemma ensures that the state after reading $p$ inside a component does not depend on the whole path but only on the labels of the last $K$ edges, which are fixed.

◮ Corollary 4.6.
Let $L$ be a language in $T_{\text{tract}}$. Let $S$ be the summary of a trail $p$ that matches $L$ and let $p'$ be a completion of $S$. Then, $p'$ is a path that matches $L$.

Together with the following lemma, Corollary 4.6 can be used to obtain an NL algorithm that gives us a completion of a summary $S$. The lemma heavily relies on other results on the structure of components in $A_L$ that we also prove in the Appendix.

◮ Lemma 4.7.
Let $L \in T_{\text{tract}}$, let $(C, (v, q), e_K \cdots e_1)$ be an abbreviation and $E' \subseteq E$. There exists an NL algorithm that outputs a shortest trail $p$ such that $p \models_{E'} (C, (v, q), e_K \cdots e_1)$ if it exists and rejects otherwise.

Using the algorithm of Lemma 4.7 we can, in principle, output a completion of $S$ that matches $L$ using nondeterministic logarithmic space. However, such a completion does not necessarily correspond to a trail. The reason is that, even though each trail $p_C$ we guess for some abbreviation involving a component $C$ is a trail, the trails for different components may not be disjoint. Therefore, we will define pairwise disjoint subsets of edges that can be used for the completion of the components.

The following definition fulfills the same purpose as the local domains on nodes in Bagan et al. [5, Definition 5]. Since our components can be more complex, we require extra conditions on the states (the $\delta_L(q, \pi) \in C$ condition).

◮ Definition 4.8 (Local Edge Domains).
Let $S = \alpha_1 \cdots \alpha_k$ be a candidate summary and $E(S)$ be the set of edges appearing in $S$. We define the local edge domains $\mathrm{Edge}_i \subseteq E_i$ inductively for each $i$ from $1$ to $k$, where $E_i$ are the remaining edges defined by $E_1 = E \setminus E(S)$ and $E_{i+1} = E_i \setminus \mathrm{Edge}_i$. If there is no trail $p$ such that $p \models \alpha_i$ or if $\alpha_i$ is a simple edge, we define $\mathrm{Edge}_i = \emptyset$.

Otherwise, let $\alpha_i = (C, (v, q), e_K \cdots e_1)$. We denote by $m_i$ the minimal length of a trail $p$ with $p \models \alpha_i$ and define $\mathrm{Edge}_i$ as the set of edges used by trails $\pi$ that only use edges in $E_i$, are of length at most $m_i - K$, and satisfy $\delta_L(q, \pi) \in C$.

We note that the sets $E(S)$ and $(\mathrm{Edge}_i)_{i \in [k]}$ are always disjoint.

Figure 3: Sketch of case (1) and (2) in the proof of Lemma 4.10.

◮ Definition 4.9 (Admissible Trail).
We say that a trail $p$ is admissible if there exist a candidate summary $S = \alpha_1 \cdots \alpha_k$ and trails $p_1, \ldots, p_k$ such that $p = p_1 \cdots p_k$ is a completion of $S$ and $p_i \models_{\mathrm{Edge}_i} \alpha_i$ for every $i \in [k]$.

We show that shortest trails that match $L$ are always admissible. Thus, the existence of a trail is equivalent to the existence of an admissible trail.

◮ Lemma 4.10.
Let $G$ and $(s, t)$ be an instance for RTQ($L$), with $L \in T_{\text{tract}}$. Then every shortest trail from $s$ to $t$ in $G$ that matches $L$ is admissible.

Proof sketch.
We assume towards a contradiction that there is a shortest trail $p$ from $s$ to $t$ in $G$ that matches $L$ and is not admissible. That means there is some $\ell \in \mathbb{N}$, and an edge $e$ used in $p_\ell$ with $e \notin \mathrm{Edge}_\ell$. There are two possible cases: (1) $e \in \mathrm{Edge}_i$ for some $i < \ell$ and (2) $e \notin \mathrm{Edge}_i$ for any $i$. In both cases, we construct a shorter trail that matches $L$, which then leads to a contradiction. We depict the two cases in Figure 3. We construct the new trail by substituting the respective subtrail with $\pi$. ◭

So, if there is a solution to RTQ($L$), we can find it by enumerating the candidate summaries and completing them using the local edge domains. We next prove that testing if an edge is in $\mathrm{Edge}_i$ can be done in logarithmic space. We will name this decision problem $P_{\text{edge}}(L)$ and define it as follows:

  $P_{\text{edge}}(L)$
  Given: A graph $G = (V, E)$, nodes $s, t$, a candidate summary $S$, an edge $e \in E$ and an integer $i$.
  Question: Is $e \in \mathrm{Edge}_i$?

◮ Lemma 4.11. $P_{\text{edge}}(L)$ is in NL for every $L \in T_{\text{tract}}$.

With this, we can finally give an NL algorithm that decides whether a candidate summary can be completed to an admissible trail that matches $L$.

◮ Lemma 4.12.
Let L ∈ T tract and L be infinite. Then, RTQ ( L ) is NL-complete. ◮ Corollary 4.13.
Let L ∈ T tract , G be a graph, and s , t be nodes in G . If there exists atrail from s to t that matches L , then we can output a shortest such trail in polynomial time(and in nondeterministic logarithmic space). . Martens, M. Niewerth and T. Trautner 11 T tract The proof of Theorem 4.1(3) is by reduction from the following NP-complete problem:
TwoEdgeDisjointPaths
Given: A language $L$, a graph $G = (V, E)$, and two pairs of nodes $(s_1, t_1)$, $(s_2, t_2)$.
Question: Are there two paths $p_1$ from $s_1$ to $t_1$ and $p_2$ from $s_2$ to $t_2$ such that $p_1$ and $p_2$ are edge-disjoint?

The proof is very close to the corresponding proof for simple paths by Bagan et al. [5, Lemma 2] (which is a reduction from the two vertex-disjoint paths problem).

The following theorem establishes the complexity of deciding for a regular language $L$ whether $L \in \mathsf{T}_{\text{tract}}$.

◮ Theorem 5.1.
Testing whether a regular language $L$ belongs to $\mathsf{T}_{\text{tract}}$ is (1) NL-complete if $L$ is given by a DFA and (2) PSPACE-complete if $L$ is given by an NFA or by a regular expression.

We wondered if, similarly to Theorem 3.2, it could be the case that languages closed under left-synchronized power abbreviations are always regular, but this is not the case. For example, the (infinite) Thue-Morse word [33, 24] has no subword that is a cube (i.e., no subword of the form $w^3$) [33, Satz 6]. The language containing all prefixes of the Thue-Morse word thus trivially is closed under left-synchronized power abbreviations (with $i = 3$), yet it is not regular.

We now give some closure properties of $\mathsf{SP}_{\text{tract}}$ and $\mathsf{T}_{\text{tract}}$.

◮ Lemma 5.2.
Both classes $\mathsf{SP}_{\text{tract}}$ and $\mathsf{T}_{\text{tract}}$ are closed under (i) finite unions, (ii) finite intersections, (iii) reversal, (iv) left and right quotients, (v) inverses of non-erasing morphisms, (vi) removal and addition of individual strings.

This lemma implies that $\mathsf{SP}_{\text{tract}}$ and $\mathsf{T}_{\text{tract}}$ are each a positive $\mathcal{C}_{ne}$-variety of languages, i.e., a positive variety of languages that is closed under inverse non-erasing homomorphisms.

◮ Lemma 5.3.
The classes $\mathsf{SP}_{\text{tract}}$ and $\mathsf{T}_{\text{tract}}$ are not closed under complement.

Proof. Let $\Sigma = \{a, b\}$. The language of the expression $b^*$ clearly is in $\mathsf{SP}_{\text{tract}}$ and $\mathsf{T}_{\text{tract}}$. Its complement is the language $L$ containing all words with at least one $a$. It can be described by the regular expression $\Sigma^* a \Sigma^*$. Since $b^i a b^i \in L$ for all $i$, but $b^i b^i \notin L$ for any $i$, the language $L$ is neither in $\mathsf{SP}_{\text{tract}}$ nor in $\mathsf{T}_{\text{tract}}$. ◭

It is an easy consequence of Lemma 5.2 (vi) that there do not exist best lower or upper approximations for regular languages outside $\mathsf{SP}_{\text{tract}}$ or $\mathsf{T}_{\text{tract}}$.

◮ Corollary 5.4.
Let $\mathcal{C} \in \{\mathsf{SP}_{\text{tract}}, \mathsf{T}_{\text{tract}}\}$. For every regular language $L$ such that $L \notin \mathcal{C}$ and
for every upper approximation $L''$ of $L$ (i.e., $L \subsetneq L''$) with $L'' \in \mathcal{C}$ it holds that there exists a language $L' \in \mathcal{C}$ with $L \subsetneq L' \subsetneq L''$;
for every lower approximation $L''$ of $L$ (i.e., $L'' \subsetneq L$) it holds that there exists a language $L' \in \mathcal{C}$ with $L'' \subsetneq L' \subsetneq L$.

The corollary implies that Angluin-style learning of languages in $\mathsf{SP}_{\text{tract}}$ or $\mathsf{T}_{\text{tract}}$ is not possible. However, learning algorithms for single-occurrence regular expressions (SOREs) exist [8] and can therefore be useful for an important subclass of $\mathsf{T}_{\text{tract}}$.

In this section we state that, using the algorithm from Theorem 4.1, the enumeration result from [37] transfers to the setting of enumerating trails matching $L$.

◮ Theorem 6.1.
Let $L$ be a regular language, $G$ be a graph, and $(s, t)$ a pair of nodes in $G$. If NL $\neq$ NP, then one can enumerate trails from $s$ to $t$ that match $L$ in polynomial delay in data complexity if and only if $L \in \mathsf{T}_{\text{tract}}$.

Proof sketch.
The algorithm is an adaptation of Yen's algorithm [37] that enumerates the $k$ shortest simple paths for some given number $k$, similar to what was done by Martens and Trautner [22, Theorem 18]. It uses the algorithm from Corollary 4.13 as a subprocedure. ◭

We have defined the class $\mathsf{T}_{\text{tract}}$ of regular languages $L$ for which finding trails in directed graphs that are labeled with $L$ is tractable, provided that NL $\neq$ NP. We have investigated $\mathsf{T}_{\text{tract}}$ in depth in terms of closure properties, characterizations, and the recognition problem, also touching upon the closely related class $\mathsf{SP}_{\text{tract}}$ (for which finding simple paths is tractable) when relevant.

In our view, graph database manufacturers can have the following tradeoffs in mind concerning simple path ($\mathsf{SP}_{\text{tract}}$) and trail semantics ($\mathsf{T}_{\text{tract}}$) in database systems:

$\mathsf{SP}_{\text{tract}} \subsetneq \mathsf{T}_{\text{tract}}$, that is, there are strictly more languages for which finding regular paths under trail semantics is tractable than under simple path semantics. Some of the languages in $\mathsf{T}_{\text{tract}}$ but outside $\mathsf{SP}_{\text{tract}}$ are of the form $(ab)^*$ or $a^*bc^*$, which were found to be relevant in several application scenarios involving network problems, genomic datasets, and tracking provenance information of food products [29] and appear in query logs [10, 9].

Both $\mathsf{SP}_{\text{tract}}$ and $\mathsf{T}_{\text{tract}}$ can be syntactically characterized but, currently, the characterization for $\mathsf{SP}_{\text{tract}}$ (Section 3.5 in [5]) is simpler than the one for $\mathsf{T}_{\text{tract}}$. This is due to the fact that connected components for automata for languages in $\mathsf{T}_{\text{tract}}$ can be much more complex than for automata for languages in $\mathsf{SP}_{\text{tract}}$.

On the other hand, the single-occurrence condition, i.e., each alphabet symbol occurs at most once, is a sufficient condition for regular expressions to be in $\mathsf{T}_{\text{tract}}$.
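The single-occurrence condition amounts to counting symbol occurrences in the expression. The following Python sketch illustrates this under an encoding chosen here purely for illustration: alphabet symbols are assumed to be single alphanumeric characters, and every other character is assumed to be an operator or a parenthesis. The function name `is_single_occurrence` is hypothetical.

```python
from collections import Counter


def is_single_occurrence(expr: str) -> bool:
    """Return True if every alphabet symbol occurs at most once in expr.

    Assumption (for illustration only): alphabet symbols are single
    alphanumeric characters; all other characters are operators or
    parentheses and are ignored.
    """
    # Count how often each alphabet symbol appears in the expression.
    counts = Counter(ch for ch in expr if ch.isalnum())
    # The expression is single-occurrence if no symbol is repeated.
    return all(n <= 1 for n in counts.values())
```

Under these assumptions, an expression such as `a*ba*` is rejected by the sketch, since the symbol `a` occurs twice.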
This condition is trivial to check and also captures languages outside $\mathsf{SP}_{\text{tract}}$ such as $(ab)^*$ and $a^*bc^*$. Moreover, the condition seems to be useful: we analyzed the 50 million RPQs found in the logs of [11] and discovered that over 99.8% of the RPQs are single-occurrence.

In terms of closure properties, learnability, or complexity of testing if a given regular language belongs to $\mathsf{SP}_{\text{tract}}$ or $\mathsf{T}_{\text{tract}}$, the classes seem to behave the same.

The tractability for the decision version of RPQ evaluation can be lifted to the enumeration problem, in which case the task is to output matching paths with only a polynomial delay between answers.

Acknowledgments
We thank the participants of Shonan meeting No. 138 (and Hassan Chafi in particular), who provided significant inspiration for the first paragraph in the Introduction, and Jean-Éric Pin for pointing us to positive $\mathcal{C}_{ne}$-varieties of languages.

References
[1] Parosh Aziz Abdulla, Aurore Collomb-Annichini, Ahmed Bouajjani, and Bengt Jonsson. Using forward reachability analysis for verification of lossy channel systems. Formal Methods in System Design, 25(1):39–65, 2004.
[2] Renzo Angles, Marcelo Arenas, Pablo Barceló, Peter A. Boncz, George H. L. Fletcher, Claudio Gutierrez, Tobias Lindaaker, Marcus Paradies, Stefan Plantikow, Juan F. Sequeda, Oskar van Rest, and Hannes Voigt. G-CORE: A core for future graph query languages. In International Conference on Management of Data (SIGMOD), pages 1421–1432, 2018.
[3] Marcelo Arenas, Sebastián Conca, and Jorge Pérez. Counting beyond a yottabyte, or how SPARQL 1.1 property paths will prevent adoption of the standard. In International Conference on World Wide Web (WWW), pages 629–638, 2012.
[4] Guillaume Bagan, Angela Bonifati, and Benoît Groz. A trichotomy for regular simple path queries on graphs. CoRR, abs/1212.6857, 2012. URL: http://arxiv.org/abs/1212.6857.
[5] Guillaume Bagan, Angela Bonifati, and Benoît Groz. A trichotomy for regular simple path queries on graphs. In Symposium on Principles of Database Systems (PODS), pages 261–272, 2013.
[6] Pablo Barceló. Querying graph databases. In Symposium on Principles of Database Systems (PODS), pages 175–188, 2013.
[7] Pablo Barceló, Leonid Libkin, and Juan L. Reutter. Querying graph patterns. In PODS, pages 199–210. ACM, 2011.
[8] Geert Jan Bex, Frank Neven, Thomas Schwentick, and Stijn Vansummeren. Inference of concise regular expressions and DTDs. ACM Trans. Database Syst., 35(2):11:1–11:47, 2010.
[9] Angela Bonifati, Wim Martens, and Thomas Timm. Navigating the maze of Wikidata query logs. In The Web Conference (WWW). ACM, 2019. To appear.
[10] Angela Bonifati, Wim Martens, and Thomas Timm. An analytical study of large SPARQL query logs. PVLDB, 11(2):149–161, 2017.
[11] Angela Bonifati, Wim Martens, and Thomas Timm. DARQL: deep analysis of SPARQL queries. In WWW (Companion Volume), pages 187–190. ACM, 2018.
[12] Dbpedia. wiki.dbpedia.org.
[13] Steven Fortune, John Hopcroft, and James Wyllie. The directed subgraph homeomorphism problem. Theoretical Computer Science (TCS), 10(2):111–121, 1980.
[14] Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. Cypher: An evolving query language for property graphs. In SIGMOD Conference, pages 1433–1445. ACM, 2018.
[15] Wouter Gelade, Marc Gyssens, and Wim Martens. Regular expressions with counting: Weak versus strong determinism. SIAM J. Comput., 41(1):160–190, 2012.
[16] Leonard H. Haines. On free monoids partially ordered by embedding. Journal of Combinatorial Theory, 6(1):94–98, 1969.
[17] Neil Immerman. Nondeterministic space is closed under complementation. SIAM J. Comput., 17(5):935–938, 1988.
[18] Pierre Jullien. Contribution à l'étude des types d'ordres dispersés. PhD thesis, Université de Marseille, 1969.
[19] Andrea S. LaPaugh and Christos H. Papadimitriou. The even-path problem for graphs and digraphs. Networks, 14(4):507–513, 1984.
[20] Andrea S. LaPaugh and Ronald L. Rivest. The subgraph homeomorphism problem. Journal of Computer and System Sciences, 20(2):133–149, 1980.
[21] Katja Losemann and Wim Martens. The complexity of regular expressions and property paths in SPARQL. ACM Transactions on Database Systems, 38(4):24:1–24:39, 2013.
[22] Wim Martens and Tina Trautner. Evaluation and enumeration problems for regular path queries. In ICDT, volume 98 of LIPIcs, pages 19:1–19:21. Schloss Dagstuhl - Leibniz-Zentrum fuer Informatik, 2018.
[23] Alberto O. Mendelzon and Peter T. Wood. Finding regular simple paths in graph databases. SIAM Journal on Computing, 24(6):1235–1258, 1995.
[24] Harold Marston Morse. Recurrent geodesics on a surface of negative curvature. Transactions of the American Mathematical Society, 22(1):84–100, Jan 1921. doi:10.2307/1988844.
[25] Neo4j. neo4j.com.
[26] Cypher query language reference, version 9, Mar. 2018. https://github.com/opencypher/openCypher/blob/master/docs/openCypher9.pdf.
[27] Oracle Spatial and Graph.
[28] Yehoshua Perl and Yossi Shiloach. Finding two disjoint paths between two pairs of vertices in a graph. J. ACM, 25(1):1–9, 1978.
[29] Petra Selmer (Neo4j). Personal communication.
[30] Thomas Place and Luc Segoufin. Decidable characterization of FO2(<, +1) and locality of DA. CoRR, abs/1606.03217, 2016. URL: https://arxiv.org/abs/1606.03217.
[31] Marcel Paul Schützenberger. On finite monoids having only trivial subgroups. Information and Control, 8(2):190–194, 1965.
[32] Denis Thérien and Thomas Wilke. Over words, two variables are as powerful as one quantifier alternation. In STOC, pages 234–240. ACM, 1998.
[33] Axel Thue. Über unendliche Zeichenreihen. Skrifter udg. af Videnskabs-Selskabet i Christiania: 1. Math.-Naturv. Klasse. Dybwad [in Komm.], 1906.
[34] Tigergraph.
[35] SPARQL 1.1 query language. World Wide Web Consortium, 2013.
[36] Wikidata. wikidata.org.
[37] Jin Y. Yen. Finding the k shortest loopless paths in a network. Management Science, 17(11):712–716, 1971.
A Preliminaries for the Appendix
We use some additional notation for automata. We write $C(q)$ to denote the strongly connected component of $q$. Let $q$ be a state, then we write $\Sigma^{\circlearrowleft}(q)$ to denote the set of symbols $a$ such that there is a word $w = aw' \in \mathrm{Loop}(q)$.

We also use the following lemma.

◮ Lemma A.1 (Implicit in [5], Lemma 1 proof).
Every minimal DFA satisfying

for all $q_1, q_2 \in Q_L$ such that $q_1 \leadsto q_2$ and $\mathrm{Loop}(q_1) \cap \mathrm{Loop}(q_2) \neq \emptyset$: $L_{q_2} \subseteq L_{q_1}$  (P)

accepts an aperiodic language.

Background on NFAs with Counters
We recall the definition of counter NFAs from Gelade et al. [15]. We introduce a minor difference, namely that counters count down instead of up, since this makes our construction easier to describe. Furthermore, since our construction only requires a single counter, zero tests, and setting the counter to a certain value, we immediately simplify the definition to take this into account.

Let $c$ be a counter variable, taking values in $\mathbb{N}$. A guard on $c$ is a statement $\gamma$ of the form $\mathit{true}$ or $c = 0$. We denote by $c \models \gamma$ that $c$ satisfies the guard $\gamma$. In the case where $\gamma$ is $\mathit{true}$, this is trivially fulfilled and, in the case where $\gamma$ is $c = 0$, this is fulfilled if $c$ equals $0$. By $G$ we denote the set of guards on $c$. An update on $c$ is a statement of the form $c := c - 1$, $c := c$, or $c := k$ for some constant $k \in \mathbb{N}$. By $U$ we denote the set of updates on $c$.

◮ Definition A.2. A non-deterministic counter automaton (CNFA) with a single counter is a 6-tuple $A = (Q, q_0, c, \delta, F, \tau)$ where $Q$ is the finite set of states; $q_0 \in Q$ is the initial state; $c$ is a counter variable; $\delta \subseteq Q \times \Sigma \times G \times Q \times U$ is the transition relation; and $F \subseteq Q$ is the set of accepting states. Furthermore, $\tau \in \mathbb{N}$ is a constant such that every update of the form $c := k$ satisfies $k \leq \tau$.

Intuitively, $A$ can make a transition $(q, a, \gamma; q', \pi)$ whenever it is in state $q$, reads $a$, and $c \models \gamma$, i.e., guard $\gamma$ is true under the current value of $c$. It then updates $c$ according to the update $\pi$, in a way we explain next, and moves into state $q'$. To explain the update mechanism formally, we introduce the notion of configuration. A configuration is a pair $(q, \ell)$ where $q \in Q$ is the current state and $\ell \in \mathbb{N}$ is the value of $c$. Finally, an update $\pi$ defines a function $\pi : \mathbb{N} \to \mathbb{N}$ as follows. If $\pi = (c := k)$, then $\pi(\ell) = k$ for every $\ell \in \mathbb{N}$. If $\pi = (c := c - 1)$, then $\pi(\ell) = \max(\ell - 1, 0)$. If $\pi = (c := c)$, then $\pi(\ell) = \ell$. So, counters never become negative.

The initial configuration $\alpha_0$ is $(q_0, 0)$. A configuration $(q, \ell)$ is accepting if $q \in F$ and $\ell = 0$. A configuration $\alpha' = (q', \ell')$ immediately follows a configuration $\alpha = (q, \ell)$ by reading $a \in \Sigma$, denoted $\alpha \to_a \alpha'$, if there exists $(q, a, \gamma; q', \pi) \in \delta$ with $c \models \gamma$ and $\ell' = \pi(\ell)$.

For a string $w = a_1 \cdots a_n$ and two configurations $\alpha$ and $\alpha'$, we denote by $\alpha \Rightarrow_w \alpha'$ that $\alpha \to_{a_1} \cdots \to_{a_n} \alpha'$. A configuration $\alpha$ is reachable if there exists a string $w$ such that $\alpha_0 \Rightarrow_w \alpha$. A string $w$ is accepted by $A$ if $\alpha_0 \Rightarrow_w \alpha_f$ where $\alpha_f$ is an accepting configuration. We denote by $L(A)$ the set of strings accepted by $A$.

It is easy to see that CNFA accept precisely the regular languages. (Due to the value $\tau$, counters are always bounded by a constant.)

B Proofs for Section 3 (The Tractable Class)
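The proofs in this appendix rely on the single-counter automata of Definition A.2. As a reminder of their semantics, the following Python sketch simulates such an automaton on a word by tracking the set of reachable configurations. It assumes that the initial configuration is $(q_0, 0)$ and that a configuration $(q, \ell)$ is accepting if $q \in F$ and $\ell = 0$. The encoding of guards as the strings `"true"` and `"zero"`, of updates as `"dec"`, `"keep"`, or an integer $k$, and the names `apply_update`, `accepts`, and `EXAMPLE` are illustrative assumptions.

```python
from typing import Iterable, Set, Tuple, Union

# Guard: "true" (always satisfied) or "zero" (the test c = 0).
# Update: "dec" (c := c - 1), "keep" (c := c), or an int k (c := k).
Update = Union[str, int]
Transition = Tuple[str, str, str, str, Update]  # (q, a, guard, q', update)


def apply_update(update: Update, value: int) -> int:
    """Apply an update to a counter value; counters never become negative."""
    if update == "dec":
        return max(value - 1, 0)
    if update == "keep":
        return value
    return update  # c := k


def accepts(transitions: Iterable[Transition], initial: str,
            accepting: Set[str], word: str) -> bool:
    """Decide acceptance by tracking all reachable configurations (q, l)."""
    transitions = list(transitions)
    configs = {(initial, 0)}  # assumed initial configuration (q0, 0)
    for symbol in word:
        successors = set()
        for state, value in configs:
            for q, a, guard, q_next, update in transitions:
                if q != state or a != symbol:
                    continue
                if guard == "zero" and value != 0:
                    continue  # the guard c = 0 is not satisfied
                successors.add((q_next, apply_update(update, value)))
        configs = successors
    # Accepting configurations need an accepting state and counter value 0.
    return any(q in accepting and value == 0 for q, value in configs)


# Hypothetical example with tau = 2: read "a" and set c := 2, count down
# on each "b", and allow the final "c" only once the counter is zero.
EXAMPLE = [
    ("s", "a", "true", "t", 2),
    ("t", "b", "true", "t", "dec"),
    ("t", "c", "zero", "u", "keep"),
]
```

In this hypothetical example with accepting state `u`, the word `abbc` is accepted, while `abc` is rejected because the zero test fails with counter value 1.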
The left-synchronizing length of an NFA $A$ is the smallest value $n$ such that the implication in Definition 3.3 for the left-synchronized containment property holds. We define the right-synchronizing length analogously.

◮ Observation B.1. Let $n_0$ be the left-synchronizing length of an NFA $A$. Then the implication of Definition 3.3 is satisfied for every $n \geq n_0$. The reason is that $w_2 \in \mathrm{Loop}(q_2)$.

◮ Lemma B.2.
Consider a minimal DFA A L = ( Q L , Σ , i L , F L , δ L ) with N states. Then thefollowing is true: (1) If A L satisfies the left-synchronized containment property, then the left-synchronizinglength is at most N . (2) If A L satisfies the right-synchronized containment property, then the right-synchronizinglength is at most N . Proof.
We first prove (1). Let n ∈ N be the left-synchronizing length. Since A L satisfies theleft-synchronized containment property, n is well defined. If n ≤ N , we are done, thereforewe assume n > N . By Definition 3.3, it holds that: If q , q ∈ Q A such that q q and if w ∈ Loop( q ) , w ∈ Loop( q ) with w = aw ′ and w = aw ′ , then we have w n L q ⊆ L q .Since n > N and w n L q ⊆ L q , there must be a loop in the w n part that generatesmultiples of w . Thus we can ignore the loop and obtain that w i L q ⊆ L q for an i < n .This is a contradiction to n being the left-synchronizing length (i.e., the minimality of n ).The proof of (2) is analogous. We only need to replace w = aw ′ and w = aw ′ with w = w ′ a , w = w ′ a , and left with right . ◭ From Definition 3.3, Observation B.1, and Lemma B.2, we get the following corollary. ◮ Corollary B.3.
Let A be a minimal DFA with N states, q , q ∈ Q A with q q , w ∈ Loop( q ) , and w ∈ Loop( q ) . If A satisfies theleft-synchronized containment property, w = aw ′ , and w = aw ′ , then w N L q ⊆ L q .right-synchronized containment property, w = w ′ a , and w = w ′ a , then w N L q ⊆ L q . We need two lemmas to prove Theorem 3.5. ◮ Lemma B.4. If A L has the left-synchronized containment property or right-synchronizedcontainment property, then L is aperiodic. Proof.
Let A L satisfy the left- or right-synchronized containment property. We show that L satisfies Property (P), restated here for convenience. L q ⊆ L q for all q , q ∈ Q L such that q q and Loop( q ) ∩ Loop( q ) = ∅ (P)This proves the lemma since all languages satisfying Property (P) are aperiodic, see Lemma A.1.Let q , q ∈ Q L and w satisfy q q and w ∈ Loop( q ) ∩ Loop( q ). By Corollary B.3 wethen have that w N L q ⊆ L q . Since w ∈ Loop( q ), we have that δ ∗ ( q , w N ) = q , which inturn implies that L q ⊆ L q . ◭◮ Lemma B.5. If L is closed under left- or right-synchronized power abbreviations, then L is aperiodic. . Martens, M. Niewerth and T. Trautner 17 Proof.
Let L be closed under left- or right-synchronized power abbreviations and i ∈ N beas in Definition 3.4. We show that A L satisfies the Property (P). The aperiodicity thenfollows from Lemma A.1.Let q , q ∈ Q L and w satisfy q q and w ∈ Loop( q ) ∩ Loop( q ). Let w ℓ , w m ∈ Σ ∗ be such that q = δ L ( i L , w ℓ ) and q = δ L ( q , w m ). Let w r ∈ L q . Then, w ℓ w ∗ w m w ∗ w r ⊆ L by construction. Especially, w ℓ w i w m w i w r ∈ L and, by Definition 3.4, also w ℓ w i w i w r ∈ L .Since δ L ( i L , w ℓ w i w i ) = q , this means that w r ∈ L q . Therefore, L q ⊆ L q . ◭◮ Theorem 3.5.
For a regular language $L$ with minimal DFA $A_L$, the following are equivalent:
(1) $A_L$ satisfies the left-synchronized containment property.
(2) $A_L$ satisfies the right-synchronized containment property.
(3) $L$ is closed under left-synchronized power abbreviations.
(4) $L$ is closed under right-synchronized power abbreviations.

Proof.
Let A L = ( Q L , Σ , i L , F L , δ L ). (1) ⇒ (3): Let A L satisfy the left-synchronizedcontainment property. We will show that if there exists a word w ℓ w i w m w i w r ∈ L with i = N + N and w and w starting with the same letter, then w ℓ w i w i w r ∈ L. Tothis end, let w ℓ w i w m w i w r ∈ L . Due to the pumping lemma, there are states q , q andintegers h, j, k, ℓ, m, n ≤ N with j, m ≥ q = δ ( i L , w ℓ w h ), q = δ ( q , w j ), q = δ ( q , w k w m w ℓ ), q = δ ( q , w j ), and w n w r ∈ L q . This implies that w ℓ w h ( w j ) ∗ w k w m w ℓ ( w m ) ∗ w n w r ⊆ L .
Since A L satisfies the left-synchronized containment property and by Corollary B.3, we have( w m ) N L q ⊆ L q and therefore w ℓ w h ( w j ) ∗ ( w m ) N w n w r ⊆ L .
Now we use that L is aperiodic, see Lemma B.4: w ℓ w h ( w j ) N ( w ) ∗ ( w m ) N ( w ) ∗ w n w r ⊆ L And finally, we use that i = N + N and h, j, m, n ≤ N to obtain w ℓ ( w ) i ( w ) i w r ∈ L. (3) ⇒ (4): Let L be closed under left-synchronized power abbreviations and let j ∈ N bethe maximum of | A L | and the i + 1, where the i is from Definition 3.4. We will show thatif w ℓ ( w a ) j w m ( w a ) j w r ∈ L , then w ℓ ( w a ) j ( w a ) j w r ∈ L . If w ℓ ( w a ) j w m ( w a ) j w r ∈ L ,then we also have w ℓ ( w a ) j w m ( w a ) j +1 w r ∈ L since L is aperiodic, see Lemma B.5, and j ≥ | A L | . This can be rewritten as w ℓ w ( aw ) j − aw m w ( aw ) j − ( aw aw r ) ∈ L . As L is closed under left-synchronized power abbreviations, and i < j , this implies w ℓ w ( aw ) j − ( aw ) j − ( aw aw r ) ∈ L .
This can be rewritten into w ℓ ( w a ) j ( w a ) j w r ∈ L .(4) ⇒ (2): Let L be closed under right-synchronized power abbreviations. We will provethat A L satisfies the right-synchronized containment property, that is, if there are two states q , q in A L with q q and w ∈ Loop( q ), w ∈ Loop( q ), such that w and w end withthe same letter, then ( w a ) N L q ⊆ L q . Let q , q be such states. Then there exist w ℓ , w m with q = δ L ( i L , w ℓ ) and q = δ L ( q , w m ). If L q = ∅ , we are done. So let us assume there is a word w r ∈ L q . We define w ′ r = w N w r . Due to construction, we have w ℓ w ∗ w m w ∗ w ′ r ⊆ L .Since L is closed under right-synchronized power abbreviations, there is an i ∈ N such that w ℓ w i w i w ′ r ∈ L . Since we have a deterministic automaton and q = δ L ( i L , w ℓ w i ) this impliesthat w i w ′ r = w i w N w r ∈ L q . We now use that L is aperiodic due to Lemma B.5 to inferthat w N w r ∈ L q .(2) ⇒ (1): We will show that If there exist states q , q ∈ Q L and words w , w ∈ Σ ∗ with aw ∈ Loop( q ) and aw ∈ Loop( q ) and q q , then ( aw ) N L q ⊆ L q . Let q , q be such states and w , w as above. We define q ′ = δ L ( q ′ , w ) and q ′ = δ L ( q ′ , w ). Since A L is deterministic, the construction implies that w a ∈ Loop( q ′ ) and w a ∈ Loop( q ′ ).Furthermore, it holds that (i) L q ′ = a − L q and (ii) w L q ⊆ L q ′ . With this we willshow that ( w a ) N L q ′ ⊆ L q ′ implies ( aw ) N L q ⊆ L q . Let ( w a ) N L q ′ ⊆ L q ′ . Addingan a left hand, yields ( aw ) N aL q ′ ⊆ aL q ′ ⊆ L q because of (i). We use (ii) to replace L q ′ to get: ( aw ) N +1 L q ⊆ L q . Since L is aperiodic, see Lemma B.4, this equivalent to( aw ) N L q ⊆ L q . ◭◮ Corollary B.6.
If a regular language L satisfies Definition 3.4 and N = | A L | then, forall i > N + N and for all words w ℓ , w m , w r ∈ Σ ∗ and all words w = aw ′ and w = aw ′ (resp., w = w ′ a and w = w ′ a ) we have that w ℓ w i w m w i w r ∈ L implies w ℓ w i w i w r ∈ L . Proof.
This immediately follows from the proof of (1) ⇒ (3). ◭ The following lemma is the implication (1) ⇒ (5) from Theorem 3.9 ◮ Lemma B.7. If L ∈ T tract , then there exists a detainment automaton for L with consistentshortcuts and only memoryless components. Proof.
Let A L = ( Q L , Σ , i L , F L , δ L ) be the minimal DFA for L . The proof goes as follows:First, we define a CNFA A with two counters. Second, we show that we can convert A to anequivalent CNFA A ′ with only one counter that is a detainment automaton with consistentshortcuts and only memoryless components. This conversion is done by simulating one ofthe counters using a bigger set of states. Last, we show that L ( A ) = L ( A L ), which showsthe lemma statement as L ( A ) = L ( A ′ ).Before we start we need some additional notation. We write p y a q to denote that C ( p ) C ( q ) and there are states q ∈ C ( p ) and p ∈ C ( q ) such that ( p i , a, q i ) ∈ δ L for i ∈ { , } .Let ∼ ⊆ Q L × Q L be the smallest equivalence relation over Q L that satisfies p ∼ q if C ( p ) = C ( q ) and Σ (cid:8) ( p ) ∩ Σ (cid:8) ( q ) = ∅ . For q ∈ Q L , we denote by [ q ] the equivalence classof q . By [ Q L ] we denote the set of all equivalence classes. We also write [ C ] to denote theequivalence classes that only use states from some component C . We extend the notion C ( q )to [ Q L ], i.e., C ([ q ]) = C ( q ) for all q ∈ Q L .We will use the following observation that easily follows from Lemma C.2 using thedefinition of ∼ . ◮ Observation B.8.
Let q , q be states with [ q ] = [ q ] , then for all a ∈ Σ it holds that δ L ( q , a ) ∈ C ( q ) if and only if δ L ( q , a ) ∈ C ( q ) . We define a CNFA A = ( Q, I, c, d, δ, F, N ) that has two counters c and d . The counter c is allowed to have any initial value from [0 , N ], while the counter d has initial value 0.We note that we will eliminate counter c when converting to a one counter automaton, thusthis is not a contradiction to the definition of CNFA with one counter that we use.We use Q ′ = Q L ∪ [ Q L ], i.e., we can use the states from A L and the equivalence classesof the equivalence relation ∼ . The latter will be used to ensure that strongly connected . Martens, M. Niewerth and T. Trautner 19 components are memoryless, while the former will only be used in trivial strongly connectedcomponents. We use I = { i L , [ i L ] } and F = F L . δ (cid:8) = { ( q , a, { c > , d = 0 } ; q , { c := c − } ) | ( q , a, q ) ∈ δ L , C ( q ) = C ( q ) } δ (cid:8) = { ([ q ] , a, { c = N } ; [ q ] , { d := d − } ) | ( q , a, q ) ∈ δ L , C ( q ) = C ( q ) } δ (cid:8) = { ([ q ] , a, { c = N , d = 0 } ; q , { c := c − } ) | ( q , a, q ) ∈ δ L , C ( q ) = C ( q ) } δ → = { ( q , a, { c = 0 , d = 0 } ; q , { c := i } ) | ( q , a, q ) ∈ δ L , C ( q ) = C ( q ) , i ∈ [0 , N − } δ → = { ( q , a, { c = 0 , d = 0 } ; [ q ] , { c := N } ) | ( q , a, q ) ∈ δ L , C ( q ) = C ( q ) } δ y = { ([ q ] , a, { c = N , d = 0 } ; [ q ] , { d := N } ) | q y a q , C ( q ) = C ( q ) } δ = δ (cid:8) ∪ δ (cid:8) ∪ δ (cid:8) ∪ δ → ∪ δ → ∪ δ y We say that a component C of A L is a long run component of a given word w = a · · · a n ,if |{ i | δ ( i L , a · · · a i ) ∈ C }| > N , i.e., if the run stays in C for for than N symbols. Allother components are short run components .For short run components, we use states from Q L . We use the counter c to enforce thatthese parts are indeed short. For long run components, we first use states in [ Q L ]. 
Onlythe last N symbols in the component are read using states from Q L . The left-synchronizedcontainment property guarantees that for long run components the precise state is notimportant, which allows us to make these components memoryless.The transition relation is divided into transitions between states from the same compo-nent of A L (indicated by δ (cid:8) = δ (cid:8) ∪ δ (cid:8) ∪ δ (cid:8) ) and transitions between different components(indicated by δ → = δ → ∪ δ → ). Transitions in δ y are added to satisfy the consistent jumpsproperty. They are the only transitions that increase the counter d . This is necessary, as theleft-synchronized containment property only talks about the language of the state reachedafter staying in the component for some number of symbols. If we added the transitionsin δ y without using the counter, we would possibly add additional words to the language.This concludes the definition of A .We now argue that the automaton A ′ = ( Q ′ × [0 , N ] , i L , d, δ ′ , F × { } , N ) derived from A by pushing the counter c into the states is a detainment automaton with consistent jumpsthat only has memoryless components. The states of A ′ have two components, first the stateof A and second the value of the second counter that is bounded by N . We do not formallydefine δ ′ . It is derived from δ in the obvious way, i.e., by doing the precondition checks thatdepend on c on the second component of the state. 
Similarly, updates of c are done on thesecond component of the states.It is straightforward to see that A ′ is a detainment automaton with consistent jumpsthat only has memoryless components using the following observations:Every transition in A that does not have c = N before and after the transition requires d = 0.Let Cuts be the set of nontrivial strongly connected components of A , then the set ofnontrivial strongly connected components of A ′ is { [ C ] × { N } | C ∈ Cuts } .The consistent shortcuts are guaranteed by the transitions in δ y . As A ′ only has memorylesscomponents, the consistent shortcut property is trivially satisfied for states inside the samecomponent.We now show that L ( A L ) ⊆ L ( A ). Let w = a · · · a n be some string in L ( A L ) and q → · · · → q n be the run of A L on w . We define a function countdown : N → N that givesus how long we stay inside some component as countdown : i j − i , where j is the largestnumber such that C ( q j ) = C ( q i ). It is easy to see by the definitions of the transitions in δ → and δ (cid:8) , that the run (cid:0) p , min( N , countdown (0)) , (cid:1) → · · · → (cid:0) p n , min( N , countdown ( n )) , (cid:1) is an accepting run of A , where p i is q i if c i < N and [ q i ] otherwise. We note that thecounter d is always zero, as we do not use any transitions from δ y . The transitions in δ y are only there to satisfy the consistent shortcuts property. This shows L ( A L ) ⊆ L ( A ).Towards the lemma statement, it remains to show that L ( A ) ⊆ L ( A L ). Let therefore w = a · · · a n be some string in L ( A ), ( p , c , d ) → · · · → ( p n , c n , d n ) be an accepting runof A , and q · · · q n be the unique run of A L on w .We now show by induction on i that there are states ˆ q , . . . , ˆ q n in Q L such that thefollowing claim is satisfied. The claim easily yields that q n ∈ F L , as both counters have tobe zero for the word to be accepted. 
L ˆ q i ∩ a i +1 · · · a i + d i Σ ∗ ⊆ L q i and ˆ q i ∈ { p i } if c i = d i = 0[ p i ] if c i + d i > p i ∈ Q L p i if c i + d i > p i ∈ [ Q L ]The base case i = 0 is trivial by the definition of I . We now assume that the inductionhypothesis holds for i and are going to show that it holds for i +1. Let ρ = ( p i − , a i , P ; p i , U )be the transition used to read a i . We distinguish several cases depending on ρ .Case ρ ∈ δ → : In this case, c i = 0 by the definition of δ → . Therefore, the claim for i + 1follows with ˆ q i +1 = p i +1 , as ˆ q i = p i by the induction hypothesis and ( p i , a, p i +1 ) ∈ δ L bythe definition of δ → .Case ρ ∈ δ (cid:8) : We note that p i , p i +1 ∈ [ Q L ]. The claim for i + 1 follows with ˆ q i +1 = δ (ˆ q i , a i +1 ) using C ( q ′ ) = C ( δ ( q ′ , a i +1 ) for all q ′ ∈ [ p i ] (by Observation B.8), C ( p i ) = C ( p i +1 )(by definition of δ (cid:8) ), and ˆ q i ∈ p i (by the induction hypothesis).Case ρ ∈ δ (cid:8) : We want to show that L p i + N ⊆ L q N establishing the claim directly forthe position i + N using ˆ q i + N = p i + N . Therefore, we first want to apply Lemma 4.3 toshow that δ (ˆ q i , a i +1 · · · a i + N ) = p i + N . The preconditions of the lemma require us to showthat (i) C (ˆ q i ) = C ( p i ), (ii) C ( p i ) = C ( p i + N ), and (iii) C (ˆ q i ) = C ( δ L (ˆ q i , a i +1 · · · a i + N )).Precondition (i) is given by the induction hypothesis, precondition (ii) is by the definition of δ (cid:8) , i.e., that all transitions in δ (cid:8) are inside the same component of A L , and precondition (iii)is by the fact that each transition in δ (cid:8) has a corresponding transition in δ L that staysin the same component. Therefore, we can actually apply Lemma 4.3 to conclude that δ (ˆ q i , a i +1 · · · a i + N ) = p i + N . As we furthermore have that L ˆ q i ∩ a i +1 · · · a i + d i Σ ∗ ⊆ L q i bythe induction hypotheses, we can conclude that L p i + N ⊆ L q N . 
This establishes the claimfor position i + N using ˆ q i + N = p i + N . As we only need the claim for position n (and notfor all smaller positions), we can continue the induction at position i + N . Especially thereis no need to look at the case where ρ ∈ δ (cid:8) .Case ρ ∈ δ y : By the definition of δ y , we have that p i , p i +1 ∈ [ Q L ]. Furthermore, thereare transitions ( p i , a i +1 , p ′ ) and ( p ′′ , a i +1 , p i +1 ) in δ L such that C ( p ′ ) = C ( p i ), C ( p ′′ ) = C ( p i +1 ), and p ′ p ′′ . This (and the fact that ˆ q i ∈ p i by the induction hypothesis) allows usto apply Observation B.8 , which yields δ (ˆ q i , a i +1 ) ∈ C ( p i ). From p ′ p ′′ and ˆ q i ∈ C ( p ′ ) wecan conclude that ˆ q i p ′′ . We now can apply Lemma C.4 that gives us L p ′′ ∩ L a i +1 p ′′ Σ ∗ ⊆ L ˆ q i .Now we argue that the subword a i +2 · · · a i + N +1 is in L a i +1 p ′′ . By the definition of δ y , wehave d i +1 = N , enforcing that the next N transitions are all from δ (cid:8) , as these are the onlytransitions that allow d > N times yields . Martens, M. Niewerth and T. Trautner 21 that δ ( p ′′ , a i +2 · · · a i + N +1 ) ∈ C ( p ′′ ) and therefore a i +2 · · · a i + N +1 ∈ L a i +1 p ′′ . Using this and L p ′′ ∩ L a i +1 p ′′ Σ ∗ ⊆ L ˆ q i , we get that L δ ( p ′′ ,a i +1 ) ∩ a i +2 · · · a i + N +1 Σ ∗ ⊆ L δ (ˆ q i ,a i +1 ) yielding theclaim for i + 1. This concludes the proof of the lemma. ◭◮ Theorem 3.9.
For a regular language $L$, the following properties are equivalent:
(1) $L \in T_{\mathrm{tract}}$.
(2) There exists an NFA $A$ for $L$ that satisfies the left-synchronized containment property.
(3) There exists an NFA $A$ for $L$ that satisfies the left-synchronized containment property and only has memoryless components.
(4) There exists a detainment automaton for $L$ with consistent shortcuts.
(5) There exists a detainment automaton for $L$ with consistent shortcuts and only memoryless components.

Proof of Theorem 3.9.
We show (1) ⇒ (5) ⇒ (3) ⇒ (2) ⇒ (1) and (5) ⇒ (4) ⇒ (2).(1) ⇒ (5): Holds by Lemma B.7.(5) ⇒ (3) and (4) ⇒ (2): Let A = ( Q, I, c, δ, F, ℓ ) be a detainment automaton withconsistent jumps. We compute an equivalent NFA A ′ = ( Q × { , . . . , ℓ } , δ ′ , I × { } , F × { } )in the obvious way, i.e., (( p, i ) , a, ( q, j )) ∈ δ ′ if and only if A can go from configuration ( p, i )to configuration ( q, j ) reading symbol a . By the definition of detainment automata, we getthat the nontrivial strongly connected components of A ′ are { C × { } | C is a nontrivial strongly connected component of A } This directly shows that A ′ only has memoryless components if A only has memorylesscomponents.We choose n = ℓ . Let now ( q , c ) , ( q , c ) ∈ Q × { , . . . , ℓ } , a ∈ Σ, and w ′ , w ′ ∈ Σ ∗ besuch that ( q , c ) ( q , c ), w = aw ′ ∈ Loop(( q , c )), and w = aw ′ ∈ Loop(( q , c )). Wehave to show that w n L ( q ,c ) ⊆ L ( q ,c ) . (1)We distinguish two cases. If q and q are in the same component, we know that there isa transition ( q , a, true ; q ; c := c − ∈ δ , as A has consistent jumps. Therefore, there is atransition (( q , , a, ( q , ∈ δ ′ , which directly yields (1).If q and q are in different components, then there is a transition ( q , a, c = 0; q ; c := k ) ∈ δ , as A has consistent jumps. Therefore, there is a transition (( q , , a, ( q , k )) ∈ δ ′ forsome k ∈ [0 , ℓ ]. We have w ∈ Loop( q ). The definition of detainment automata requiresthat every transition inside a component—thus every transition used to read w using theloop—is of the form ( p, a, true ; q, c := c − A ′ , we have that δ ′ (( q , k ) , w ℓ ) ⊇ δ (( q , , w ℓ ).This concludes the proof of (5) ⇒ (3) and (4) ⇒ (2)(5) ⇒ (4) and (3) ⇒ (2): Trivial.(2) ⇒ (1): Let A = ( Q, Σ , δ, I, F ) be an NFA satisfying the left-synchronized containmentproperty and A L be the minimal DFA equivalent to A . 
We show that A L satisfies the left-synchronized containment property establishing (1).Let M be the left synchronizing-length of A and q , q ∈ Q L be states of A L such that q q ; andthere are words w ∈ Loop( q ) and w ∈ Loop( q ) that start with the same symbol a . We need to show that there exists an n ∈ N with w n L q ⊆ L q . Let w be a word such that δ ( q , w ) = q . Let P ⊆ Q be a state of the powerset automaton of A with L P = L q andlet P = δ ( P , ww ∗ ) be the state in the powerset automaton of A that consists of all statesreachable from P reading some word from ww ∗ .It holds that L P = L q , as δ ( q , ww ∗ ) = q and L q = L P .We define P ′ = { p ∈ P | w i ∈ Loop( p ) for some i > } P ′′ = δ ( P , w | A | )We obviously have P ′′ ⊆ P ′ ⊆ P . Furthermore, we have L P = L q = L δ ( q ,w | A | ) = L P ′′ The second equation is by δ ( q , w | A | ) = q . We can conclude that L q = L P ′ .Let ρ : Q → Q be a function that selects for every state p ∈ P ′ a state p ∈ P suchthat p p . By definition of P ′ , such states exist. Using the fact that A satisfies theleft-synchronized containment property, we get that w M L p ⊆ L ρ ( p ) for each p ∈ P . Wecan conclude w M L q = w M L P ′ = [ p ∈ P ′ w M L p ⊆ [ p ∈ P ′ L ρ ( p ) ⊆ L P = L q and therefore w | A | + M L q ⊆ L q . So A L satisfies the left-synchronized containment propertywith n = M , where M is the left synchronizing-length of A . This concludes the proof for(2) ⇒ (1) and thus the proof of the theorem. ◭◮ Lemma B.9. T tract ⊆ FO [ <, +] Proof.
Let L be a language in T tract and A L be the minimal DFA of L . We use the character-ization of FO [ <, +] from Place and Segoufin [30]: A language L is definable in FO [ <, +]if and only if there exists an n such that for every u, v ∈ Σ ∗ and e, s, t ∈ Σ + with e beingidempotent it holds that u ( esete ) n v ∈ L if and only if u ( esete ) n t ( esete ) n v ∈ L ( ‡ )In particular, L is expressible in FO [ <, +] if the above equivalence holds for all u, v ∈ Σ ∗ and e, s, t ∈ Σ + , i.e., dropping the condition that e is an idempotent.We choose n = 2 N . Let q = δ ( i L , u ( esete ) n/ ) be the state after reading u ( esete ) n/ in A L . By standard pumping arguments, we know that we have read the last copy of esete inside some nontrivial component C of A L . By Corollary C.3, we can conclude that δ ( q, ( esete ) n/ ) ∈ C , δ ( q, ( esete ) n/ t ) ∈ C and δ ( q, ( esete ) n/ t ( esete ) n ) ∈ C . By Lemma 4.3,we can conclude that δ ( q, ( esete ) n/ ) = δ ( q, ( esete ) n/ t ( esete ) n ), which yields ( ‡ ). ◭◮ Theorem 3.12.(a) DC ( SP tract ( ( FO [ < ] ∩ T tract ) (b) T tract ( FO [ <, +] (c) T tract and FO [ < ] are incomparable Proof.
We first show (a). As DC is definable by simple regular expressions, we have for eachdownward closed language L that w ℓ w i ww i w r ∈ L implies w ℓ w i w i w r ∈ L for every integer . Martens, M. Niewerth and T. Trautner 23 i ∈ N and all words w ℓ , w , w, w , w r ∈ Σ ∗ . Therefore, L ∈ SP tract by Definition 3.11. Thelanguage { a } is not downward closed, but in SP tract using Definition 3.11 with i = 1.As SP tract ⊆ T tract by definition and a ∗ bc ∗ is a language that is not in SP tract , but in T tract and in FO [ < ], it only remains to show SP tract ⊆ FO [ < ].The class FO [ < ] can be described as the class of regular languages for which we havethat there exists an n such that for all u, v, x, y, z ∈ Σ ∗ it holds that: u ( xyz ) n y ( xyz ) n v ∈ L if and only if u ( xyz ) n v ∈ L . The only if-direction follows from Definition 3.11, that requiresthat each language L in SP tract satisfies u ( xyz ) n y ( xyz ) n v ∈ L only if u ( xyz ) n v ∈ L for some n ∈ N .For the if-direction, we use that Bagan et al. [5] give a definition in terms of regularexpressions, showing that each strongly connected component can be represented as ( A ≥ k + ε )for some k ∈ N . So if there is xyz ∈ Σ ∗ with u ( xyz ) M v ∈ L for some u, v ∈ Σ ∗ , then wealso have u ( xyz ) M (Alph( xyz )) ∗ ( xyz ) M v ⊆ L , where Alph( x ) denotes the set of symbols x uses. Thus we especially have u ( xyz ) M y ( xyz ) M v ∈ L , which proves the other direction.This concludes the proof of (a).Statement (b) follows from Lemma B.9 and the observation that a ∗ ba ∗ is a language in FO [ <, +] but not in T tract .It remains to show (c), which simply follows from the facts that the language a ∗ ba ∗ is in FO [ < ] but not in T tract whereas the language ( ab ) ∗ is in T tract but not in FO [ < ]. ◭ C Proofs for Section 4 (The Trichotomy)
We will first prove some technical lemmas on the components of languages in $T_{\mathrm{tract}}$.

◮ Lemma C.1. Let $A_L$ satisfy the left-synchronized containment property. If states $q_1$ and $q_2$ belong to the same component of $A_L$ and $\mathrm{Loop}(q_1) \cap \mathrm{Loop}(q_2) \neq \emptyset$, then $q_1 = q_2$.

Proof. Let $q_1, q_2$ be as stated and let $w$ be a word in $\mathrm{Loop}(q_1) \cap \mathrm{Loop}(q_2)$. According to Definition 3.3, there exists an $n \in \mathbb{N}$ such that $w^n L_{q_2} \subseteq L_{q_1}$. Since $w \in \mathrm{Loop}(q_1)$, we have $\delta_L(q_1, w^n) = q_1$, so this implies that $L_{q_2} \subseteq L_{q_1}$. By symmetry, we have $L_{q_1} = L_{q_2}$, which implies $q_1 = q_2$, since $A_L$ is the minimal DFA. ◭

The next lemmas characterize the internal language of a component.

◮ Lemma C.2.
Let $L \in T_{\mathrm{tract}}$, $a \in \Sigma$, $C$ be a component of $A_L$, and $q_1, q_2 \in C$. If there exist $w_1 a \in \mathrm{Loop}(q_1)$ and $w_2 a \in \mathrm{Loop}(q_2)$, then, for all $\sigma \in \Sigma$, we have that $\delta_L(q_1, \sigma) \in C$ if and only if $\delta_L(q_2, \sigma) \in C$.

Proof.
Let q = q be two states in C . Let σ satisfy δ L ( q , σ ) ∈ C and let w ∈ Loop( q ) ∩ σ Σ ∗ a . Such a w exists since δ L ( q , σ ) ∈ C and δ L ( q , w a ) = q . Let q = δ L ( q , w N ). Wewill prove that q = q , which implies that δ L ( q , σ ) ∈ C . As L is aperiodic, w ∈ Loop( q ).Consequently, there is an n ∈ N such that w n L q ⊆ L q by Definition 3.3. Since w ∈ Loop( q ), this also implies L q ⊆ L q . Furthermore, q has a loop ending with a and A L satisfies the right-synchronized containment property, so w N L q ⊆ L q by Corollary B.3.Hence, L q ⊆ ( w N ) − L q and, by definition of q , we have ( w N ) − L q = L q . So we showed L q ⊆ L q and L q ⊆ L q which, by minimality of A L , implies q = q . ◭ We simply rewrote the identity that defines the class DA , that is ( xyz ) ω y ( xyz ) ω = ( xyz ) ω . Thérienand Wilke [32] proved that DA = FO [ < ]. The following is a direct consequence thereof. ◮ Corollary C.3.
Let $L \in T_{\mathrm{tract}}$, $a \in \Sigma$, $C$ be a component of $A_L$, and $q_1, q_2 \in C$. If there exist $w_1 a \in \mathrm{Loop}(q_1)$ and $w_2 a \in \mathrm{Loop}(q_2)$, then $\delta_L(q_1, w) \in C$ if and only if $\delta_L(q_2, w) \in C$ for all words $w \in \Sigma^*$.

Definition 3.3 uses some word $w_2$ that is repeated multiple times. Now, we show that we can use any word $w$ that stays in a component, given that $w$ is long enough and starts with the same letter.

◮ Lemma C.4.
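Let $L \in T_{\mathrm{tract}}$ and let $q_1, q_2$ be two states such that $q_1 \rightsquigarrow q_2$ and $\mathrm{Loop}(q_1) \cap a\Sigma^* \neq \emptyset$. Let $C$ be the component of $A_L$ that contains $q_2$. Then,
$$L_{q_2} \cap L^a_{q_2} \Sigma^* \subseteq L_{q_1},$$
where $L^a_{q_2}$ is the set of words $w$ of length $N^2$ that start with $a$ and such that $\delta_L(q_2, w) \in C$.

Proof.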
If Loop( q ) = ∅ , then L q ∩ L aq Σ ∗ = ∅ and the inclusion trivially holds. Thereforewe assume from now on that Loop( q ) = ∅ . Since the proof of this lemma requires a numberof different states and words, we provide a sketch in Figure 4. Let w ∈ L q ∩ L aq Σ ∗ , u bethe prefix of w of length N and w ′ be the suffix of w such that w = uw ′ . Since q and δ L ( q , u ) are both in the same strongly connected component C , there exists a word v with uv ∈ Loop( q ). Corollary B.3 implies that( uv ) N L q ⊆ L q . (2)Let q = δ L ( q , ( uv ) N ). Due to aperiodicity we have uv ∈ Loop( q ). Since A L is determin-istic, this implies L q = (( uv ) N ) − L q and, together with Equation (2) that L q ⊆ L q . (3)We now show that there is a prefix u of u such that δ L ( q , u ) = q and δ L ( q , u ) = q ′ with Loop( q ) ∩ Loop( q ′ ) = ∅ . Assume that u = a · · · a N . Let q α, = q α and, for each i from 1 to N and α ∈ { , } , let q α,i = δ L ( q α , a · · · a i ). Since there are at most N distinctpairs ( q ,i , q ,i ), there exist i, j with 0 ≤ i < j ≤ N such that q ,i = q ,j and q ,i = q ,j .Let u = a · · · a i and u = a i +1 · · · a j . We have u ∈ Loop( q ,i ) ∩ Loop( q ,i ). We define q := δ L ( q , u ) and q ′ = δ L ( q , u ). Since q q ′ and u ∈ Loop( q ) ∩ Loop( q ′ ), Corollary B.3implies u N L q ′ ⊆ L q . Since u ∈ Loop( q ), we also have that L q ′ ⊆ L q . (4)By definition of q and the determinism of A L , we have that L q = u − L q . Thus, Equation (4)implies L q ′ ⊆ u − L q . The definition of q ′ implies that L q ′ = u − L q , so u − L q ⊆ u − L q .In other words, we have L q ∩ u Σ ∗ ⊆ L q ∩ u Σ ∗ . Since u is a prefix of u , and byEquation (3), we also have L q ∩ u Σ ∗ ⊆ L q . This implies that w ∈ L q , which concludesthe proof. ◭◮ Lemma 4.3.
Let $L \in T_{\mathrm{tract}}$, let $C$ be a component of $A_L$, let $q_1, q_2 \in C$, and let $w$ be a word of length $N^2$. If $\delta_L(q_1, w) \in C$ and $\delta_L(q_2, w) \in C$, then $\delta_L(q_1, w) = \delta_L(q_2, w)$.

Proof. Assume that $w = a_1 \cdots a_{N^2}$. For each $i$ from $0$ to $N^2$ and $\alpha \in \{1, 2\}$, let $q_{\alpha,i} = \delta_L(q_\alpha, a_1 \cdots a_i)$. Since there are at most $N^2$ distinct pairs $(q_{1,i}, q_{2,i})$, there exist $i, j$ with $0 \le i < j \le N^2$ such that $q_{1,i} = q_{1,j}$ and $q_{2,i} = q_{2,j}$. Since $\delta_L(q_1, w) \in C$ and $\delta_L(q_2, w) \in C$, we have $q_{1,i}, q_{2,i} \in C$. Let $w' = a_{i+1} \cdots a_j$. We have $w' \in \mathrm{Loop}(q_{1,i}) \cap \mathrm{Loop}(q_{2,i})$, hence $q_{1,i} = q_{2,i}$ by Lemma C.1. As a consequence, $\delta_L(q_1, w) = \delta_L(q_2, w)$. ◭

[Figure 4: Sketch of the proof of Lemma C.4]

◮ Lemma 4.7.
Let $L \in T_{\mathrm{tract}}$, let $(C, (v, q), e_K \cdots e_1)$ be an abbreviation and $E' \subseteq E$. There exists an NL algorithm that outputs a shortest trail $p$ such that $p \models_{E'} (C, (v, q), e_K \cdots e_1)$ if it exists and rejects otherwise.

Proof. We use an algorithm for reachability to search and output a shortest path $p$ from $(v, q)$ to $(t, q')$, for $t$ being the target node of $e_1$ and some $q' \in C$, in the product of $G$ (restricted to the edges $E' \cup \{e_1, \ldots, e_K\}$) and $C$ such that $e_K, \ldots, e_1$ are only used once, and $e_K \cdots e_1$ is the suffix of $p$. Since $K$ is a constant, this is in NL.

We show that $p$ is a trail. Assume towards a contradiction that $p = d_1 \cdots d_m e_K \cdots e_1$ is not a trail. Then there exists an edge $d_i = d_j$ with $i < j$ that appears at least twice in $p$. Note that $d_j \notin \{e_K, \ldots, e_1\}$ by definition of $p$. We define
$$p' = d_1 \cdots d_i\, d_{j+1} \cdots d_m\, e_K \cdots e_1$$
and show that $p'$ is shorter than $p$ but meets all requirements. Let $q_1 = \delta(q, d_1 \cdots d_i)$ and $q_2 = \delta(q, d_1 \cdots d_j)$. By definition, $q_1, q_2 \in C$ and both have an ingoing edge with label $\mathrm{lab}(d_i) = \mathrm{lab}(d_j)$. So we can use Corollary C.3 to ensure that $\delta(q_1, d_{j+1} \cdots d_m e_K \cdots e_1) \in C$. So we can apply Lemma 4.3 to prove that
$$\delta(q_1, d_{j+1} \cdots d_m e_K \cdots e_1) = \delta(q_2, d_{j+1} \cdots d_m e_K \cdots e_1).$$
So $p'$ indeed satisfies $p' \models_{E'} (C, (v, q), e_K \cdots e_1)$. Furthermore, $p'$ is shorter than $p$, contradicting our assumption. ◭

◮ Lemma 4.10.
Let $G$ and $(s, t)$ be an instance for $\mathrm{RTQ}(L)$, with $L \in T_{\mathrm{tract}}$. Then every shortest trail from $s$ to $t$ in $G$ that matches $L$ is admissible.

Proof. In this proof, we use the following notation for trails. By $p[e_1, e_2)$ we denote the prefix of $p[e_1, e_2]$ that excludes the last edge (so it can be empty). Notice that $p[e_1, e_2]$ and $p[e_1, e_2)$ are always well-defined for trails.

Let $p = d_1 \cdots d_m$ be a shortest trail from $s$ to $t$ that matches $L$. Let $S = \alpha_1 \cdots \alpha_k$ be the summary of $p$ and let $p_1, \ldots, p_k$ be trails such that $p = p_1 \cdots p_k$ and $p_i \models \alpha_i$ for all $i \in [k]$. We denote by $\mathrm{left}_i$ and $\mathrm{right}_i$ the first and last edge in $p_i$. By definition of $p_i$ and the definition of summaries, $\mathrm{left}_i$ and $\mathrm{right}_i$ are identical with $\mathrm{left}_C$ and $\mathrm{right}_C$ if $\alpha_i \in \mathrm{Abbrv}$ is an abbreviation for the component $C$.

Assume that $p$ is not admissible. That means there is some edge $e$ used in $p_\ell$ such that $e \notin \mathrm{Edge}_\ell$. There are two possible cases:
(1) $e \in \mathrm{Edge}_i$ for some $i < \ell$; and
(2) $e \notin \mathrm{Edge}_i$ for any $i$.

[Footnote: Observe that the standard NL decision algorithm for reachability can be adapted to this purpose, using a binary search procedure on the length of shortest paths.]

In case (1), we choose $i$ minimal such that some edge $e \in \mathrm{Edge}_i$ is used in $p_j$ for some $j > i$. Among all such edges $e \in \mathrm{Edge}_i$, we choose the edge that occurs latest in $p$. This implicitly maximizes $j$ for a fixed $i$. In particular, no edge from $\mathrm{Edge}_i$ is used in $p_{j+1} \cdots p_k$. Let $\alpha_i = (C_i, (v, q), e_K \cdots e_1)$. By definition of $\mathrm{Edge}_i$, there is a trail $\pi$ from $v$, ending with $e$, with $\delta_L(q, \pi) \in C_i$, and that is shorter than the subpath $p[\mathrm{left}_i, \mathrm{right}_i]$ and therefore shorter than $p[\mathrm{left}_i, e]$.

We will now show that $p' = p_1 \cdots p_{i-1}\, \pi\, p(e, d_m]$ is a trail. Since $p$ is a trail, it suffices to prove that the edges in $\pi$ are disjoint from the other edges in $p'$. We note that all intermediate edges of $\pi$ belong to $\mathrm{Edge}_i$. By minimality of $i$, no edge of $\mathrm{Edge}_i$ is used in $p_1 \cdots p_{i-1}$, and by our choice of $e$, no edge of $\mathrm{Edge}_i$ is used after $e$. This shows that $p'$ is a trail.

We now show that $p'$ matches $L$. Since $e$ appears in $p_j$, there is a path from $\mathrm{left}_j$ to $\mathrm{right}_j$ over $e$ that stays in $C_j$. Let $q_1$ and $q_2$ be the states of $A_L$ before and after reading $e$ in $p$ and, analogously, $q_1'$ and $q_2'$ the states of $A_L$ before and after reading $e$ in $p'$. That is,
$$q_1 = \delta_L(i_L, p[d_1, e)), \quad q_2 = \delta_L(q_1, e), \quad q_1' = \delta_L(i_L, p'[d_1, e)), \quad q_2' = \delta_L(q_1', e).$$
We note that in $p'$, $e$ is at the end of the subtrail $\pi$.

We can conclude that the states $q_1$ and $q_1'$ both have loops starting with $a = \mathrm{lab}(e)$, as the transition $(q_1, \mathrm{lab}(e), q_2)$ is read in $C_j$ and the transition $(q_1', \mathrm{lab}(e), q_2')$ is read in $C_i$. Furthermore, $q_1' \rightsquigarrow q_1$, since $q_1' \in C_i$ and $q_1 \in C_j$. Therefore, Lemma C.4 implies that $L_{q_1} \cap L^a_{q_1} \Sigma^* \subseteq L_{q_1'}$, where $L^a_{q_1}$ denotes all words $w$ of length $K$ that start with $a$ and such that $\delta_L(q_1, w) \in C_j$.

We have that $\mathrm{lab}(p[e, d_m]) \in L_{q_1}$ by the fact that $p$ matches $L$. We have that $\mathrm{lab}(p[e, d_m]) \in L^a_{q_1} \Sigma^*$, as, by the definition of summaries, $A_L$ stays in $C_j$ for at least $K$ more edges after reading $e$ in $p$. We can conclude that $\mathrm{lab}(p[e, d_m]) \in L_{q_1'}$, which proves that $p'$ matches $L$. This concludes case (1).

For case (2), we additionally assume w.l.o.g. that there is no edge $e \in \mathrm{Edge}_i$ that appears in some $p_j$ with $j > i$, i.e., no edge satisfies case (1). By definition of $\mathrm{Edge}_\ell$, there is a trail $\pi$ with $\pi \models_{\mathrm{Edge}_\ell} \alpha_\ell$ that is shorter than $p[\mathrm{left}_\ell, \mathrm{right}_\ell]$. We choose $p'$ as the path obtained from $p$ by replacing $p_\ell$ with $\pi$.

We will now show that $p' = p_1 \cdots p_{\ell-1} \cdot \pi \cdot p_{\ell+1} \cdots p_k$ is a trail. Since $p$ is a trail, it suffices to prove that the edges in $\pi$ are disjoint from the other edges in $p'$. We note that all intermediate edges of $\pi$ belong to $\mathrm{Edge}_\ell$. By definition of $\mathrm{Edge}_\ell$, no edge in $p_1 \cdots p_{\ell-1}$ is in $\mathrm{Edge}_\ell$. And by the assumption that there is no edge satisfying case (1), no edge in $p_{\ell+1} \cdots p_k$ is in $\mathrm{Edge}_\ell$. Therefore, $p'$ is a trail.

It remains to prove that $p'$ matches $L$. Let $(C, (v, \hat{q}), e_K \cdots e_1) = \alpha_\ell$ and let $q$ and $q'$ be the states in which $A_L$ is before reading $e_K$ in $p$ and $p'$, respectively. By definition of a summary, we have that $\delta_L(q, e_K \cdots e_1) \in C$ and, by definition of $\models$, we have that $\delta_L(q', e_K \cdots e_1) \in C$. By Lemma 4.3 we can conclude that $\delta_L(q, e_K \cdots e_1) = \delta_L(q', e_K \cdots e_1)$. As $p$ matches $L$, we can conclude that $p'$ also matches $L$. ◭

◮ Lemma 4.11. $P_{\mathrm{edge}}(L)$ is in NL for every $L \in T_{\mathrm{tract}}$.

Proof.
The proof is similar to the proof of Lemma 14 by Bagan et al. [5], which is based on the following result due to Immerman [17]: $\mathrm{NL}^{\mathrm{NL}} = \mathrm{NL}$. In other words, if a decision problem $P$ can be solved by an NL algorithm using an oracle in NL, then this problem $P$ belongs to NL. Let, for each $k \ge 0$, $P^{\le k}_{\mathrm{edge}}(L)$ be the decision problem $P_{\mathrm{edge}}(L)$ with the restriction $i \le k$, i.e., $(G, s, t, S, e, i)$ is a positive instance of $P^{\le k}_{\mathrm{edge}}(L)$ if and only if $(G, s, t, S, e, i)$ is a positive instance of $P_{\mathrm{edge}}(L)$ and $i \le k$. Notice that $i$ belongs to the input of $P^{\le k}_{\mathrm{edge}}(L)$ while this is not the case for $k$. Obviously, $P_{\mathrm{edge}}(L) = P^{\le |S|}_{\mathrm{edge}}(L)$.

We prove that $P^{\le k}_{\mathrm{edge}}(L) \in \mathrm{NL}$ for each $k \ge 0$. If $k = 0$, $P^{\le 0}_{\mathrm{edge}}(L)$ always returns False because $\mathrm{Edge}_i$ is not defined for $i = 0$. So $P^{\le 0}_{\mathrm{edge}}(L)$ is trivially in NL. Assume, by induction, that $P^{\le k}_{\mathrm{edge}}(L) \in \mathrm{NL}$. It suffices to show that there is an NL algorithm for $P^{\le k+1}_{\mathrm{edge}}(L)$ using $P^{\le k}_{\mathrm{edge}}(L)$ as an oracle. Since $\mathrm{NL}^{\mathrm{NL}} = \mathrm{NL}$, this implies that $P^{\le k+1}_{\mathrm{edge}}(L) \in \mathrm{NL}$.

Let $(G, s, t, S, e, i)$ be an instance of $P^{\le k+1}_{\mathrm{edge}}(L)$. If $i \le k$, we return the same answer as the oracle $P^{\le k}_{\mathrm{edge}}(L)$. If $i = k + 1$ and $\alpha_i \in E$, we return False, as $\mathrm{Edge}_i = \emptyset$. If $i = k + 1$ and $\alpha_i \in \mathrm{Abbrv}$, we first compute the length $m$ of a minimal trail $p$ such that $p \models_{E_i} \alpha_i$ using the NL algorithm of Lemma 4.7. We note that we can compute $E_i$ using the NL algorithm for $P^{\le k}_{\mathrm{edge}}$.

To test whether the edge $e$ can be used by a trail from some $(v, q)$ in at most $m - K$ steps, we use the on-the-fly product automaton of $G$ and $A_L$ restricted to the edges in $E_i$ and states in $C$. We search for a shortest path from $(v, q)$ to some $(v', q') \in V \times C$ that ends with $e$. Recall that reachability is in NL.

We note that this trail in the product graph might correspond to a path $p$ with a cycle in $G$. As we project away the states, some distinct edges in the product graph are actually the same edge in $G$. However, by Lemma C.2, we can remove all cycles from $p$ without losing the property that $\delta_L(q, p) \in C$. This concludes the proof. ◭

◮ Lemma C.5.
Let $L$ be a language in $T_{\mathrm{tract}}$. There exists an NL algorithm that, given an instance $G$, $(s, t)$ of $\mathrm{RTQ}(L)$ and a candidate summary $S = \alpha_1 \cdots \alpha_k$, tests whether there is a trail $p$ from $s$ to $t$ in $G$ with summary $S$ that matches $L$.

Proof. We propose the following algorithm, which consists of three tests:
(1) Guess, on-the-fly, a path $p$ from $S$ by replacing each $\alpha_i$ by a trail $p_i$ such that $p_i \models_{\mathrm{Edge}_i} \alpha_i$ for each $i \in [k]$. This test succeeds if and only if this is possible.
(2) In parallel, check that $p$ matches $L$.
(3) In parallel, check that $S$ is a summary of $p$.

We first prove that the algorithm is correct. First, we assume that there is a trail with summary $S$ from $s$ to $t$ that matches $L$. Then, there is also a shortest such trail and, by Lemma 4.10, this trail is admissible. Therefore, the algorithm will succeed.

Conversely, assume that the algorithm succeeds. Since $E(S)$ and all the sets $\mathrm{Edge}_i$ are mutually disjoint, the path $p$ is always a trail. By tests (2) and (3), it is a trail from $s$ to $t$ that matches $L$.

We still have to check the complexity. We note that the sets $\mathrm{Edge}_i$ are not stored in memory: we only need to check on-the-fly if a given edge belongs to those sets, which only requires logarithmic space according to Lemma 4.11. Therefore, we use an on-the-fly adaptation of the NL algorithm from Lemma 4.7, which requires a set $\mathrm{Edge}_i$ as input, which we will provide on-the-fly.

Testing if $p$ matches $L$ can simply be done in parallel to test (1) on an edge-by-edge basis, by maintaining the current state of $A_L$ in memory. If we do so, we can also check in parallel if $S = \alpha_1 \cdots \alpha_k$ is a summary of $p$. This is simply done by checking, for each $\alpha_i$ of the form $(C, (v, q), e_K \cdots e_1)$ and $\alpha_{i+1} = e$, whether $e \notin C$. This ensures that, after being in $C$ for at least $K$ edges, the path $p$ leaves the component $C$, which is needed for summaries. Furthermore, we test if there is no substring $\alpha_i \cdots \alpha_j$ in $S$ that purely consists of edges that are visited in the same component $C$, but which is too long to fulfill the definition of a summary. Since this maximal length is a constant, we can check it in NL. ◭

[Footnote: A shortest path is necessarily a trail.]

We eventually show the main lemma of this section, proving that $\mathrm{RTQ}(L)$ is tractable for every language in $T_{\mathrm{tract}}$.

◮ Lemma C.6.
Let L ∈ T tract . Then, RTQ ( L ) ∈ NL.
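To make the problem statement concrete before the proof, the following is a minimal sketch of what $\mathrm{RTQ}(L)$ asks. It is a plain exponential-time backtracking search over trails and is not the NL procedure of Lemma C.5; the function name `find_trail`, the edge-list encoding, and the dictionary encoding of the DFA are assumptions made only for this illustration.

```python
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (source node, label, target node)


def find_trail(
    edges: List[Edge],
    delta: Dict[Tuple[int, str], int],  # partial DFA transition function (hypothetical encoding)
    init: int,
    final: Set[int],
    s: str,
    t: str,
) -> Optional[List[int]]:
    """Return edge indices of some trail from s to t whose label word the DFA accepts.

    Brute force: every edge index may be used at most once along the walk.
    Returns None if no such trail exists.
    """
    outgoing: Dict[str, List[int]] = {}
    for idx, (u, _, _) in enumerate(edges):
        outgoing.setdefault(u, []).append(idx)

    def search(node: str, state: int, used: Set[int], trail: List[int]) -> Optional[List[int]]:
        if node == t and state in final:
            return list(trail)
        for idx in outgoing.get(node, []):
            if idx in used:
                continue  # trail semantics: no repeated edge
            _, label, target = edges[idx]
            nxt = delta.get((state, label))
            if nxt is None:
                continue  # the DFA has no transition, so this branch cannot match
            used.add(idx)
            trail.append(idx)
            found = search(target, nxt, used, trail)
            if found is not None:
                return found
            trail.pop()
            used.remove(idx)
        return None

    return search(s, init, set(), [])


# DFA for the single word "aa" (states 0 -> 1 -> 2, state 2 accepting).
AA = {(0, "a"): 1, (1, "a"): 2}
one_loop = [("x", "a", "x")]
two_loops = [("x", "a", "x"), ("x", "a", "x")]
```

With a single `a`-labeled self-loop the word `aa` is matched by a path that repeats the edge but by no trail, whereas two parallel self-loops admit a matching trail; this is exactly the difference between path and trail semantics that the lemmas above have to handle.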
Proof. We simply enumerate all possible candidate summaries $S$ w.r.t. $(L, G, s, t)$, and apply the algorithm of Lemma C.5 to each summary. We return “Yes” if this algorithm succeeds for at least one summary and “No” otherwise. Since the algorithm succeeds if and only if there exists an admissible path from $s$ to $t$ that matches $L$, and consequently, if and only if there is a trail from $s$ to $t$ that matches $L$ (Lemma 4.10), this is the right answer. Since $L$ is fixed, there is a polynomial number of candidate summaries, each of logarithmic size. Consequently, they can be enumerated within logarithmic space. ◭

◮ Lemma 4.12. Let $L \in T_{\mathrm{tract}}$ and $L$ be infinite. Then, $\mathrm{RTQ}(L)$ is NL-complete.

Proof. The upper bound is due to Lemma C.6, the lower bound is due to reachability in directed graphs being NL-hard. ◭

◮ Corollary 4.13. Let $L \in T_{\mathrm{tract}}$, $G$ be a graph, and $s$, $t$ be nodes in $G$. If there exists a trail from $s$ to $t$ that matches $L$, then we can output a shortest such trail in polynomial time (and in nondeterministic logarithmic space).

Proof. For each candidate summary $S$, we first use Lemma C.5 to decide whether there exists an admissible trail with summary $S$. With the algorithm in Lemma 4.7, we then compute the minimal length $m_i$ of each $p_i$. The sum of these $m_i$ is then the length of a shortest trail that is a completion of $S$. We keep track of a summary of one of the shortest trails, and finally recompute and output the overall shortest trail completing this summary. Notice that this algorithm is still in NL since the summaries have constant size and overall counters never exceed $|E|$. ◭

◮ Lemma C.7.
TwoEdgeDisjointPaths is NP-complete.
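As an illustration of the problem used in this lemma, here is a small brute-force decision procedure, under the assumption (reconstructed from the instance $(G, s_1, t_1, s_2, t_2)$ used in the proof of Lemma C.10) that TwoEdgeDisjointPaths asks for a path from $s_1$ to $t_1$ and a path from $s_2$ to $t_2$ that share no edge. The function name and the edge-list encoding are hypothetical. The sketch runs in exponential time, which is in line with the NP-completeness stated above, and it enumerates only node-simple first paths, which suffices because removing cycles from the first path can only free edges.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def two_edge_disjoint_paths(
    edges: List[Tuple[str, str]], s1: str, t1: str, s2: str, t2: str
) -> bool:
    """Decide whether edge-disjoint paths s1 -> t1 and s2 -> t2 exist (brute force)."""
    outgoing: Dict[str, List[int]] = {}
    for idx, (u, _) in enumerate(edges):
        outgoing.setdefault(u, []).append(idx)

    def reachable(src: str, dst: str, banned: Set[int]) -> bool:
        # Breadth-first search that never uses an edge index in `banned`.
        seen = {src}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            if u == dst:
                return True
            for idx in outgoing.get(u, []):
                v = edges[idx][1]
                if idx not in banned and v not in seen:
                    seen.add(v)
                    queue.append(v)
        return False

    def extend(node: str, used: Set[int], visited: Set[str]) -> bool:
        # `used` holds the edge indices of the current simple path from s1 to `node`.
        if node == t1:
            return reachable(s2, t2, used)
        for idx in outgoing.get(node, []):
            v = edges[idx][1]
            if v in visited:
                continue
            used.add(idx)
            visited.add(v)
            if extend(v, used, visited):
                return True
            used.remove(idx)
            visited.remove(v)
        return False

    return extend(s1, set(), {s1})


# The two paths may share the node "m" as long as they share no edge.
crossing = [("s1", "m"), ("m", "t1"), ("s2", "m"), ("m", "t2")]
# Both paths are forced through the single edge ("x", "y").
bottleneck = [("s1", "x"), ("s2", "x"), ("x", "y"), ("y", "t1"), ("y", "t2")]
```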
Proof. Fortune et al. [13] showed that the variant of TwoEdgeDisjointPaths that asks for node-disjoint paths is NP-complete. The reductions from LaPaugh and Rivest [20, Lemma 1 and 2] or Perl and Shiloach [28, Theorem 2.1 and 2.2] then imply that NP-completeness also holds for edge-disjoint paths. ◭

To prove the lower bound, we first show that every regular language that is not in $T_{\mathrm{tract}}$ admits a witness for hardness, which is defined as follows.

◮ Definition C.8. A witness for hardness is a tuple $(q, w_m, w_r, w_1, w_2)$ with $q \in Q_L$, $w_m, w_r, w_1, w_2 \in \Sigma^*$, $w_1 \in \mathrm{Loop}(q)$, such that there exists a symbol $a \in \Sigma$ with $w_1 = a w_1'$ and $w_2 = a w_2'$, and satisfying
$$w_m (w_2)^* w_r \subseteq L_q \quad\text{and}\quad (w_1 + w_2)^* w_r \cap L_q = \emptyset.$$

Before we prove that each regular language that is not in $T_{\mathrm{tract}}$ has such a witness, recall Property P:
$$L_{q_2} \subseteq L_{q_1} \text{ for all } q_1, q_2 \in Q_L \text{ such that } q_1 \rightsquigarrow q_2 \text{ and } \mathrm{Loop}(q_1) \cap \mathrm{Loop}(q_2) \neq \emptyset.$$

◮ Lemma C.9.
Let $L$ be a regular language that does not belong to $T_{\mathrm{tract}}$. Then, $L$ admits a witness for hardness.

Proof. Let $L$ be a regular language that does not belong to $T_{\mathrm{tract}}$. Then there exist $q_1, q_2 \in Q_L$ and words $w_1, w_2$ with $w_1 = a w_1'$ and $w_2 = a w_2'$ such that $w_1 \in \mathrm{Loop}(q_1)$, $w_2 \in \mathrm{Loop}(q_2)$, and $q_1 \rightsquigarrow q_2$ such that $w_2^M w_r' \notin L_{q_1}$ for a $w_r' \in L_{q_2}$. Let $w_m$ be a word with $\delta_L(q_1, w_m) = q_2$. We set $w_r = w_2^M w_r'$.

We now show that the tuple $(q_1, w_m, w_r, w_1, w_2)$ defined in this way is a witness for hardness. By definition, we have $w_m (w_2)^* w_r \subseteq L_{q_1}$. For convenience, we distinguish two cases, depending on whether $L$ satisfies Property P or not. If $L$ does not satisfy P, we can assume w.l.o.g. that in our tuple we have $w_1 = w_2$ and since $w_2^M w_r' \notin L_{q_1}$, we also have $w_1^* w_r \cap L_{q_1} = \emptyset$, so it is indeed a witness for hardness.

Otherwise, $L$ is aperiodic, see Lemma A.1. We assume w.l.o.g. that $w_1 = x^M$ for some word $x$. Then, we claim that $L_{q'} \subseteq L_{q_1}$ for every $q'$ in $\delta_L(q_1, \Sigma^* w_1)$. Indeed, every $q' \in \delta_L(q_1, \Sigma^* w_1)$ loops over $w_1$ by the pumping lemma and aperiodicity of $L$, hence $w_1 \in \mathrm{Loop}(q_1) \cap \mathrm{Loop}(q')$ and therefore $L_{q'} \subseteq L_{q_1}$ due to Property P.

It remains to prove that $(w_1 + w_2)^* w_r \cap L_{q_1} = \emptyset$. Every word in $(w_1 + w_2)^* w_r$ can be decomposed into $uv$ with $u \in \varepsilon + (w_1 + w_2)^* w_1$ and $v \in (w_2^*) w_r$. For $q' \in \delta_L(q_1, u)$ we have proved that $L_{q'} \subseteq L_{q_1}$, so it suffices to show that $v \notin L_{q_1}$. This is immediate from $w_r = w_2^M w_r' \notin L_{q_1}$ and the aperiodicity of $L$. So we have $uv \notin L_{q_1}$ and the tuple $(q_1, w_m, w_r, w_1, w_2)$ is indeed a witness for hardness. ◭

We can now show the following.

◮ Lemma C.10.
Let $L$ be a regular language that does not belong to $T_{\mathrm{tract}}$. Then, $\mathrm{RTQ}(L)$ is NP-complete.

Proof. The proof is almost identical to the reduction from two node-disjoint paths to the $\mathrm{RSPQ}(L)$ problem by Bagan et al. [5]. Clearly, $\mathrm{RTQ}(L)$ is in NP for every regular language $L$, since we only need to guess a trail of length at most $|E|$ from $s$ to $t$ and verify that the word on the trail is in $L$. Let $L \notin T_{\mathrm{tract}}$. We exhibit a reduction from TwoEdgeDisjointPaths to $\mathrm{RTQ}(L)$. According to Lemma C.9, $L$ admits a witness for hardness $(q, w_m, w_r, w_1, w_2)$. Let $w_\ell$ be a word such that $\delta_L(i_L, w_\ell) = q$. By definition we get $w_\ell (w_1 + w_2)^* w_r \cap L = \emptyset$ and $w_\ell w_1^* w_m w_2^* w_r \subseteq L$.

We build from $G$ a graph $G'$ whose edges are labeled by non-empty words. This is actually a generalization of directed, edge-labeled graphs. Nevertheless, by adding intermediate nodes, an edge labeled by a word $w$ can be replaced with a path whose edges form the word $w$. The graph $G'$ is constructed as follows. The nodes of $G'$ are the same as the nodes of $G$. Let $a \in \Sigma$ with $w_1 = a w_1'$ and $w_2 = a w_2'$. If $w_1'$ or $w_2'$ is empty, we replace it with $a$. For each edge $(v_1, v_2)$ in $G$, we add a new node $v_{12}$ and three edges $(v_1, a, v_{12})$, $(v_{12}, w_1', v_2)$, and $(v_{12}, w_2', v_2)$. Moreover, we add two new nodes $s, t$ and three edges $(s, w_\ell, s_1)$, $(t_1, w_m, s_2)$, and $(t_2, w_r, t)$. An example of how the edges are replaced can be seen in Figure 5.

By construction, for every trail $p$ from $s$ to $t$ in $G'$ that contains the edge $(t_1, w_m, s_2)$, we can obtain a similar path that matches a word in $w_\ell w_1^* w_m w_2^* w_r$ by switching the edges labeled $w_1'$ and $w_2'$ while keeping the same nodes. Every trail $p$ from $s$ to $t$ in $G'$ that does not contain the edge $(t_1, w_m, s_2)$ matches a word in $w_\ell (w_1 + w_2)^* w_r$. By definition of witness for hardness, no path of that form matches $L$. Thus, $\mathrm{RTQ}(L)$ returns true for $(G', s, t)$ iff there is a trail from $s$ to $t$ in $G'$ that contains the edge $(t_1, w_m, s_2)$, that is, iff TwoEdgeDisjointPaths returns true for $(G, s_1, t_1, s_2, t_2)$. ◭

[Figure 5: Assume $w_1 = a$ and $w_2 = abc$. Then every edge $(v_1, v_2)$ in the reduction in the proof of Lemma C.10 will be replaced by this construction.]

[Figure 6: Sketch of the proof of Lemma D.1]
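The graph construction in the proof of Lemma C.10 can be sketched in code as follows. This is only an illustration under stated assumptions: the function names, the string node identifiers, and the fresh-node naming scheme `_n0, _n1, ...` are hypothetical; the words $w_\ell, w_m, w_r$ are assumed to be non-empty; and the names `s`, `t`, and `_n...` are assumed not to occur among the nodes of $G$. Word-labeled edges are expanded into paths of single-symbol edges, as the proof describes.

```python
from itertools import count
from typing import Iterator, List, Tuple

LabeledEdge = Tuple[str, str, str]  # (source node, symbol, target node)


def expand_word_edge(
    u: str, word: str, v: str, fresh: Iterator[str], out: List[LabeledEdge]
) -> None:
    """Replace the word-labeled edge u --word--> v by a path of single-symbol edges."""
    assert word, "only non-empty words can label an edge"
    cur = u
    for k, symbol in enumerate(word):
        nxt = v if k == len(word) - 1 else next(fresh)
        out.append((cur, symbol, nxt))
        cur = nxt


def build_reduction_graph(
    g_edges: List[Tuple[str, str]],
    s1: str, t1: str, s2: str, t2: str,
    w_l: str, w_m: str, w_r: str, w1: str, w2: str,
) -> Tuple[List[LabeledEdge], str, str]:
    """Build G' with new endpoints "s" and "t" from an unlabeled instance (G, s1, t1, s2, t2)."""
    a = w1[0]
    assert w2[0] == a, "w1 and w2 must start with the same symbol"
    # As in the proof text: an empty remainder w1' or w2' is replaced with a.
    w1_rest = w1[1:] or a
    w2_rest = w2[1:] or a
    fresh = (f"_n{k}" for k in count())
    edges: List[LabeledEdge] = []
    for v1, v2 in g_edges:
        mid = next(fresh)
        # The shared a-edge can be used only once by a trail, so each edge of G
        # is usable at most once in G'.
        edges.append((v1, a, mid))
        expand_word_edge(mid, w1_rest, v2, fresh, edges)
        expand_word_edge(mid, w2_rest, v2, fresh, edges)
    expand_word_edge("s", w_l, s1, fresh, edges)
    expand_word_edge(t1, w_m, s2, fresh, edges)
    expand_word_edge(t2, w_r, "t", fresh, edges)
    return edges, "s", "t"


# The situation of Figure 5: w1 = a and w2 = abc, applied to a single edge (u, v).
demo_edges, demo_s, demo_t = build_reduction_graph(
    [("u", "v")], "u", "v", "u", "v", w_l="x", w_m="y", w_r="z", w1="a", w2="abc"
)
```

The design point worth noting is the shared first edge of each gadget: because both continuations start with the same symbol $a$, a trail in $G'$ cannot traverse the same original edge twice, which is what turns trail semantics into edge-disjointness.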
D Proofs for Section 5 (Recognition and Closure Properties)
Before we establish the complexity of deciding for a regular language $L$ whether $L \in T_{\mathrm{tract}}$, we need the following lemma, which has been adapted from the simple path case (Lemma 4 in [5]).

◮ Lemma D.1. Let $L$ be a regular language. Then, $L$ belongs to $T_{\mathrm{tract}}$ if and only if for all pairs of states $q_1, q_2 \in Q_L$ and symbols $a \in \Sigma$ such that $q_1 \rightsquigarrow q_2$ and $\mathrm{Loop}(q_1) \cap a\Sigma^* \neq \emptyset$, the following statement holds:
$$(\mathrm{Loop}(q_2) \cap a\Sigma^*)^N L_{q_2} \subseteq L_{q_1}.$$

Proof.
The (if) implication is immediate by Corollary B.3. Let us now prove the (only if)implication. Since the proof of this lemma requires a number of different states and words,we provide a sketch in Figure 6. Assume L ∈ T tract . Let q , q be two states such thatLoop( q ) ∩ a Σ ∗ = ∅ and q q . If Loop( q ) ∩ a Σ ∗ = ∅ , the statement follows immediately.So let us assume w.l.o.g. that Loop( q ) ∩ a Σ ∗ = ∅ . Let v , . . . , v N ∈ (Loop( q ) ∩ a Σ ∗ ) bearbitrary words and q = δ L ( q , v · · · v N ). We want to prove L q ⊆ L q . For some i, j with0 ≤ i < j ≤ N , we get δ L ( q , v · · · v i ) = δ L ( q , v · · · v j ) due to the pumping Lemma. (Wehave δ L ( q , v · · · v i ) = q for i = 0.) Let u = v · · · v i , u = v i +1 · · · v j and u = v j +1 · · · v k .Let q = δ L ( q , u ).We claim that L q ⊆ L q . The result then follows from L q = u − L q ⊆ u − L q = L q . To prove the claim, let w = u u N and q = δ L ( q , w N ). As w ∈ Loop( q ), we can useCorollary B.3 to obtain w N L q ⊆ L q . Together with L q = ( w N ) − L q this implies L q ⊆ L q . Furthermore, u belongs to Loop( q ) because L is aperiodic. To conclude the proof, weobserve that L q ⊆ L q , by Corollary B.3 with q , q and u , and because δ L ( q , u N ) = q and u ∈ Loop( q ) . ◭◮ Theorem 5.1.
Testing whether a regular language $L$ belongs to $T_{\mathrm{tract}}$ is
(1) NL-complete if $L$ is given by a DFA and
(2) PSPACE-complete if $L$ is given by an NFA or by a regular expression.

Proof. The proof is inspired by Bagan et al. [4]. The upper bound for (1) needs several adaptations; the lower bound for (1) and the proof for (2) work exactly as in [4] (just replacing $SP_{\mathrm{tract}}$ by $T_{\mathrm{tract}}$).

We now provide the proof in detail. We first prove (1). W.l.o.g., we can assume that $L$ is given by the minimal DFA $A_L$, as testing Nerode-equivalence of two states is in NL. By Lemma D.1, we need to check for each pair of states $q_1, q_2$ and symbol $a \in \Sigma$ whether
(i) $q_1 \rightsquigarrow q_2$;
(ii) $\mathrm{Loop}(q_1) \cap a\Sigma^* \neq \emptyset$; and
(iii) $(\mathrm{Loop}(q_2) \cap a\Sigma^*)^N L_{q_2} \setminus L_{q_1} = \emptyset$.
Statements (i) and (ii) are easily verified using an NL algorithm for transitive closure. For (iii), we test emptiness of $(\mathrm{Loop}(q_2) \cap a\Sigma^*)^N L_{q_2} \setminus L_{q_1}$ using an NL algorithm for reachability in the product automaton of $A_L$ with itself, starting in the state $(q_2, q_1)$. More precisely, the algorithm checks whether there does not exist a string that is in $L_{q_2}$, is not in $L_{q_1}$, starts with an $a$, and leaves the state $q_2$ (in the left copy of $A_L$) at least $N$ times with an $a$-transition.

The remainder of the proof is from [4] and only included for self-containedness.

For the lower bound of (1), we give a reduction from the Emptiness problem. Let $L \subseteq \Sigma^*$ be an instance of Emptiness given by a DFA $A_L$. W.l.o.g. we assume that $\varepsilon \notin L$, since this can be checked in constant time. Furthermore, we assume that the symbol $1$ does not belong to $\Sigma$. Let $L' = 1^+ L 1^+$. A DFA $A_{L'}$ that recognizes $L'$ can be obtained from $A_L$ as follows. We add a state $q_I$ that will be the initial state of $A_{L'}$, and a state $q_F$ that will be the unique final state of $A_{L'}$. The transition function $\delta_{L'}$ is the extension of $\delta_L$ defined as follows:
$\delta_{L'}(q_I, 1) = q_I$ and $\delta_{L'}(q_I, a) = i_L$ for every symbol $a \in \Sigma$.
For every final state $q \in F_L$, $\delta_{L'}(q, 1) = q_F$.
$\delta_{L'}(q_F, 1) = q_F$.
If $L$ is empty, then $L' = \emptyset$ belongs to $T_{\mathrm{tract}}$. Assume that $L$ is not empty. Let $w \in L$. Then, for every $n \in \mathbb{N}$, $1^n w 1^n \in L'$ and $1^n 1^n \notin L'$. Thus $L' \notin T_{\mathrm{tract}}$.

For the upper bound of (2), we first observe the following fact: Let $A, B$ be two problems such that $A \in \mathrm{NL}$ and let $t$ be a reduction from $B$ to $A$ that works in polynomial space and produces an exponential output. Then $B$ belongs to PSPACE. Thus, we can apply the classical powerset construction for determinization on the NFA and use the upper bound from (1).

For the lower bound of (2), we give a reduction from the Universality problem. Let $L \subseteq \{0, 1\}^*$ be an instance of Universality given by an NFA or a regular expression. Consider $L' = (0 + 1)^* a^* b a^* + L a^*$ over the alphabet $\{0, 1, a, b\}$. Our reduction associates $L'$ to $L$ and keeps the same representation (NFA or regular expression). If $L = \{0, 1\}^*$, then $L' = (0 + 1)^* a^* (b a^* + \varepsilon)$ and thus $L' \in T_{\mathrm{tract}}$. Conversely, assume $L \neq \{0, 1\}^*$. Let $w \in \{0, 1\}^* \setminus L$. Then, for every $n \in \mathbb{N}$, $w a^n b a^n \in L'$ and $w a^n a^n \notin L'$. Thus $L' \notin T_{\mathrm{tract}}$. ◭

◮ Lemma 5.2.
Both classes $\mathsf{SP}_{\mathrm{tract}}$ and $\mathsf{T}_{\mathrm{tract}}$ are closed under (i) finite unions, (ii) finite intersections, (iii) reversal, (iv) left and right quotients, (v) inverses of non-erasing morphisms, (vi) removal and addition of individual strings.

Proof.
Let $\mathcal{C} \in \{\mathsf{SP}_{\mathrm{tract}}, \mathsf{T}_{\mathrm{tract}}\}$ and $L_1, L_2 \in \mathcal{C}$. Let $n = \max(n_1, n_2)$, where $n_i \in \mathbb{N}$ is the smallest number such that $w_\ell w_1^{n_i} w_m w_2^{n_i} w_r \in L_i$ implies $w_\ell w_1^{n_i} w_2^{n_i} w_r \in L_i$ for $i \in \{1, 2\}$.

The proofs for (i) to (vi) all establish that closure under left-synchronized power abbreviations (or the analogous property for $\mathsf{SP}_{\mathrm{tract}}$) is preserved under the operations. Let therefore $w_\ell, w_m, w_r \in \Sigma^*$ and $w_1, w_2 \in \Sigma^+$ (if $\mathcal{C} = \mathsf{SP}_{\mathrm{tract}}$) or $w_1, w_2 \in a\Sigma^*$ for some $a \in \Sigma$ (if $\mathcal{C} = \mathsf{T}_{\mathrm{tract}}$).

(i) If $w_\ell w_1^n w_m w_2^n w_r \in L_1 \cup L_2$, then there exists $i \in \{1, 2\}$ with $w_\ell w_1^n w_m w_2^n w_r \in L_i$ and thus $w_\ell w_1^n w_2^n w_r \in L_i$. So, $w_\ell w_1^n w_2^n w_r \in L_1 \cup L_2$.

(ii) If $w_\ell w_1^n w_m w_2^n w_r \in L_1 \cap L_2$, then $w_\ell w_1^n w_m w_2^n w_r \in L_i$ and thus $w_\ell w_1^n w_2^n w_r \in L_i$ for each $i \in \{1, 2\}$. So, $w_\ell w_1^n w_2^n w_r \in L_1 \cap L_2$.

(iii) The "reversal" of the definitions defines the same class of languages, see Definition 3.11 (resp. Theorem 3.5).

(iv) Let $w \in \Sigma^*$. If $w_\ell w_1^n w_m w_2^n w_r \in w^{-1} L_1$, then $(w w_\ell) w_1^n w_m w_2^n w_r \in L_1$ and therefore $(w w_\ell) w_1^n w_2^n w_r \in L_1$. This implies $w_\ell w_1^n w_2^n w_r \in w^{-1} L_1$. Closure under right quotients follows from closure under left quotients together with closure under reversal.

(v) Let $h$ be a non-erasing morphism. Let $w_\ell w_1^n w_m w_2^n w_r \in h^{-1}(L_1)$. Then we have $h(w_\ell w_1^n w_m w_2^n w_r) \in L_1$, so $h(w_\ell) h(w_1)^n h(w_m) h(w_2)^n h(w_r) \in L_1$. Since $h(w_1), h(w_2)$ are nonempty in the case $\mathcal{C} = \mathsf{SP}_{\mathrm{tract}}$ and $h(w_1), h(w_2)$ are in $h(a)\Sigma^*$ in the case $\mathcal{C} = \mathsf{T}_{\mathrm{tract}}$, it follows that $h(w_\ell) h(w_1)^n h(w_2)^n h(w_r) \in L_1$. This implies $w_\ell w_1^n w_2^n w_r \in h^{-1}(L_1)$.

(vi) Let $w'$ be any word from $L_1$. Here we choose $n = \max(n_1, |w'|)$. Let $w_\ell w_1^n w_m w_2^n w_r \in L_1 \setminus \{w'\}$. As $w_\ell w_1^n w_m w_2^n w_r \in L_1$, we have $w_\ell w_1^n w_2^n w_r \in L_1$. Since $|w'| < |w_\ell w_1^n w_2^n w_r|$, this word differs from $w'$, and therefore $w_\ell w_1^n w_2^n w_r \in L_1 \setminus \{w'\}$. The proof for $L_1 \cup \{w'\}$ is analogous. ◭

E Proofs for Section 6 (Enumeration)

◮ Theorem 6.1.
Let $L$ be a regular language, $G$ be a graph and $(s, t)$ a pair of nodes in $G$. If $\mathrm{NL} \neq \mathrm{NP}$, then one can enumerate trails from $s$ to $t$ that match $L$ in polynomial delay in data complexity if and only if $L \in \mathsf{T}_{\mathrm{tract}}$.

Notice that we cannot simply use the line graph construction and solve this problem for simple paths, since the class of regular languages that is tractable for simple paths is a strict subset of $\mathsf{T}_{\mathrm{tract}}$. So this method would not, for example, solve the problem for $L = (ab)^*$. Instead, we can change Yen's algorithm [37] to work with trails instead of simple paths. The changes are straightforward: instead of deleting nodes, we only delete edges. In Algorithm 1 we changed Yen's algorithm to enumerate trails that match $L$. Note that we only need $L \in \mathsf{T}_{\mathrm{tract}}$ to ensure that the subroutines in lines 3 and 10 run in polynomial time. In the algorithm, $p[1, i]$ denotes the prefix of $p$ containing exactly $i$ edges and $\mathrm{target}(p)$ denotes the last node of $p$.

Explanation of Algorithm 1
In line 3, we can use the algorithm explained in the proof of Theorem 4.1 (more concretely, Corollary 4.13) to find a trail that matches $L$ from $s$ to $t$ in $G$ in NL if one exists. In the for-loop in line 7 we use derivatives of the last trail written to the output to find new candidates. Intuitively, for all $i \in \mathbb{N}$, we regard all paths that share the prefix of length exactly $i$ with the last path and do not share a prefix of length $i + 1$ with any path output so far. In line 10, we search for a suffix to the prefix $p[1, i]$ by again using the algorithm explained in the proof of Theorem 4.1. Recall that $\mathsf{T}_{\mathrm{tract}}$ is closed under left derivatives and removal of individual strings, see Lemma 5.2, i.e., $(\mathrm{lab}(p[1, i]))^{-1} L \setminus \{\varepsilon\}$ is in $\mathsf{T}_{\mathrm{tract}}$. However, to prevent finding a trail that was already in the output, we do not allow the suffix to start with some edge from $S$. We note that the algorithm from Corollary 4.13 can be easily modified to check for this additional condition. We repeat this procedure with all trails in $B$, until we do not find any new trails.

ALGORITHM 1:
Yen’s algorithm changed to work with trails
Input: Graph G = (V, E), nodes s, t ∈ V, a language L ∈ T_tract
Output: All trails from s to t in G that match L
 1  A ← ∅    ⊲ A is the set of trails already written to output
 2  B ← ∅    ⊲ B is a set of trails from s to t matching L
 3  p ← a shortest trail from s to t matching L    ⊲ p ← ⊥ if no such trail exists
 4  while p ≠ ⊥ do
 5      output p
 6      Add p to A
 7      for i = 0 to |p| do
 8          G′ ← (V, E \ p[1, i])    ⊲ Delete the edges of the path p[1, i]
 9          S ← { e ∈ E | p[1, i] · e is a prefix of a trail in A }
10          p′ ← a shortest trail from target(p[1, i]) to t in G′ that matches ((lab(p[1, i]))⁻¹ L) \ {ε} and does not start with an edge from S    ⊲ p′ ← ⊥ if no such trail exists
11          if p′ ≠ ⊥ then add p[1, i] · p′ to B
12      end
13      p ← a shortest trail in B    ⊲ p ← ⊥ if B = ∅
14      Remove p from B
15  end
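The control flow of Algorithm 1 can be illustrated with a small Python sketch. It is only an illustration under several assumptions that are not part of the paper: the language is given as a (possibly partial) DFA with transition dictionary `delta`, edges are triples `(source, label, target)` identified by their list index, and the helper `shortest_trail` is a hypothetical brute-force breadth-first search that stands in for the subroutine of Theorem 4.1. The stand-in may take exponential time, so the sketch does not achieve polynomial delay; it only mirrors lines 1 to 15.

```python
from collections import deque


def shortest_trail(edges, delta, accepting, state, src, dst,
                   removed, banned_first, nonempty):
    """Brute-force placeholder (NOT the algorithm of Theorem 4.1).

    Returns a shortest trail (tuple of edge ids) from src to dst whose labels
    drive the DFA from `state` into an accepting state, or None.
    `removed` edges may not be used, the first edge may not be in
    `banned_first`, and the empty trail is rejected if `nonempty` is set.
    """
    queue = deque([(src, state, ())])
    while queue:
        node, q, used = queue.popleft()
        if node == dst and q in accepting and (used or not nonempty):
            return used
        for eid, (u, label, v) in enumerate(edges):
            if u != node or eid in removed or eid in used:
                continue  # a trail never repeats an edge
            if not used and eid in banned_first:
                continue  # forbidden first edge (set S in line 9)
            nxt = delta.get((q, label))
            if nxt is not None:
                queue.append((v, nxt, used + (eid,)))
    return None


def enumerate_trails(edges, delta, accepting, start, s, t):
    """Yield all trails from s to t matching the DFA, following Algorithm 1."""
    A = []        # trails already written to the output
    B = set()     # candidate trails
    p = shortest_trail(edges, delta, accepting, start, s, t,
                       set(), set(), False)
    while p is not None:
        yield p
        A.append(p)
        for i in range(len(p) + 1):
            prefix = p[:i]
            # Reading lab(p[1,i]) takes the left derivative of the language.
            q, node = start, s
            for eid in prefix:
                q = delta[(q, edges[eid][1])]
                node = edges[eid][2]
            S = {a[i] for a in A if len(a) > i and a[:i] == prefix}
            suffix = shortest_trail(edges, delta, accepting, q, node, t,
                                    set(prefix), S, True)
            if suffix is not None:
                B.add(prefix + suffix)
        p = min(B, key=lambda c: (len(c), c)) if B else None
        if p is not None:
            B.remove(p)


if __name__ == "__main__":
    # L = (ab)*: state 0 is initial and accepting.
    delta = {(0, "a"): 1, (1, "b"): 0}
    edges = [("s", "a", "x"), ("x", "b", "t"), ("t", "a", "y"),
             ("y", "b", "t"), ("s", "a", "y")]
    print(list(enumerate_trails(edges, delta, {0}, 0, "s", "t")))
    # prints [(0, 1), (4, 3), (0, 1, 2, 3)]
```

In the example graph, the third trail visits node t twice, so it is a trail but not a simple path. Ties between equally long candidates are broken lexicographically here, which is a choice made for reproducibility and is not prescribed by Algorithm 1.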