Which Regular Languages can be Efficiently Indexed?
Nicola Cotumaccio, Giovanna D'Agostino, Alberto Policriti, Nicola Prezza
Nicola Cotumaccio, Gran Sasso Science Institute, L’Aquila, Italy
Giovanna D’Agostino, University of Udine, Italy
Alberto Policriti, University of Udine, Italy
Nicola Prezza, University Ca’ Foscari, Venice, Italy
Abstract
Consider the problem of matching a pattern $P$ of length $\pi$ against the elements of a given regular language $\mathcal{L}$ in the setting where $\mathcal{L}$ can be pre-processed off-line in a fast data structure (an index). Regular expression matching is a ubiquitous problem in computer science, finding fundamental applications in areas including, but not limited to, natural language processing, search engines, compilers, and databases. Recent results have settled the exact complexity of this problem: $\Theta(\pi m)$ time is necessary and sufficient for indexed pattern matching queries, where $m$ is the size of an NFA recognizing $\mathcal{L}$. This, however, does not mean that all regular languages are hard to index: for instance, for the sub-class of Wheeler languages [SODA’20] we can reduce query time to the optimal $O(\pi)$. A Wheeler language admits a total order of a finite refinement of its Myhill-Nerode equivalence classes reflecting the co-lexicographic order of their elements. This boosts indexing performance because classes whose elements are suffixed by $P$ form a range in this order. In [SODA’21], this technique was extended to arbitrary NFAs by allowing the order to be partial. This line of attack suggested that the width $p$ of such an order is the parameter ultimately capturing the fine-grained complexity of the problem: (i) indexed pattern matching can always be solved in $O(\pi p^2)$ time, (ii) the standard powerset construction algorithm always produces an output whose size is exponential in $p$ rather than in the input’s size, and (iii) $p$ even determines how succinctly NFAs can be encoded. In the present work, we take a step further and study the hierarchy of $p$-sortable languages: regular languages accepted by automata of width $p$. Our main contributions are the following: (i) we show that the hierarchy is strict and does not collapse, (ii) we provide (exponential) upper and lower bounds relating the minimum widths of equivalent NFAs and DFAs, and (iii) we characterize DFAs of minimum $p$ for a given $\mathcal{L}$ via a co-lexicographic variant of the Myhill-Nerode theorem. Our findings imply that in polynomial time we can build an index breaking the worst-case conditional lower bound of $\Omega(\pi m)$, whenever the input NFA’s width is at most $\epsilon \log m$, for any constant $0 \le \epsilon < 1/2$.
ACM Subject Classification
Theory of computation → Pattern matching
Keywords and phrases
Wheeler languages, Indexing
© Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, and Nicola Prezza; licensed under Creative Commons License CC-BY 4.0. 42nd Conference on Very Important Topics (CVIT 2016). Editors: John Q. Open and Joan R. Access; Article No. 23; pp. 23:1–23:23. Leibniz International Proceedings in Informatics, Schloss Dagstuhl – Leibniz-Zentrum für Informatik, Dagstuhl Publishing, Germany. arXiv [cs.FL].

String indexing is the algorithmic problem of building a small data structure (an index) over a given string supporting fast substring search queries [15]. Building efficient string indexes is a challenging problem which finds important applications in several areas, notably bioinformatics [14, 13]. Lifting this problem to a regular collection $\mathcal{L}$ of strings is an even more challenging problem and naturally calls into play finite state automata. As a matter of fact, regular expression matching is a ubiquitous problem in computer science, finding fundamental applications in areas including, but not limited to, natural language processing, search engines, compilers, and databases. When $\mathcal{L}$ is represented as an NFA (equivalently, a regular expression) of size $m$, existing on-line algorithms [3] solve the problem in $O(\pi m)$ time, $\pi$ being the length of the query pattern. Recent lower bounds by Backurs and Indyk [4], Equi et al. [8, 7], Potechin and Shallit [16], and Gibney [11] show that, unless important conjectures such as the Strong Exponential Time Hypothesis (SETH) [12] fail, this complexity cannot be significantly improved. This holds even in the off-line setting (the subject matter of our work) where $\mathcal{L}$ can be pre-processed in an index in polynomial time and the complexity is measured in terms of query times [9]. As pointed out by Backurs and Indyk [4], Gagie et al. [10], Alanko et al. [1, 2], and Cotumaccio and Prezza [5], however, this does not necessarily mean that all regular languages are hard to index.

Indeed, in [1, 2] we tackled the task of characterising regular languages admitting a direct generalization of known string indexing techniques: those accepted by so-called Wheeler automata, introduced in [10]. More specifically, recalling that a state $q$ of an NFA can be seen as the collection $S_q$ of strings labeling the paths that connect the start state with $q$, we proved that Wheeler automata are those whose states $q$ (i) have $S_q$ that is a convex set $I_q$ in the co-lexicographically ordered set of strings read on the automaton’s paths, and (ii) are such that the family of these $I_q$ enjoys the so-called prefix/suffix property: the only way $I_q$ can intersect another $I_{q'}$ is that a suffix of the former coincides with a prefix of the latter (or vice versa).

The co-lexicographic order over strings can be naturally lifted to the elements of a family of convex sets enjoying such property. In turn, this defines an order over the automaton’s states which enables pattern matching queries in optimal $O(\pi)$ time: states reached by a path labeled with a given string $P$ form an interval in this order [10].

Let $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ denote the prefix closure of the language $\mathcal{L}(\mathcal{A})$ recognized by $\mathcal{A}$. Given a finite-state automaton $\mathcal{A}$ and representing all $\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ on a line where $\alpha \preceq \beta$ means that $\alpha$ is co-lexicographically smaller than or equal to $\beta$, we can depict a Wheeler automaton/language as follows:

Figure 1: Family of convex sets $I_{q_i}$ (on the linear order $(\mathrm{Pref}(\mathcal{L}(\mathcal{A})), \prec)$) corresponding to the states of a Wheeler NFA.

Deterministic
Wheeler automata are simply those whose $I_q$’s have pairwise empty intersection. From the computational point of view, the above characterisation has the interesting consequence that turning a nondeterministic Wheeler automaton into a deterministic one takes polynomial time with the classic powerset construction algorithm, in contrast with the general (exponential) case of regular languages. In fact, we proved an even stronger property: any Wheeler NFA with $n$ states admits an equivalent Wheeler DFA with at most $2n$ states.

Since, clearly, not all (interesting, regular) languages admit a Wheeler accepting automaton, the next natural question is: what if we want to index a general regular language? Can we say something about the language’s propensity to be indexed? Can we give directions/bounds on the complexity of such an indexing task?

In this paper, elaborating on the idea put forward in [5], we prove that the above picture is a sort of one-dimensional version of a more general one. In this more general view, the set of the $I_q$’s is (always, for any automaton) partially ordered and all its elements end up in a collection of $p$ linearly ordered components, where $p$ is the order’s width. In other words, we study the scenario in which we do not put any constraint on the language and prove that the picture becomes:

Figure 2: Family of convex sets $I_{q_i}$ (on the linear order $(\mathrm{Pref}(\mathcal{L}(\mathcal{A})), \prec)$) corresponding to the states of an arbitrary NFA. The family can be partitioned into $p$ linear components (linear component 1, ..., linear component $p$), each enjoying the prefix/suffix property. However, the union of any two (distinct) components may not satisfy this property.

Looking at the projection of the language on a single linear component, we almost see a Wheeler language: its elements $I_q$ form a prefix/suffix family. In fact, the picture is a bit more complex, as transitions can let us move among different components. As it turns out, the order’s width $p$ is a fundamental measure of NFA complexity [5]: (i) indexed pattern matching can always be solved in $O(\pi p^2)$ time (the Wheeler case corresponding to $p = 1$), (ii) the standard powerset construction algorithm always produces an output whose size is exponential in $p$, rather than in the input’s size, and (iii) $p$ even determines how succinctly NFAs can be encoded.

Within this framework, our main contribution is to begin the study of the hierarchy of $p$-sortable languages: regular languages accepted by automata of width $p$ (for the minimum such $p$). In this hierarchy, regular languages are sorted according to the new fundamental measure of NFA complexity $p$. In more detail:
(1) We show that the hierarchy is strict and does not collapse: a $p$-sortable language exists for all $p \ge 1$.
(2) We provide (exponential) upper and lower bounds relating the minimum widths of equivalent NFAs and DFAs: moving from NFAs to DFAs may require an exponentially larger $p$ in the worst case.
(3) We characterize DFAs of minimum $p$ for a given regular language via a co-lexicographic variant of the Myhill-Nerode theorem. Interestingly, we show that the smallest minimum-width automaton is not unique.
(4) We furthermore give a more general self-contained proof of the exponential dependency in $p$ between the input and output sizes of the standard powerset construction algorithm.

Our findings also have important algorithmic consequences. For instance, let $0 < \epsilon < 1$ be a constant and let $\mathcal{A}$ be an NFA of size $m$ and width $p \le (\epsilon/2) \log m$. Contributions (2) and (4), when combined with the polynomial-time DFA indexing algorithms of [5], imply that in polynomial time we can index $\mathcal{A}$ so that pattern matching queries on $\mathcal{L}(\mathcal{A})$ are solved in $O(\pi m^{\epsilon})$ time. This asymptotically breaks the conditional lower bound $\Omega(\pi m)$ of Equi et al. [9], which holds in the worst case even when polynomial preprocessing time is allowed.

The paper is organized as follows: after giving some definitions and notations in Sections 2 and 3, in Section 4 we discuss the notion of width of a regular language and the two hierarchies (deterministic and nondeterministic) based on this notion. In Section 5 we prove a Myhill-Nerode theorem for each level of the deterministic hierarchy.

Due to limited space, most of the proofs can be found in the appendix.
We say that $(V, \le)$ is a partial order if $V$ is a set and $\le$ is a binary relation on $V$ that is reflexive, antisymmetric, and transitive. Any $u, v \in V$ are said to be $\le$-comparable if either $u \le v$ or $v \le u$. We write $u < v$ when $u \le v$ and $u \ne v$. We write $u \parallel v$ if $u$ and $v$ are not $\le$-comparable. Note that for every $u, v \in V$ exactly one of the following holds: (1) $u = v$, (2) $u < v$, (3) $v < u$, (4) $u \parallel v$. We say that $(V, \le)$ is a total order if $(V, \le)$ is a partial order and every pair of elements of $V$ is $\le$-comparable. A subset $Z \subseteq V$ is a $\le$-chain if $(Z, \le)$ is a total order, and a family $\{V_i\}_{i=1}^{p}$ is a $\le$-chain partition if $\{V_i\}_{i=1}^{p}$ is a partition of $V$ and each $V_i$ is a $\le$-chain. The width of $(V, \le)$ is the smallest integer $p$ for which there exists a chain partition $\{V_i\}_{i=1}^{p}$. We say that $U \subseteq V$ is a $\le$-antichain if every pair of distinct elements of $U$ is not $\le$-comparable. Dilworth’s theorem [6] states that the width of $(V, \le)$ is the cardinality of a largest $\le$-antichain. A subset $C$ of a partial order $(V, \le)$ is $\le$-convex if for every $u, v, z \in V$, if $u, z \in C$ and $u < v < z$, then $v \in C$. If $C$ is a finite $\le$-convex set we call it a $\le$-interval. If the order is deducible from the context, we drop the prefix $\le$. A monotone sequence in a partial order $(V, \le)$ is a sequence $(v_n)_{n \in \mathbb{N}}$ with $v_n \in V$ and either $v_i \le v_{i+1}$ for all $i$, or $v_i \ge v_{i+1}$ for all $i$.

A nondeterministic finite automaton (NFA) is a 5-tuple $\mathcal{A} = (Q, s, \Sigma, \delta, F)$, where $Q$ is the set of states, $s$ is the initial state, $\Sigma$ is the alphabet, $\delta : Q \times \Sigma \to \mathrm{Pow}(Q)$ is the transition function, and $F \subseteq Q$ is the set of final states. An automaton $\mathcal{A}$ is deterministic (a DFA) if $|\delta(q, a)| \le 1$ for any $q \in Q$ and $a \in \Sigma$. As customary, we extend $\delta$ to operate on strings as follows: for all $q \in Q$, $a \in \Sigma$, and $\alpha \in \Sigma^*$:

$$\delta(q, \epsilon) = \{q\}, \qquad \delta(q, \alpha a) = \bigcup_{v \in \delta(q, \alpha)} \delta(v, a).$$

We say that a state $q'$ is reachable from a state $q$ if there exists $\alpha \in \Sigma^*$ with $q' \in \delta(q, \alpha)$. If the automaton is deterministic we write $\delta(q, \alpha) = q'$ for the unique $q'$ such that $\delta(q, \alpha) = \{q'\}$ (if defined). We denote by $\mathcal{L}(\mathcal{A}) = \{\alpha \in \Sigma^* \mid \delta(s, \alpha) \cap F \ne \emptyset\}$ the language recognized by $\mathcal{A}$.

Throughout this paper we assume that every NFA $\mathcal{A}$ satisfies the following properties: (1) $\mathcal{L}(\mathcal{A}) \ne \emptyset$, (2) every state is reachable from the initial state, (3) every state is either final or allows reaching a final state, (4) the initial state is not reachable from any other state, (5) if $q \in \delta(q', a) \cap \delta(q'', b)$ then $a = b$ (input-consistency).

As a consequence of the above assumptions, for every state $u \ne s$ there exists a unique letter $\lambda(u)$ such that $u \in \delta(q, \lambda(u))$ for some $q$. For the initial state $s$, we define $\lambda(s) = \# \notin \Sigma$. Moving labels from an edge to its target state, input-consistent automata can be described as state-labeled automata, and this is what we do in our examples. It is also convenient to define an edge of an automaton as a triple $(u, v, a)$ with $v \in \delta(u, a)$ (or simply $(u, v)$, since $a = \lambda(v)$) and denote the set of edges of the automaton as $E_{\mathcal{A}}$, or simply $E$ when $\mathcal{A}$ is clear from the context.

The class of NFAs satisfying (1-5) is fully general when languages are concerned. In particular, it can be easily seen that any automaton can be converted into an input-consistent one recognizing the same language, possibly at the price of increasing $|Q|$ by a multiplicative factor $|\Sigma|$.

We assume that on the alphabet $\Sigma$ there is a fixed, predetermined total order $\preceq$. We also assume that $\# \prec a$ for every $a \in \Sigma$.

If $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ is an NFA, we denote with $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ the set of all prefixes of some word in $\mathcal{L}(\mathcal{A})$. For every $\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$, let $I_\alpha = \{q \in Q \mid q \in \delta(s, \alpha)\}$. For every $q \in Q$, let $I_q = \{\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A})) \mid q \in \delta(s, \alpha)\}$.

The Myhill-Nerode equivalence $\equiv_{\mathcal{L}}$ for $\mathcal{L} \subseteq \Sigma^*$ is the equivalence relation on $\mathrm{Pref}(\mathcal{L})$ such that for every $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$:

$$\alpha \equiv_{\mathcal{L}} \beta \iff (\forall \gamma \in \Sigma^*)(\alpha\gamma \in \mathcal{L} \iff \beta\gamma \in \mathcal{L}).$$

In this section we recall basic definitions and results from [5]. We consider orders on the set of the automaton’s states reflecting the co-lexicographic order of the words spelling paths from the source. As stated in the previous section, we always refer to a fixed linear order $\prec$ on the alphabet $\Sigma$, co-lexicographically extended to words.

▶ Definition 1. ([5, Def. 3.1])
Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA. A co-lexicographic order on $\mathcal{A}$ is a partial order $(Q, \le)$ such that:
(Axiom 1) For every $u, v \in Q$, if $\lambda(u) \prec \lambda(v)$, then $u < v$ (hence $s \le u$ for every $u \in Q$).
(Axiom 2) For every $(u', u), (v', v) \in E_{\mathcal{A}}$, if $\lambda(u) = \lambda(v)$ and $u < v$, then $u' \le v'$.

Wheeler automata [10] are precisely those for which the order $\le$ of Definition 1 is total. Since not all automata admit a Wheeler order, the totality requirement restricts the class of automata we are allowed to use when considering a language. On the other hand, if we drop the totality requirement, we may consider the whole class of finite automata and use the width of the partial order (intuitively, the "distance" from being a linear order) to classify automata and the languages they accept (see Definitions 3 and 6).

The following lemma from [5] describes how a co-lexicographic order on the states of an automaton is related to the co-lexicographic order on the words that can be read on its paths.

▶ Lemma 2. ([5, Lem. 3.1])
Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA, and let $\le$ be a co-lexicographic order on $\mathcal{A}$. Let $u, v \in Q$ and $\alpha, \beta \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ be such that $u \in I_\alpha$, $v \in I_\beta$ and $\{u, v\} \not\subseteq I_\alpha \cap I_\beta$ (or equivalently, $\alpha \in I_u$, $\beta \in I_v$ and $\{\alpha, \beta\} \not\subseteq I_u \cap I_v$). Then:
1. If $\alpha \prec \beta$, then $u < v$ or $u \parallel v$.
2. If $u < v$, then $\alpha \prec \beta$.

Automata can be classified according to the width of their co-lexicographic orders:

▶ Definition 3. ([5, Def. 3.3])
Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA. We say that $\mathcal{A}$ is $p$-sortable if there exists a co-lexicographic order $\le$ on $\mathcal{A}$ such that $Q$ admits a $\le$-chain partition $\{Q_i\}_{i=1}^{p}$. The width of an automaton $\mathcal{A}$, denoted by $\mathrm{width}(\mathcal{A})$, is the smallest integer $p$ for which $\mathcal{A}$ is $p$-sortable.

In Fig. 3 we present an NFA $\mathcal{A}$ with $\mathrm{width}(\mathcal{A}) = 2$ (with the natural alphabetic order over the letters, except that the label of the initial state precedes every letter), together with a chain partition $Q_1, Q_2$, where states in $Q_1$ are colored dark grey, states in $Q_2$ are colored light grey, and inside a class states are ordered from left to right. Notice that this NFA is not 1-sortable: let $q, q'$ be the states labeled by $x$, with $q$ above $q'$; then $ax, axx \in I_q$, $bx \in I_{q'} \setminus I_q$ and, since $ax \prec bx \prec axx$, using Lemma 2 we see that $q$ and $q'$ cannot be compared in any co-lexicographic order.
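The comparison $ax \prec bx \prec axx$ used in the argument above can be checked mechanically. The following is a minimal sketch, assuming the alphabet is ordered as a < b < x < y < z; the names `RANK` and `colex_key` are illustrative and do not come from the paper. Co-lexicographic order compares words from their last character backwards, so reversing a word and comparing rank lists gives the desired order.

```python
# Hypothetical helper names; the alphabet order a < b < x < y < z is an assumption
# matching the "natural alphabetic order" of the running example.
RANK = {c: i for i, c in enumerate("abxyz")}

def colex_key(word, rank=RANK):
    """Sort key for co-lexicographic order: read the word right to left.

    Python compares lists lexicographically, and a proper prefix of a list is
    smaller than the list itself, so a proper suffix of a word is co-lex smaller
    than the word.
    """
    return [rank[c] for c in reversed(word)]

def colex_less(alpha, beta, rank=RANK):
    """True if alpha is strictly co-lexicographically smaller than beta."""
    return colex_key(alpha, rank) < colex_key(beta, rank)

# The sandwich from the example: a string reaching q' lies strictly between
# two strings reaching q.
words = sorted(["axx", "bx", "ax"], key=colex_key)
```

Since $bx$ reaches only $q'$ while $ax$ and $axx$ both reach $q$, the two orientations $q < q'$ and $q' < q$ are each contradicted by the second item of Lemma 2, which is the reasoning applied in the text.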
Figure 3: An NFA $\mathcal{A}$ with $\mathrm{width}(\mathcal{A}) = 2$.

An NFA can have many co-lexicographic orders, and the larger the order (that is, the larger the number of comparable state pairs), the more insight we gain into the language it recognizes. In [5, Thm. 6.2] it was shown that DFAs admit a unique maximal co-lexicographic order, while for NFAs the maximal order is, in general, not unique. Here we give an explicit definition of the maximal co-lexicographic order of a DFA: a state precedes a distinct state if and only if all strings reaching the former are co-lexicographically smaller than all strings reaching the latter. This order will be particularly important to obtain our results.

▶ Lemma 4.
Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be a DFA. For every $u, v \in Q$, let:

$$u \le v \iff u = v \ \lor\ (\forall \alpha \in I_u)(\forall \beta \in I_v)(\alpha \prec \beta).$$

Then, $\le$ is a co-lexicographic order. Moreover, for every co-lexicographic order $\le'$ on $\mathcal{A}$ and for every $u, v \in Q$, if $u \le' v$, then $u \le v$. We say that $\le$ is the maximal co-lexicographic order on $\mathcal{A}$.

It is well known that nondeterministic automata can be exponentially more succinct (as far as the number of states is concerned) than equivalent deterministic ones. However, for specific classes of automata, this gap can be considerably reduced. For example, in [1] it is proved that for nondeterministic Wheeler automata the gap is linear. This result is generalized in [5] as follows, substantiating the importance of the notion of automata’s width. In the appendix we give an independent, self-contained, and more general proof of the following bound:

▶ Lemma 5. ([5, Thm 5.1])
Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA, with $|Q| = n$ and $\mathrm{width}(\mathcal{A}) = p$. Let $\mathcal{A}^* = (Q^*, E^*, \Sigma, s^*, F^*)$ be the powerset automaton of $\mathcal{A}$. Then, $|Q^*| \le 2^p (n - p + 1) - 1$.

The above lemma states that the well-known exponential explosion of NFA determinization occurs in the width $p$, rather than in the number $n$ of states. This fact has important consequences for the study of efficient algorithms on automata: for example, it implies that the PSPACE-complete NFA equivalence problem is fixed-parameter tractable with respect to $p$ (via powerset construction followed by DFA equivalence checking).

On the grounds of the basic definition of automata’s co-lexicographic width, we start studying its implications for the theory of regular languages. In this section, we define the "width of a regular language", based on co-lexicographic orders for the automata recognizing it.

▶ Definition 6.
Let $\mathcal{L}$ be a regular language.
The nondeterministic width of $\mathcal{L}$, denoted by $\mathrm{width}^N(\mathcal{L})$, is the smallest integer $p$ for which there exists an NFA $\mathcal{A}$ such that $\mathcal{L}(\mathcal{A}) = \mathcal{L}$ and $\mathrm{width}(\mathcal{A}) = p$.
The deterministic width of $\mathcal{L}$, denoted by $\mathrm{width}^D(\mathcal{L})$, is the smallest integer $p$ for which there exists a DFA $\mathcal{A}$ such that $\mathcal{L}(\mathcal{A}) = \mathcal{L}$ and $\mathrm{width}(\mathcal{A}) = p$.

In Lemma 8 we check that every level of both the above hierarchies is non-empty. To see this we shall use the following lemma:

▶ Lemma 7.
Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA. Assume that $\mathcal{A}$ contains a simple cycle of length $m$ such that all edges of the cycle are equally labeled. Then, $\mathrm{width}(\mathcal{A}) \ge m$.

Proof.
Let $e \in \Sigma$ and let $u_0, \dots, u_{m-1}$ be $m$ pairwise distinct states such that:

$$u_1 \in \delta(u_0, e),\ u_2 \in \delta(u_1, e),\ \dots,\ u_0 \in \delta(u_{m-1}, e).$$

We prove that the states $u_0, \dots, u_{m-1}$ are pairwise incomparable in any co-lexicographic order $\le$ on $\mathcal{A}$. Suppose there are $0 \le i < i + h \le m - 1$ such that $u_i, u_{i+h}$ are comparable. Assume w.l.o.g. that $u_i < u_{i+h}$; since the $u_j$’s are pairwise distinct, denoting by $[j]$ the class of $j$ modulo $m$, by repeatedly applying Axiom 2 to the edges of the cycle we obtain:

$$\cdots < u_{[i - nh]} < \cdots < u_{[i - h]} < u_i < u_{i+h}.$$

Since there are exactly $m$ possible classes modulo $m$, there exist $j < k$ with $[i - jh] = [i - kh]$, a contradiction. Hence, the states $u_0, \dots, u_{m-1}$ are pairwise incomparable in any co-lexicographic order on $\mathcal{A}$. This proves that $\mathrm{width}(\mathcal{A}) \ge m$. ◀

▶ Lemma 8.
For every integer $m \ge 1$, there exists $\mathcal{L}$ such that $\mathrm{width}^N(\mathcal{L}) = \mathrm{width}^D(\mathcal{L}) = m$.

Proof.
Define:

$$\mathcal{L}_m = \{a^{km} \mid k \ge 1\}.$$

Notice that it will suffice to prove that $\mathcal{L}_m$ is recognized by an $m$-sortable DFA but cannot be recognized by any $(m-1)$-sortable NFA. Consider the DFA $\mathcal{A}_m$ having $m + 1$ nodes $s, q_1, \dots, q_m$ (with $q_m$ final), $\delta(s, a) = q_1$, $\delta(q_i, a) = q_{i+1}$ for $i = 1, \dots, m - 1$, and $\delta(q_m, a) = q_1$, endowed with the maximal co-lexicographic order defined as in Lemma 4. Under this order the states $q_1, \dots, q_m$ are pairwise incomparable, while $s$ precedes every other state, so that $\mathcal{A}_m$ is $m$-sortable.

Next, consider any NFA $\mathcal{A}$ that recognizes $\mathcal{L}_m$, whose alphabet must be $\Sigma = \{a\}$. Since $\mathcal{L}_m$ is an infinite language, any NFA $\mathcal{A}$ recognizing $\mathcal{L}_m$ must contain a simple cycle $C$. Let $u$ be any node in $C$ and let $a^c$ with $c \ge 1$ be the label of $C$. If $a^k, a^h$ are the labels of some paths from $s$ to $u$ and from $u$ to a final state, respectively, we must have $m \mid (h + k)$ and $m \mid (h + k + c)$; hence $c$ is a non-zero multiple of $m$ and by Lemma 7 we conclude that $\mathcal{A}$ cannot be $(m-1)$-sortable. ◀

Clearly, for every regular language $\mathcal{L}$ we have $\mathrm{width}^N(\mathcal{L}) \le \mathrm{width}^D(\mathcal{L})$. Moreover, for languages with $\mathrm{width}^N(\mathcal{L}) = 1$, the so-called Wheeler languages, it is known that the nondeterministic and deterministic widths coincide (see [1]). However, as we shall see, this characteristic is very peculiar to Wheeler languages, as the gap from nondeterministic to deterministic width for regular languages is, in general, exponential.

In order to prove the exponential gap between nondeterministic and deterministic width, we first prove Lemma 9 below.
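As an aside on the first half of the proof of Lemma 8, the incomparability of $q_1, \dots, q_m$ in $\mathcal{A}_m$ can be certified on finite samples. The sketch below is illustrative only: all function names are hypothetical, and it relies on the fact that over the unary alphabet $\{a\}$ the co-lexicographic order on words coincides with the order on lengths. For the order of Lemma 4, $u < v$ is refuted as soon as some string reaching $u$ is longer than some string reaching $v$; refuting both orientations certifies $u \parallel v$ in the maximal order, hence in every co-lexicographic order.

```python
from itertools import combinations

def unary_cycle_dfa(m):
    """Transition map of A_m: s -> q1 -> ... -> qm -> q1, every edge labeled 'a'."""
    delta = {"s": "q1"}
    for i in range(1, m):
        delta[f"q{i}"] = f"q{i + 1}"
    delta[f"q{m}"] = "q1"
    return delta

def reaching_lengths(delta, max_len):
    """Map each state to the lengths k <= max_len such that a^k reaches it from s."""
    reached = {}
    state = "s"
    for k in range(max_len + 1):
        reached.setdefault(state, []).append(k)
        state = delta[state]
    return reached

def certified_incomparable(lengths_u, lengths_v):
    """True if the samples refute both u < v and v < u in the order of Lemma 4."""
    u_before_v_fails = max(lengths_u) > min(lengths_v)
    v_before_u_fails = max(lengths_v) > min(lengths_u)
    return u_before_v_fails and v_before_u_fails

def antichain_size(m):
    """Return m if all pairs among q1..qm are certified incomparable, else None."""
    reached = reaching_lengths(unary_cycle_dfa(m), 3 * m)  # three laps of the cycle
    cycle = [f"q{i}" for i in range(1, m + 1)]
    ok = all(certified_incomparable(reached[u], reached[v])
             for u, v in combinations(cycle, 2))
    return m if ok else None
```

Sampling strings up to length $3m$ is an arbitrary choice that happens to suffice here, since each cycle state is then reached by strings of at least two different lengths.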
▶ Lemma 9.
Let $p_1, \dots, p_k$ be distinct primes. Then, there exists a language $\mathcal{L}$ such that $\mathrm{width}^D(\mathcal{L}) \ge \prod_{i=1}^{k} p_i$ and $\mathrm{width}^N(\mathcal{L}) \le \sum_{i=1}^{k} p_i$.

Proof.
Consider the language:

$$\mathcal{L} = \{a^r \mid (\exists i \in \{1, \dots, k\})(p_i \mid r)\}.$$

It is easy to explicitly define an NFA recognizing $\mathcal{L}$ with $n = 1 + \sum_{i=1}^{k} p_i$ states: the source state and $k$ disjoint cycles, of lengths $p_1, \dots, p_k$, each connected to the source state via an edge. Since the source state can always be compared to any other state, the width of this NFA is at most $n - 1 = \sum_{i=1}^{k} p_i$.

To prove that $\mathrm{width}^D(\mathcal{L}) \ge \prod_{i=1}^{k} p_i$, consider a DFA $\mathcal{A}$ recognizing $\mathcal{L}$. Any such DFA is composed of a line of states, $\delta(s_0, a) = s_1$, $\delta(s_1, a) = s_2$, ..., $\delta(s_{t-1}, a) = s_t$, terminating in a cycle of length, say, $\ell$. We prove that $\prod_{i=1}^{k} p_i$ divides $\ell$. Consider $m$ such that the word $a^{u_1}$ reaches a final state inside the cycle, where $u_1 = (\prod_{i=1}^{k} p_i)\, m$. If $u_2 = u_1 + \ell$, the word $a^{u_2}$ arrives at the same final state, hence it belongs to $\mathcal{L}$; let $p_i$ be such that $p_i$ divides $u_2$ (say, $p_i = p_1$). It follows that $p_1$ divides $\ell$. Consider now $m'$ such that the word $a^{u_3}$ arrives at a final state of the cycle, where $u_3 = (\prod_{i=2}^{k} p_i)\, m' + \ell$. Hence, there exists $p_i$ such that $p_i$ divides $u_3$. Since $p_1$ divides $\ell$ but not $(\prod_{i=2}^{k} p_i)\, m'$, we have $p_i \ne p_1$, say, $p_i = p_2$. It follows that $p_2$ divides $\ell$.

Iterating the above argument we obtain that $\ell$ is divisible by every $p_i$, hence by their product. From Lemma 7 it then follows that $\mathrm{width}(\mathcal{A}) \ge \prod_{i=1}^{k} p_i$.

Since this is true for any DFA recognizing $\mathcal{L}$, we conclude that $\mathrm{width}^D(\mathcal{L}) \ge \prod_{i=1}^{k} p_i$. ◀

Our lower bound immediately follows:

▶ Lemma 10.
There exists a language $\mathcal{L}$ such that $\mathrm{width}^D(\mathcal{L}) \ge e^{\sqrt{\mathrm{width}^N(\mathcal{L})}}$.

Proof.
Let $p_1, \dots, p_k$ be all primes no larger than a fixed $n$. The primorial function grows asymptotically as $\prod_{i=1}^{k} p_i = e^{(1 + o(1)) n} \ge e^{n}$, and the sum of the primes no larger than $n$ grows asymptotically as $\sum_{i=1}^{k} p_i \in O(n^2 / \log n)$ and is, in fact, never larger than $n^2$. Now, consider the language $\mathcal{L}$ of Lemma 9. Combining $\mathrm{width}^N(\mathcal{L}) \le n^2$ with $\mathrm{width}^D(\mathcal{L}) \ge e^{n}$, we obtain the claimed lower bound. ◀

We now complement the above lower bound with an upper bound which can be proved by means of the usual powerset construction:

▶ Lemma 11.
Let $\mathcal{A}$ be an NFA and let $\mathcal{A}^*$ be the powerset automaton obtained from $\mathcal{A}$. Then, $\mathrm{width}(\mathcal{A}^*) \le 2^{\mathrm{width}(\mathcal{A})} - 1$.

From the above lemma we immediately get the following result.

▶ Corollary 12.
Let $\mathcal{L}$ be a regular language. Then, $\mathrm{width}^D(\mathcal{L}) \le 2^{\mathrm{width}^N(\mathcal{L})} - 1$.

Notice that, when $\mathrm{width}^N(\mathcal{L}) = 1$, that is, when $\mathcal{L}$ is Wheeler, we obtain that $\mathrm{width}^D(\mathcal{L}) = \mathrm{width}^N(\mathcal{L})$ (as already proved in [1]).

An important consequence of the above bounds is that, in polynomial time, we can index an interesting subset of the regular languages (represented as NFAs) such that pattern matching query times break the indexability lower bound of Equi et al. [9] (which holds in the worst case even when polynomial preprocessing time is allowed):

▶ Corollary 13.
Let $\mathcal{A}$ be an NFA with $m$ edges and $\mathrm{width}(\mathcal{A}) \le (\epsilon/2) \log m$ for any constant $0 < \epsilon < 1$. We can index $\mathcal{A}$ in polynomial time so that pattern matching queries on $\mathcal{L}(\mathcal{A})$ are solved in $O(\pi m^{\epsilon})$ time, $\pi$ being the pattern’s length.

Proof.
Let $\sigma$ be the alphabet’s size, $n$ be the number of states of $\mathcal{A}$, and $p = \mathrm{width}(\mathcal{A})$. First, note that $\sigma \le m$ holds after a (polynomial-time) re-mapping of the alphabet to the interval $[1, m]$ so that all symbols label at least one edge. The first step is to run the powerset construction algorithm and build a DFA $\mathcal{A}^*$ equivalent to $\mathcal{A}$. An analysis of the powerset construction algorithm’s complexity (for example, see [5, Lem. 5.1]), combined with Lemma 5, shows that in $O(2^p (n - p + 1) n^2 \sigma) = O(m^{\epsilon/2 + 1} n^3)$ time the powerset construction algorithm generates $\mathcal{A}^*$, which has at most $\bar{n} \le 2^p (n - p + 1) \le m^{\epsilon/2} n$ states and at most $\bar{m} \le \bar{n} \cdot \sigma \le m^{\epsilon/2 + 1} n$ edges. By Lemma 11, the width of $\mathcal{A}^*$ is at most $\bar{p} < 2^p \le m^{\epsilon/2}$. At this point, [5, Cor. 6.1] states that we can build the generalized FM-index [5, Thm. 4.2] of $\mathcal{A}^*$ in $O(\bar{m}^2 + \bar{n}^{5/2}) = O((m^{\epsilon/2 + 1} n)^2 + (m^{\epsilon/2} n)^{5/2})$ time. This index supports pattern matching queries in $O(\pi \cdot \bar{p}^2 \cdot \log(\bar{p} \cdot \sigma)) = O(\pi m^{\epsilon} \cdot \log m)$ time, which can be made $O(\pi m^{\epsilon})$ by infinitesimally adjusting $\epsilon$. All running times for building the index are polynomial in $m$ and $n$, that is, in the size of the input. ◀

Before Corollary 13, only the case $p = 1$ (Wheeler languages [10, 2]) admitted a polynomial-time indexing strategy (by the indexing algorithms of Alanko et al. [1]) beating the indexability lower bound [9] of $\Omega(\pi m)$. Corollary 13 extends this result to a larger class of NFAs.

The Myhill-Nerode Theorem for regular languages states that there is an exact correspondence between DFAs recognizing a regular language $\mathcal{L}$ and right-invariant equivalences on words with finite index, realizing $\mathcal{L}$ as a union of classes. Moreover, among such equivalences, there is one realizing the DFA with minimum number of states recognizing $\mathcal{L}$, and this automaton is unique up to isomorphism. As proved in [1], these results carry over to Wheeler languages and automata, so that we can speak about the minimum Wheeler DFA recognizing a Wheeler language.
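The classical statement recalled above can be made concrete on the language of Lemma 9 with primes 2 and 3. The sketch below is not taken from the paper: it applies Moore's partition refinement (a standard way to compute the Myhill-Nerode classes of a complete DFA whose states are all reachable) to a deliberately redundant DFA, a single cycle of 12 states accepting $a^r$ whenever $2 \mid r$ or $3 \mid r$. The function name and the example automaton are illustrative assumptions, and the automaton ignores the paper's convention that the initial state is not reachable from other states.

```python
def moore_partition(states, alphabet, delta, finals):
    """Return a map state -> class id, one class per Myhill-Nerode class.

    Assumes a complete DFA (delta defined on every (state, letter) pair) in which
    every state is reachable from the initial state.
    """
    # Initial split: final versus non-final states.
    block = {q: int(q in finals) for q in states}
    while True:
        # Two states stay together only if they agree on their own block
        # and on the blocks reached by every letter.
        signature = {
            q: (block[q],) + tuple(block[delta[q, a]] for a in alphabet)
            for q in states
        }
        ids = {}
        new_block = {q: ids.setdefault(signature[q], len(ids)) for q in states}
        if len(ids) == len(set(block.values())):
            return new_block  # no block was split: the partition is stable
        block = new_block

# Redundant DFA: a 12-cycle accepting a^r iff 2 | r or 3 | r.
states = list(range(12))
delta = {(q, "a"): (q + 1) % 12 for q in states}
finals = {q for q in states if q % 2 == 0 or q % 3 == 0}
classes = moore_partition(states, "a", delta, finals)
num_classes = len(set(classes.values()))
```

On this input the refinement merges each state $q$ with $q + 6$ and stops at six classes, matching the product $2 \cdot 3$ that Lemma 9 shows must divide the cycle length of any DFA for this language.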
In this section we prove that the picture is more complex in the sortability context. First, we prove that, given a regular language $\mathcal{L}$, it is not always the case that among the DFAs recognizing $\mathcal{L}$ and having minimum width, there is a unique DFA with minimum number of states, up to isomorphism.

▶ Lemma 14.
There exists a regular language $\mathcal{L}$ such that:
1. $\mathrm{width}^N(\mathcal{L}) = \mathrm{width}^D(\mathcal{L}) = 2$.
2. There exist two non-isomorphic DFAs $\mathcal{A}$ and $\mathcal{B}$ with the same number of states such that $\mathcal{L}(\mathcal{A}) = \mathcal{L}(\mathcal{B}) = \mathcal{L}$, $\mathrm{width}(\mathcal{A}) = \mathrm{width}(\mathcal{B}) = 2$, and no 2-sortable DFA with fewer states recognizes $\mathcal{L}$.

However, as we shall see in Corollary 25, if we fix the partition of $\mathrm{Pref}(\mathcal{L})$ induced by a $p$-sortable DFA recognizing $\mathcal{L}$, among all $p$-sortable DFAs recognizing $\mathcal{L}$ and inducing the same partition there is a unique DFA with minimum number of states. More generally, we will relate DFA sortability and convexity in equivalence relations, thereby deriving a co-lexicographic Myhill-Nerode theorem for regular languages.

First, let us recall the definition of right-invariant equivalence relation, which is at the root of the classical Myhill-Nerode theorem.

▶ Definition 15.
Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language and let $\sim$ be an equivalence relation on $\mathrm{Pref}(\mathcal{L})$.
We say that $\sim$ respects $\mathrm{Pref}(\mathcal{L})$ if:

$$(\forall \alpha, \beta \in \mathrm{Pref}(\mathcal{L}))(\forall \phi \in \Sigma^*)(\alpha \sim \beta \land \alpha\phi \in \mathrm{Pref}(\mathcal{L}) \to \beta\phi \in \mathrm{Pref}(\mathcal{L})).$$

If $\sim$ respects $\mathrm{Pref}(\mathcal{L})$, we say that $\sim$ is right-invariant if for every $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$ and for every $\phi \in \Sigma^*$, if $\alpha \sim \beta$ and $\alpha\phi \in \mathrm{Pref}(\mathcal{L})$ (and so $\beta\phi \in \mathrm{Pref}(\mathcal{L})$), then $\alpha\phi \sim \beta\phi$.

The next step is to focus on co-lexicographic equivalence relations that are consistent with respect to a given partition of $\mathrm{Pref}(\mathcal{L})$. More specifically, with the following definitions we introduce equivalence relations that (1) refine the partition in a forward-stable manner (that is, by extending two equivalent words with any fixed word $\phi$ we end up in the same partition’s class), and (2) whose classes form co-lexicographic convex sets.

▶ Definition 16.
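Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language, and let $\sim$ be an equivalence relation on $\mathrm{Pref}(\mathcal{L})$. Let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. For every $\alpha \in \mathrm{Pref}(\mathcal{L})$, let $U_\alpha$ be the unique element $U_i$ of $\mathcal{P}$ such that $\alpha \in U_i$.

▶ Definition 17.
Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language. Let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L})$.
Let $\sim$ be an equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that respects $\mathrm{Pref}(\mathcal{L})$. We say that $\sim$ is $\mathcal{P}$-consistent if for every $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$ and for every $\phi \in \Sigma^*$, if $\alpha \sim \beta$ and $\alpha\phi \in \mathrm{Pref}(\mathcal{L})$ (and so $\beta\phi \in \mathrm{Pref}(\mathcal{L})$), it holds that $U_{\alpha\phi} = U_{\beta\phi}$. Note that, in particular, $U_\alpha = U_\beta$.
Let $\sim$ be a $\mathcal{P}$-consistent equivalence relation on $\mathrm{Pref}(\mathcal{L})$. We say that $\sim$ is $\mathcal{P}$-convex if for every $\alpha \in \mathrm{Pref}(\mathcal{L})$ we have that $[\alpha]_\sim$ is a convex set in $(U_\alpha, \preceq)$.

In the following, we will be interested in the coarsest $\mathcal{P}$-consistent equivalence relation refining a given equivalence relation.

▶ Lemma 18.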
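Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language, and let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. Let $\sim$ be an equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that respects $\mathrm{Pref}(\mathcal{L})$. For every $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$, define:

$$\alpha \sim_{\mathcal{P}} \beta \iff (\alpha \sim \beta) \land (\forall \phi \in \Sigma^*)(\alpha\phi \in \mathrm{Pref}(\mathcal{L}) \to U_{\alpha\phi} = U_{\beta\phi}).$$

Then, $\sim_{\mathcal{P}}$ is the coarsest $\mathcal{P}$-consistent equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that refines $\sim$. We say that $\sim_{\mathcal{P}}$ is the $\mathcal{P}$-consistent refinement of $\sim$.

At this point, we add to the previous definitions one additional ingredient: we embed our equivalence relation in the co-lexicographic order by requiring that the relation is also $\mathcal{P}$-convex.

▶ Lemma 19.
Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language, and let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. Let $\sim$ be a $\mathcal{P}$-consistent equivalence relation on $\mathrm{Pref}(\mathcal{L})$. For every $\alpha, \gamma \in \mathrm{Pref}(\mathcal{L})$, define:

$$\alpha \sim^c \gamma \iff (\alpha \sim \gamma) \land (\forall \beta, \phi \in \Sigma^*)\big((\alpha\phi, \beta\phi \in \mathrm{Pref}(\mathcal{L})) \land (U_{\alpha\phi} = U_{\beta\phi}) \land (\min\{\alpha, \gamma\} \prec \beta \prec \max\{\alpha, \gamma\}) \to \alpha\phi \sim \beta\phi\big).$$

Then, $\sim^c$ is a $\mathcal{P}$-consistent and $\mathcal{P}$-convex equivalence relation on $\mathrm{Pref}(\mathcal{L})$, and it is the coarsest $\mathcal{P}$-convex equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that refines $\sim$. We say that $\sim^c$ is the $\mathcal{P}$-convex refinement of $\sim$.

▶ Definition 20.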
Let $\mathcal{L}$ be a regular language. Let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. We define $\equiv_{\mathcal{L},\mathcal{P}}$ to be the $\mathcal{P}$-consistent refinement of $\equiv_{\mathcal{L}}$. We define $\equiv^c_{\mathcal{L},\mathcal{P}}$ to be the $\mathcal{P}$-convex refinement of $\equiv_{\mathcal{L},\mathcal{P}}$.

The last steps consist in establishing a map from the linear components of an NFA recognizing $\mathcal{L}$ (i.e. the elements of a chain partition for one of its valid co-lexicographic orders) to subsets of $\mathrm{Pref}(\mathcal{L})$. First, we note that a chain partition of the NFA naturally induces a cover of $\mathrm{Pref}(\mathcal{L})$.

▶ Definition 21.
Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA, let $\le$ be a co-lexicographic order on $\mathcal{A}$ and let $\{Q_i\}_{i=1}^p$ be a $\le$-chain partition of $Q$. For every $i \in \{1, \dots, p\}$, define:
$$\mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i = \{\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A})) \mid I_\alpha \cap Q_i \neq \emptyset\}.$$

▶ Remark 22. In general, $\{\mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i\}_{i=1}^p$ is not a partition of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$, but just a cover of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$, that is, $\bigcup_{i \in \{1, \dots, p\}} \mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i = \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$. Nonetheless, note that $\{\mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i\}_{i=1}^p$ is a partition of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ if $\mathcal{A}$ is a DFA.

Then, a $\mathcal{P}$-sortable NFA (for a given partition $\mathcal{P}$ of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$) is defined as one admitting a chain partition of its states that is mapped (in the sense of the above definition) to $\mathcal{P}$.

▶ Definition 23. Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA, and let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$. We say that $\mathcal{A}$ is $\mathcal{P}$-sortable if there exists a co-lexicographic order $\le$ on $\mathcal{A}$ and a $\le$-chain partition $\{Q_i\}_{i=1}^p$ such that for every $i \in \{1, \dots, p\}$:
$$\mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i = U_i.$$

We can finally state our co-lexicographic extension of the Myhill-Nerode theorem.

▶ Theorem 24 (Co-lexicographic Myhill-Nerode theorem). Let $\mathcal{L}$ be a language. Let $\mathcal{P}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. The following are equivalent:
1. $\mathcal{L}$ is recognized by a $\mathcal{P}$-sortable NFA.
2. $\equiv^c_{\mathcal{L},\mathcal{P}}$ has finite index.
3. $\mathcal{L}$ is the union of some classes of a $\mathcal{P}$-convex, right invariant equivalence relation on $\mathrm{Pref}(\mathcal{L})$ of finite index.
4. $\mathcal{L}$ is recognized by a $\mathcal{P}$-sortable DFA.

By taking $\mathcal{P} = \{\mathrm{Pref}(\mathcal{L})\}$, the above theorem generalizes the one devised for the Wheeler case in [1, Thm 2.1]. In particular, the theorem implies once again that if $\mathrm{width}^N(\mathcal{L}) = 1$, then $\mathrm{width}^D(\mathcal{L}) = 1$. However, the above theorem tells us more: the reason why in general $\mathrm{width}^N(\mathcal{L})$ is strictly smaller than $\mathrm{width}^D(\mathcal{L})$ is that in general, given an NFA $\mathcal{A}$, a co-lexicographic order $\le$ on $\mathcal{A}$ and a $\le$-chain partition $\{Q_i\}_{i=1}^p$, we have that $\{\mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i\}_{i=1}^p$ need not be a partition of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$. On the other hand, if in our NFA all equally-labeled edges leaving the same $\le$-chain end in the same $\le$-chain, then $\{\mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i\}_{i=1}^p$ is a partition of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$. This means that the inequality $\mathrm{width}^N(\mathcal{L}) < \mathrm{width}^D(\mathcal{L})$ does not depend on the expressive power of nondeterminism itself, but on the nondeterministic behaviour of the elements in the $\le$-chain partition.

We notice that our characterization implies the existence of a unique canonical minimum $\mathcal{P}$-sortable DFA.

▶ Corollary 25. Let $\mathcal{L}$ be a language. Let $\mathcal{P}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. If $\mathcal{L}$ is recognized by some $\mathcal{P}$-sortable DFA, then there exists a $\mathcal{P}$-sortable DFA $\mathcal{A}$ such that all $\mathcal{P}$-sortable DFAs recognizing $\mathcal{L}$ and non-isomorphic to $\mathcal{A}$ have a larger number of states. In other words, $\mathcal{A}$ is the minimum $\mathcal{P}$-sortable DFA recognizing $\mathcal{L}$.
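To make Definition 21 and Remark 22 concrete, the following is a minimal sketch on a toy NFA that is not taken from the paper. The automaton (`s -a-> u`, `s -a-> v`, `u -a-> u`), the names `DELTA`, `reached` and `prefix_cover`, and the chain partition `{s, v}, {u}` are all illustrative assumptions; the partition is assumed to be a chain partition for the order s < v < u, and the code does not verify the co-lexicographic axioms. It is also assumed that every state can reach a final state, so every readable string is a prefix of the language.

```python
from itertools import product

# Hypothetical toy NFA: s -a-> u, s -a-> v, u -a-> u.
DELTA = {("s", "a"): {"u", "v"}, ("u", "a"): {"u"}}
START = "s"
SIGMA = "a"

def reached(alpha):
    """Return I_alpha, the set of states reached from START by reading alpha."""
    current = {START}
    for ch in alpha:
        current = set().union(*(DELTA.get((q, ch), set()) for q in current))
    return current

def prefix_cover(chains, max_len):
    """For each chain Q_i, collect the strings alpha with |alpha| <= max_len
    such that I_alpha intersects Q_i (a bounded view of Pref(L(A))_i)."""
    cover = [set() for _ in chains]
    for n in range(max_len + 1):
        for letters in product(SIGMA, repeat=n):
            alpha = "".join(letters)
            states = reached(alpha)
            for i, chain in enumerate(chains):
                if states & chain:
                    cover[i].add(alpha)
    return cover

# Assumed chain partition (two chains of the total order s < v < u).
chains = [{"s", "v"}, {"u"}]
cover = prefix_cover(chains, 3)
overlap = cover[0] & cover[1]  # strings that belong to both components
```

In this sketch the string `a` reaches both `u` and `v`, so it lands in both components: the family is a cover but not a partition, which is the situation Remark 22 describes for NFAs.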
We can finally merge the results of the previous sections by observing that $\mathrm{width}^D(\mathcal{L})$ is nothing but the cardinality of a (smallest) partition of $\mathrm{Pref}(\mathcal{L})$ mapping to the linear components of a co-lexicographic order for some automaton recognizing $\mathcal{L}$.

▶ Corollary 26. Let $\mathcal{L}$ be a language. The following are equivalent:
1. $\mathrm{width}^D(\mathcal{L}) = p$.
2. There exists a partition $\mathcal{P}$ of $\mathrm{Pref}(\mathcal{L})$ having cardinality $p$ such that $\equiv^c_{\mathcal{L},\mathcal{P}}$ has finite index.
3. There exists a partition $\mathcal{P}$ of $\mathrm{Pref}(\mathcal{L})$ having cardinality $p$ such that $\mathcal{L}$ is the union of some classes of a $\mathcal{P}$-convex, right invariant equivalence relation on $\mathrm{Pref}(\mathcal{L})$ of finite index.

We proved that the concept of automaton's width allows us to define two non-collapsing hierarchies of regular languages, and that the levels of such hierarchies are meaningful complexity measures. Although level 1 in each of the two hierarchies denotes the same class of languages, the Wheeler ones, where we can also find unique minimal automata modulo isomorphism [1], we proved that this is no longer true for higher levels, where we have an exponential gap between the nondeterministic and the deterministic hierarchy. Moreover, we also proved that even the deterministic levels above level 1 lack the uniqueness of minimal automata. However, by fixing certain parameters (a partition of the prefixes of the language) we can retrieve a uniqueness result.

Our language-theoretic results find important applications to the study of regular expression matching algorithms: we showed that regular languages represented as NFAs in the low (logarithmic) levels of the nondeterministic hierarchy admit indexes supporting fast pattern matching queries.

In a paper in preparation we shall consider the following questions on the width notion.
1. Given a regular language $\mathcal{L}$ (say, by giving its minimum DFA), can we calculate its width in an effective way? Notice that the width of the minimum DFA does not, in general, reflect the width of the language already at level one: there are Wheeler languages for which the minimum DFA is not Wheeler (see [1]), so the question is not trivial.
2. As for other interesting subclasses of regular languages, Wheeler languages admit an automata-free characterization: a language $\mathcal{L}$ is Wheeler if and only if every monotone sequence in $(\mathrm{Pref}(\mathcal{L}), \preceq)$ is "thin", i.e. it is eventually contained in at most one Myhill-Nerode class [1]. Can we find a similar characterization for languages of width $p$, for $p > 1$?
3. Is it possible to derive the width of a language directly from some combinatorial/graph-theoretical property of the minimum DFA accepting $\mathcal{L}$?
4. Our lower bound of Lemma 10 and upper bound of Lemma 11 do not match. Can we improve this result by providing tight bounds for the separation between the deterministic and nondeterministic hierarchies?

In addition, our work opens further intriguing questions of more algorithmic flavor. For instance, can we devise a fast algorithm that, given a DFA, outputs the equivalent DFA of minimum width? Can we extend our efficient indexing strategies to higher levels of the nondeterministic hierarchy? Can we prove conditional lower bounds for the regular expression matching problem as a function of the language's width?
References
[1] Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Regular languages meet prefix sorting. In Proceedings of the 2020 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 911–930. doi:10.1137/1.9781611975994.55.
[2] Jarno Alanko, Giovanna D'Agostino, Alberto Policriti, and Nicola Prezza. Wheeler languages, 2020. arXiv:2002.10303.
[3] Amihood Amir, Moshe Lewenstein, and Noa Lewenstein. Pattern matching in hypertext. Journal of Algorithms, 35(1):82–99, 2000. doi:10.1007/3-540-63307-3_56.
[4] Arturs Backurs and Piotr Indyk. Which regular expression patterns are hard to match? Pages 457–466. IEEE, 2016. doi:10.1109/FOCS.2016.56.
[5] Nicola Cotumaccio and Nicola Prezza. On indexing and compressing finite automata. In Proceedings of the 2021 ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 2585–2599. doi:10.1137/1.9781611976465.153.
[6] R. P. Dilworth. A decomposition theorem for partially ordered sets. In Kenneth P. Bogart, Ralph Freese, and Joseph P. S. Kung, editors, The Dilworth Theorems: Selected Papers of Robert P. Dilworth, pages 7–12. Birkhäuser Boston, Boston, MA, 1990. doi:10.1007/978-1-4899-3558-8_1.
[7] Massimo Equi, Roberto Grossi, Veli Mäkinen, and Alexandru I. Tomescu. On the complexity of string matching for graphs. In Christel Baier, Ioannis Chatzigiannakis, Paola Flocchini, and Stefano Leonardi, editors, volume 132 of LIPIcs, pages 55:1–55:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ICALP.2019.55.
[8] Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. Conditional indexing lower bounds through self-reducibility. arXiv preprint arXiv:2002.00629, 2020.
[9] Massimo Equi, Veli Mäkinen, and Alexandru I. Tomescu. Graphs cannot be indexed in polynomial time for sub-quadratic time string matching, unless SETH fails. arXiv preprint arXiv:2002.00629, 2020.
[10] Travis Gagie, Giovanni Manzini, and Jouni Sirén. Wheeler graphs: A framework for BWT-based data structures. Theoretical Computer Science, 698:67–78, 2017. Algorithms, Strings and Theoretical Approaches in the Big Data Era (In Honor of the 60th Birthday of Professor Raffaele Giancarlo). doi:10.1016/j.tcs.2017.06.016.
[11] Daniel Gibney, Gary Hoppenworth, and Sharma V. Thankachan. Simple reductions from formula-SAT to pattern matching on labeled graphs and subtree isomorphism, 2020. arXiv:2008.11786.
[12] Russell Impagliazzo and Ramamohan Paturi. On the complexity of k-SAT. J. Comput. Syst. Sci., 62(2):367–375, March 2001. doi:10.1006/jcss.2000.1727.
[13] Ben Langmead, Cole Trapnell, Mihai Pop, and Steven L. Salzberg. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology, 10(3):R25, 2009. doi:10.1186/gb-2009-10-3-r25.
[14] Heng Li and Richard Durbin. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14):1754–1760, 2009. doi:10.1093/bioinformatics/btp324.
[15] Gonzalo Navarro and Veli Mäkinen. Compressed full-text indexes. ACM Comput. Surv., 39(1):2–es, April 2007. doi:10.1145/1216370.1216372.
[16] Aaron Potechin and Jeffrey Shallit. Lengths of words accepted by nondeterministic finite automata. Information Processing Letters, 162:105993, 2020. doi:10.1016/j.ipl.2020.105993.
A Proofs of Section 3
Lemma 4. Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be a DFA. For every $u, v \in Q$, let:
$$u \le v \iff u = v \lor (\forall \alpha \in I_u)(\forall \beta \in I_v)(\alpha \prec \beta).$$
Then, $\le$ is a co-lexicographic order. Moreover, for every co-lexicographic order $\le'$ on $\mathcal{A}$ and for every $u, v \in Q$, if $u \le' v$, then $u \le v$. We say that $\le$ is the maximal co-lexicographic order on $\mathcal{A}$.

Proof. It is straightforward to check that $\le$ is a partial order. Let us prove Axiom 1. If for some $u, v \in Q$ it holds $\lambda(u) < \lambda(v)$, then every string in $I_u$ ends with $\lambda(u)$ and every string in $I_v$ ends with $\lambda(v)$, hence we conclude $u < v$ (in particular, this works for $u = s$ also). Let us prove Axiom 2. Consider two edges $(u', u), (v', v) \in E$ such that $\lambda(u) = \lambda(v)$ and $u < v$. We want to prove that $u' < v'$. Fix $\alpha' \in I_{u'}$ and $\beta' \in I_{v'}$; we must prove that $\alpha' \prec \beta'$. Let $c = \lambda(u) = \lambda(v)$. We have $\alpha' c \in I_u$ and $\beta' c \in I_v$, so from $u < v$ it follows $\alpha' c \prec \beta' c$ and so $\alpha' \prec \beta'$.

Finally, let us prove that $\le$ is the maximal co-lexicographic order. Let $\le'$ be a co-lexicographic order on $\mathcal{A}$, and assume that $u <' v$; we must prove that $u < v$. Fix $\alpha \in I_u$ and $\beta \in I_v$; we must prove that $\alpha \prec \beta$. Since $I_u \cap I_v = \emptyset$ ($\mathcal{A}$ being a DFA), the conclusion follows from lemma 2. ◀

In the remaining part of this section we give a proof of Lemma 5. We start with some general considerations on partitions of the domain of a binary relation.

▶ Definition 27.
Let $R$ be a binary relation on the set $V$, and let $\{V_i\}_{i=1}^p$ be a partition of the set $V$. We say that $\{V_i\}_{i=1}^p$ is an $R$-comparable partition of $V$ if for every $i \in \{1, \dots, p\}$ and for every $x_1, x_2 \in V_i$ with $x_1 \neq x_2$ it holds $(x_1, x_2) \in R \lor (x_2, x_1) \in R$.

▶ Remark 28. If $(V, \le)$ is a partial order, then a partition of $V$ is a $\le$-comparable partition of $V$ if and only if it is a $\le$-chain partition of $V$.

▶ Lemma 29. Let $R$ be a binary relation on $V$, and let $\mathcal{U}$ be a family of nonempty subsets of $V$. Let $S$ be a relation on $\mathcal{U}$ such that, for every $U_1, U_2 \in \mathcal{U}$:
$$(\exists x \in U_1)(\exists y \in U_2)(\{x, y\} \not\subseteq U_1 \cap U_2 \land (x, y) \in R) \implies (U_1, U_2) \in S.$$
If $V$ admits an $R$-comparable partition of cardinality $p$, then $\mathcal{U}$ admits an $S$-comparable partition of cardinality at most $2^p - 1$.

Proof. Let $\{V_i\}_{i=1}^p$ be an $R$-comparable partition of $V$ of cardinality $p$. For every nonempty set $K \subseteq \{1, \dots, p\}$, define:
$$\mathcal{U}_K = \{U \in \mathcal{U} \mid (\forall i \in \{1, \dots, p\})(U \cap V_i \neq \emptyset \iff i \in K)\}.$$
Notice that every $U \in \mathcal{U}$ belongs to exactly one $\mathcal{U}_K$, so $\{\mathcal{U}_K \mid \emptyset \subsetneq K \subseteq \{1, \dots, p\}, \mathcal{U}_K \neq \emptyset\}$ is a partition of $\mathcal{U}$ having cardinality at most $2^p - 1$. As a consequence, we are just left with proving that each nonempty $\mathcal{U}_K$ consists of $S$-comparable elements. Fix $U_1, U_2 \in \mathcal{U}_K$, with $U_1 \neq U_2$. We must prove that $U_1$ and $U_2$ are $S$-comparable. Since $U_1 \neq U_2$, there exists either $x \in U_1 \setminus U_2$ or $y \in U_2 \setminus U_1$. Assume that there exists $x \in U_1 \setminus U_2$ (the other case is analogous). In particular, let $i \in \{1, \dots, p\}$ be the unique integer such that $x \in V_i$. Since $U_1, U_2 \in \mathcal{U}_K$, from the definition of $\mathcal{U}_K$ it follows that there exists $y \in U_2 \cap V_i$. Notice that $\{x, y\} \not\subseteq U_1 \cap U_2$ (so in particular $x \neq y$), and since $x, y \in V_i$ we conclude that $x$ and $y$ are $R$-comparable. If $(x, y) \in R$, we conclude $(U_1, U_2) \in S$, and if $(y, x) \in R$, we conclude $(U_2, U_1) \in S$, so $U_1$ and $U_2$ are $S$-comparable. ◀

▶ Lemma 30.
Let $(V, \le)$ be a finite partial order, with $n = |V|$, and let $\mathcal{U}$ be a family of nonempty subsets of $V$. Let $S$ be an antisymmetric relation on $\mathcal{U}$ such that, for every $U_1, U_2 \in \mathcal{U}$:
$$(\exists x \in U_1)(\exists y \in U_2)(\{x, y\} \not\subseteq U_1 \cap U_2 \land x < y) \implies (U_1, U_2) \in S.$$
If $(V, \le)$ admits a $\le$-chain partition of cardinality $p$, then $|\mathcal{U}| \le 2^p (n - p + 1) - 1$.

Proof. From the proof of lemma 29, we know that:
$$|\mathcal{U}| = \sum_{\emptyset \subsetneq K \subseteq \{1, \dots, p\}} |\mathcal{U}_K|. \quad (1)$$
Moreover, from lemma 29 we also know that for every $U_1, U_2 \in \mathcal{U}_K$, with $U_1 \neq U_2$, at least one between $(U_1, U_2) \in S$ and $(U_2, U_1) \in S$ holds true. Since $S$ is antisymmetric, we conclude that for every $U_1, U_2 \in \mathcal{U}_K$, with $U_1 \neq U_2$, exactly one between $(U_1, U_2) \in S$ and $(U_2, U_1) \in S$ holds true.

Fix $K$. For every $U \in \mathcal{U}_K$ and for every $i \in K$, let $m^i_U$ be the smallest element of $U \cap V_i$ (this makes sense because $(V_i, \le)$ is totally ordered) and let $|m^i_U|$ be the position of $m^i_U$ in the total order $(V_i, \le)$ (in particular, this means that $|m^i_U| \in \{1, \dots, |V_i|\}$). Similarly, let $M^i_U$ be the largest element of $U \cap V_i$ and let $|M^i_U|$ be the position of $M^i_U$ in the total order $(V_i, \le)$.

Fix $U_1, U_2 \in \mathcal{U}_K$, with $U_1 \neq U_2$. Note the following:
1. Assume that for some $i \in K$ it holds $m^i_{U_1} < m^i_{U_2} \lor M^i_{U_1} < M^i_{U_2}$. Then, it must be $(U_1, U_2) \in S$. Indeed, assume that $m^i_{U_1} < m^i_{U_2}$ (the other case is analogous). Then, we have $m^i_{U_1} \in U_1$, $m^i_{U_2} \in U_2$, $\{m^i_{U_1}, m^i_{U_2}\} \not\subseteq U_1 \cap U_2$ and $m^i_{U_1} < m^i_{U_2}$, so we conclude $(U_1, U_2) \in S$. Notice that we can equivalently state that if $(U_1, U_2) \in S$, then $(\forall i \in K)(|m^i_{U_1}| \le |m^i_{U_2}| \land |M^i_{U_1}| \le |M^i_{U_2}|)$.
2. Assume that for some $i \in K$ it holds $|m^i_{U_1}| = |m^i_{U_2}| \land |M^i_{U_1}| = |M^i_{U_2}|$. Then, it must be $U_1 \cap V_i = U_2 \cap V_i$. Indeed, suppose by contradiction that $U_1 \cap V_i \neq U_2 \cap V_i$. This means that there exists $x \in (U_1 \cap V_i) \setminus U_2$ or $y \in (U_2 \cap V_i) \setminus U_1$. Assume that there exists $x \in (U_1 \cap V_i) \setminus U_2$ (the other case is analogous). Notice that $m^i_{U_1} = m^i_{U_2} \land M^i_{U_1} = M^i_{U_2}$, so it must be $m^i_{U_1} = m^i_{U_2} < x < M^i_{U_1} = M^i_{U_2}$. Now, we have $x \in U_1$, $m^i_{U_2} \in U_2$, $\{x, m^i_{U_2}\} \not\subseteq U_1 \cap U_2$ and $m^i_{U_2} < x$, so $(U_2, U_1) \in S$. On the other hand, we have $x \in U_1$, $M^i_{U_2} \in U_2$, $\{x, M^i_{U_2}\} \not\subseteq U_1 \cap U_2$ and $x < M^i_{U_2}$, so $(U_1, U_2) \in S$, a contradiction.
3. Assume that $(\forall i \in K)(|m^i_{U_1}| = |m^i_{U_2}| \land |M^i_{U_1}| = |M^i_{U_2}|)$. Then, it must be $U_1 = U_2$. Indeed, from point 2 we obtain $(\forall i \in K)(U_1 \cap V_i = U_2 \cap V_i)$, so $U_1 = \bigcup_{i \in K}(U_1 \cap V_i) = \bigcup_{i \in K}(U_2 \cap V_i) = U_2$. Notice that we can equivalently state that if $U_1 \neq U_2$, then $(\exists i \in K)(|m^i_{U_1}| \neq |m^i_{U_2}| \lor |M^i_{U_1}| \neq |M^i_{U_2}|)$.

Now it is easy to show that:
$$(U_1, U_2) \in S \iff (\forall i \in K)(|m^i_{U_1}| \le |m^i_{U_2}| \land |M^i_{U_1}| \le |M^i_{U_2}|) \land (\exists i \in K)(|m^i_{U_1}| < |m^i_{U_2}| \lor |M^i_{U_1}| < |M^i_{U_2}|). \quad (2)$$
Indeed, ($\Leftarrow$) follows from point 1. As for ($\Rightarrow$), notice that $(\forall i \in K)(|m^i_{U_1}| \le |m^i_{U_2}| \land |M^i_{U_1}| \le |M^i_{U_2}|)$ again follows from point 1, whereas $(\exists i \in K)(|m^i_{U_1}| < |m^i_{U_2}| \lor |M^i_{U_1}| < |M^i_{U_2}|)$ follows from point 3.

For every $U \in \mathcal{U}_K$, define:
$$T(U) = \sum_{i \in K} (|m^i_U| + |M^i_U|).$$
Pick any $U_1, U_2 \in \mathcal{U}_K$, with $U_1 \neq U_2$. By equation 2, if $(U_1, U_2) \in S$, then $T(U_1) < T(U_2)$, and if $(U_2, U_1) \in S$, then $T(U_2) < T(U_1)$. In particular, we always have $T(U_1) \neq T(U_2)$. Since for every $U \in \mathcal{U}_K$ we have $2|K| \le T(U) \le 2 \sum_{i \in K} |V_i|$, then:
$$|\mathcal{U}_K| \le 2 \sum_{i \in K} |V_i| - 2|K| + 1. \quad (3)$$
From equations 1 and 3, we obtain:
$$|\mathcal{U}| \le \sum_{\emptyset \subsetneq K \subseteq \{1, \dots, p\}} \Big(2 \sum_{i \in K} |V_i| - 2|K| + 1\Big) = 2 \sum_{\emptyset \subsetneq K \subseteq \{1, \dots, p\}} \sum_{i \in K} |V_i| - 2 \sum_{\emptyset \subsetneq K \subseteq \{1, \dots, p\}} |K| + \sum_{\emptyset \subsetneq K \subseteq \{1, \dots, p\}} 1.$$
Notice that $\sum_{\emptyset \subsetneq K \subseteq \{1, \dots, p\}} \sum_{i \in K} |V_i| = 2^{p-1} \sum_{i \in \{1, \dots, p\}} |V_i| = 2^{p-1} n$ because every $i \in \{1, \dots, p\}$ occurs in exactly $2^{p-1}$ subsets of $\{1, \dots, p\}$. Similarly, we obtain $\sum_{\emptyset \subsetneq K \subseteq \{1, \dots, p\}} |K| = 2^{p-1} p$ and $\sum_{\emptyset \subsetneq K \subseteq \{1, \dots, p\}} 1 = 2^p - 1$. We conclude:
$$|\mathcal{U}| \le 2^p n - 2^p p + 2^p - 1 = 2^p (n - p + 1) - 1. \quad ◀$$

Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA. Recall that the powerset automaton $\mathcal{A}^* = (Q^*, E^*, \Sigma, s^*, F^*)$ of $\mathcal{A}$ is the DFA defined as follows: (i) $Q^* = \{I_\alpha \mid \alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))\}$, (ii) $E^* = \{(I_\alpha, I_{\alpha e}, e) \mid \alpha \in \Sigma^*, e \in \Sigma, \alpha e \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))\}$, (iii) $s^* = \{s\}$, and (iv) $F^* = \{I_\alpha \mid \alpha \in \mathcal{L}(\mathcal{A})\}$. In particular, $\mathcal{L}(\mathcal{A}^*) = \mathcal{L}(\mathcal{A})$.

Lemma 5.
Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA, with $|Q| = n$ and $\mathrm{width}(\mathcal{A}) = p$. Let $\mathcal{A}^* = (Q^*, E^*, \Sigma, s^*, F^*)$ be the powerset automaton of $\mathcal{A}$. Then, $|Q^*| \le 2^p (n - p + 1) - 1$.

Proof. Let $\le$ be a co-lexicographic order on $\mathcal{A}$ such that there exists a $\le$-chain partition of cardinality $p$. Let $\le^*$ be the maximal co-lexicographic order on $\mathcal{A}^*$, that is, for every $I_\alpha, I_\beta \in Q^*$:
$$I_\alpha <^* I_\beta \iff (\forall \alpha', \beta' \in \mathrm{Pref}(\mathcal{L}(\mathcal{A})))((I_{\alpha'} = I_\alpha) \land (I_{\beta'} = I_\beta) \to \alpha' \prec \beta').$$
We check that the conditions of Lemma 30 are satisfied with $V = Q$, $\mathcal{U} = \{I_\alpha \mid \alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))\}$, and $\le$, $S = \le^*$ as above. Fix $I_\alpha, I_\beta \in Q^*$, and assume that there exist $u \in I_\alpha$ and $v \in I_\beta$ such that $\{u, v\} \not\subseteq I_\alpha \cap I_\beta$ and $u < v$. We want to prove that $I_\alpha <^* I_\beta$. Fix any $\alpha', \beta' \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ such that $I_{\alpha'} = I_\alpha$ and $I_{\beta'} = I_\beta$. From Lemma 2 we conclude $\alpha' \prec \beta'$, as desired. This means that we can apply Lemma 30, so the conclusion follows. ◀

B Proofs of Section 4
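The powerset automaton recalled before Lemma 5, which is also the object of Lemma 11, can be explored with a short breadth-first search. The sketch below is illustrative only: the toy NFA (`s -a-> u`, `s -a-> v`, `u -a-> u`) and the names `powerset_states` and `lemma5_bound` are assumptions, not taken from the paper. It is assumed that every state can reach a final state (so every nonempty reachable set is some $I_\alpha$) and that the toy NFA has width 1, witnessed by the total order s < v < u; the code does not compute widths.

```python
from collections import deque

def powerset_states(delta, start, sigma):
    """Return the reachable nonempty state sets I_alpha of the powerset automaton."""
    first = frozenset({start})
    seen = {first}
    queue = deque([first])
    while queue:
        current = queue.popleft()
        for ch in sigma:
            nxt = frozenset(t for q in current for t in delta.get((q, ch), ()))
            if nxt and nxt not in seen:  # the empty set is not a state of A*
                seen.add(nxt)
                queue.append(nxt)
    return seen

def lemma5_bound(n, p):
    """Upper bound 2^p (n - p + 1) - 1 on |Q*| stated in Lemma 5."""
    return 2 ** p * (n - p + 1) - 1

# Hypothetical toy NFA with n = 3 states; width p = 1 is an assumption.
delta = {("s", "a"): {"u", "v"}, ("u", "a"): {"u"}}
states = powerset_states(delta, "s", "a")
bound = lemma5_bound(3, 1)
```

On this toy input the search finds the sets {s}, {u, v} and {u}, which stays below both the Lemma 5 bound and the generic bound of $2^n - 1$ nonempty subsets.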
Lemma 11. Let $\mathcal{A}$ be an NFA and let $\mathcal{A}^*$ be the powerset automaton of $\mathcal{A}$. Then, $\mathrm{width}(\mathcal{A}^*) \le 2^{\mathrm{width}(\mathcal{A})} - 1$.

Proof. The proof is analogous to that of lemma 5; the only difference is that now we must use lemma 29 instead of lemma 30. ◀

C Proofs of Section 5
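Lemma 14. There exists a regular language $\mathcal{L}$ such that $\mathrm{width}^N(\mathcal{L}) = \mathrm{width}^D(\mathcal{L}) = 2$. There exist two non-isomorphic DFAs $\mathcal{A}_1$ and $\mathcal{A}_2$ with the same number of states such that $\mathcal{L}(\mathcal{A}_1) = \mathcal{L}(\mathcal{A}_2) = \mathcal{L}$, $\mathrm{width}(\mathcal{A}_1) = \mathrm{width}(\mathcal{A}_2) = 2$ and no 2-sortable DFA with fewer states recognizes $\mathcal{L}$.

Proof.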
In the proof we shall use the following remark:

▶ Remark 31. A partition $Q_1, \dots, Q_p$ of the states of a DFA $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ is a chain partition w.r.t. the maximal co-lexicographic order if and only if the following holds: for any pair $q, q'$ of distinct states belonging to the same component of the partition, if $\lambda(q) = \lambda(q')$ then either $I_q \prec I_{q'}$ or $I_{q'} \prec I_q$.

Consider the regular language $\mathcal{L}$ accepted by the (input-consistent) automaton $\mathcal{A}$ in Figure 4, where the initial state is labelled by . $\mathcal{A}$ is the minimum automaton for the language accepted (one can easily check that for each pair of states there is a word which is readable from one state and not from the other). Using [2, Thm. 3.1] we easily see that $\mathcal{L}$ is not Wheeler, since the infinite monotone sequence
$$ac \prec bc \prec acc \prec bcc \prec accc \prec \dots \prec ac^n \prec bc^n \prec \dots$$
flips indefinitely through two different states of the minimum automaton. It follows that the width (deterministic and nondeterministic) of $\mathcal{L}$ is greater than one.

Figure 4 The minimum automaton for $\mathcal{L}$.

Let us check that there is no 2-sortable automaton with just one state more than the minimum automaton recognizing the same language. By the Myhill-Nerode Theorem for regular languages we know that any automaton recognizing a language can be obtained from the minimum automaton by "dividing" some states. Let $q_1, q_2, q_3, q_4$ be the nodes of the minimum automaton labelled by $d$, from left to right. Then
$$I_{q_1} = \{d, fd\}, \quad I_{q_2} = \{ed, ffd\}, \quad I_{q_3} = \{eed, fffd\}, \quad I_{q_4} = \{eeed\}.$$
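The four sets of strings reaching the $d$-labelled states, listed above, can be compared mechanically under the order of Lemma 4. The following is a small sketch under stated assumptions: the states are named `q1` to `q4` from left to right, the alphabet is ordered alphabetically (in particular e < f), and co-lexicographic comparison is implemented as ordinary string comparison on reversed strings. The helper names `colex_key`, `precedes` and `comparable` are illustrative.

```python
from itertools import combinations

def colex_key(s):
    """Co-lexicographic order compares strings from right to left."""
    return s[::-1]

def precedes(a, b):
    """True if every string in a is co-lexicographically smaller than every string in b."""
    return max(map(colex_key, a)) < min(map(colex_key, b))

# Strings reaching the four d-labelled states (names q1..q4 are assumed).
I = {
    "q1": {"d", "fd"},
    "q2": {"ed", "ffd"},
    "q3": {"eed", "fffd"},
    "q4": {"eeed"},
}

def comparable(x, y):
    """Two states are comparable in the maximal order if one set precedes the other."""
    return precedes(I[x], I[y]) or precedes(I[y], I[x])

incomparable_pairs = [
    (x, y) for x, y in combinations(sorted(I), 2) if not comparable(x, y)
]
```

Under these assumptions every pair of the four states turns out to be incomparable, because the two-element sets interleave with each other and each of them straddles the single string of `q4`.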
Notice that the states $q_1, q_2, q_3$ are pairwise incomparable with respect to the maximal co-lexicographic order. Hence, one among $q_1, q_2, q_3$ should be "divided" if we want to obtain a 2-sortable automaton. However, if we divide just one of these states, the remaining two and the state $q_4$ would still be reached by the same words as in the minimum automaton and would still be pairwise incomparable w.r.t. the maximal co-lexicographic order of the DFA. It follows that a 2-sortable automaton recognizing the language should have at least 2 states more than the minimum automaton.

We next show that we can indeed realize 2-sortability by dividing two states of the minimum automaton (in two different ways).

Consider the automata in figures 5 and 6: these automata recognize the language $\mathcal{L}(\mathcal{A})$, and they are 2-sortable, because the maximal co-lexicographic order on their states admits a chain partition with two components. Indeed, using Remark 31, we can easily check the 2-sortability of each automaton; in the pictures, some states are colored either with light gray or dark gray depending on the component to which they belong (the non-colored states can be easily assigned to one of the two components in such a way that Axiom 2 holds true). The reader is invited to check that, in both automata, all relevant states are colored and that states with the same color can be linearly ordered by the maximal co-lexicographic ordering.

Figure 5 A 2-sortable automaton recognizing $\mathcal{L}$.

To finish the proof of the lemma, notice that both automata have two states more than the minimum automaton, but they are not isomorphic. ◀

We now state and prove some lemmas preliminary to our co-lexicographic extension of the Myhill-Nerode theorem (Thm. 24).
Figure 6 Another 2-sortable automaton recognizing $\mathcal{L}$.

Lemma 18. Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language, and let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. Let $\sim$ be an equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that respects $\mathrm{Pref}(\mathcal{L})$. For every $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$, define:
$$\alpha \sim_{\mathcal{P}} \beta \iff (\alpha \sim \beta) \land (\forall \phi \in \Sigma^*)(\alpha\phi \in \mathrm{Pref}(\mathcal{L}) \to U_{\alpha\phi} = U_{\beta\phi}).$$
Then, $\sim_{\mathcal{P}}$ is the coarsest $\mathcal{P}$-consistent equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that refines $\sim$. We say that $\sim_{\mathcal{P}}$ is the $\mathcal{P}$-consistent refinement of $\sim$.

Proof. First, notice that $\sim_{\mathcal{P}}$ is well-defined. Indeed, $\beta\phi \in \mathrm{Pref}(\mathcal{L})$ follows from the assumptions on $\sim$, so the expression $U_{\beta\phi}$ makes sense. From the same assumptions it straightforwardly follows that $\sim_{\mathcal{P}}$ is an equivalence relation.

It is immediate to check that every $\mathcal{P}$-consistent equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that refines $\sim$ must also refine $\sim_{\mathcal{P}}$, so we only have to check that $\sim_{\mathcal{P}}$ is $\mathcal{P}$-consistent. Assume that $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$ satisfy $\alpha \sim_{\mathcal{P}} \beta$, and assume that $\alpha\phi \in \mathrm{Pref}(\mathcal{L})$. Then, $\beta\phi \in \mathrm{Pref}(\mathcal{L})$ follows from the assumptions on $\sim$, and $U_{\alpha\phi} = U_{\beta\phi}$ follows from $\alpha \sim_{\mathcal{P}} \beta$. ◀

Lemma 19.
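Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language, and let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. Let $\sim$ be a $\mathcal{P}$-consistent equivalence relation on $\mathrm{Pref}(\mathcal{L})$. For every $\alpha, \gamma \in \mathrm{Pref}(\mathcal{L})$, define:
$$\alpha \sim^c \gamma \iff (\alpha \sim \gamma) \land (\forall \beta, \phi \in \Sigma^*)\big((\alpha\phi, \beta\phi \in \mathrm{Pref}(\mathcal{L})) \land (U_{\alpha\phi} = U_{\beta\phi}) \land (\min\{\alpha, \gamma\} \prec \beta \prec \max\{\alpha, \gamma\}) \to \alpha\phi \sim \beta\phi\big).$$
Then, $\sim^c$ is a $\mathcal{P}$-consistent and $\mathcal{P}$-convex equivalence relation on $\mathrm{Pref}(\mathcal{L})$, and it is the coarsest $\mathcal{P}$-convex equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that refines $\sim$. We say that $\sim^c$ is the $\mathcal{P}$-convex refinement of $\sim$.

Proof. Let us prove that $\sim^c$ is an equivalence relation. First, notice that $\sim^c$ is trivially reflexive, because for every $\alpha \in \mathrm{Pref}(\mathcal{L})$ we have $\min(\alpha, \alpha) = \max(\alpha, \alpha) = \alpha$, so one immediately obtains $\alpha \sim^c \alpha$. Now, let us prove that $\sim^c$ is symmetric. Assume that $\alpha \sim^c \gamma$; we want to prove that $\gamma \sim^c \alpha$. To this end, it suffices to observe that if $\beta, \phi \in \Sigma^*$ satisfy $\gamma\phi, \beta\phi \in \mathrm{Pref}(\mathcal{L})$ and $U_{\gamma\phi} = U_{\beta\phi}$, then we obtain $\alpha\phi, \beta\phi \in \mathrm{Pref}(\mathcal{L})$ and $U_{\alpha\phi} = U_{\beta\phi}$ because $\sim$ is $\mathcal{P}$-consistent. Let us prove that $\sim^c$ is transitive. Assume that $\alpha \sim^c \gamma$ and $\gamma \sim^c \delta$. We want to prove that $\alpha \sim^c \delta$. First, $\alpha \sim \delta$ follows from the transitivity of $\sim$. Now, assume that $\beta, \phi \in \Sigma^*$ satisfy $\alpha\phi, \beta\phi \in \mathrm{Pref}(\mathcal{L})$, $U_{\alpha\phi} = U_{\beta\phi}$ and $\min\{\alpha, \delta\} \prec \beta \prec \max\{\alpha, \delta\}$. We must prove that $\alpha\phi \sim \beta\phi$. Since $\sim$ is $\mathcal{P}$-consistent, we obtain $\alpha\phi, \beta\phi, \gamma\phi, \delta\phi \in \mathrm{Pref}(\mathcal{L})$ and $U_{\alpha\phi} = U_{\beta\phi} = U_{\gamma\phi} = U_{\delta\phi}$. Notice that either $\min\{\alpha, \gamma\} \prec \beta \prec \max\{\alpha, \gamma\}$ or $\min\{\gamma, \delta\} \prec \beta \prec \max\{\gamma, \delta\}$, so the conclusion follows from either $\alpha \sim^c \gamma$ or $\gamma \sim^c \delta$.

Next, observe that $\sim^c$ is $\mathcal{P}$-consistent because it is a refinement of $\sim$, which is $\mathcal{P}$-consistent.

Notice that every $\mathcal{P}$-convex equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that refines $\sim$ must also refine $\sim^c$. As a consequence, we only have to prove that $\sim^c$ is $\mathcal{P}$-convex.

Fix $\alpha \in \mathrm{Pref}(\mathcal{L})$. We must prove that $[\alpha]_{\sim^c}$ is a convex set in $(U_\alpha, \preceq)$. First, notice that $[\alpha]_{\sim^c} \subseteq U_\alpha$ because $\sim^c$ is $\mathcal{P}$-consistent. Now, fix $\beta, \gamma, \delta \in U_\alpha$ such that $\beta \prec \gamma \prec \delta$ and $\beta, \delta \in [\alpha]_{\sim^c}$. We must prove that $\gamma \in [\alpha]_{\sim^c}$. In other words, starting from $U_\beta = U_\gamma = U_\delta$, $\beta \prec \gamma \prec \delta$ and $\beta \sim^c \delta$ we must prove that $\beta \sim^c \gamma$. First, notice that $\beta \sim \gamma$ follows from $\beta \sim^c \delta$. Now, fix $\beta', \phi \in \Sigma^*$ such that $\beta\phi, \beta'\phi \in \mathrm{Pref}(\mathcal{L})$, $U_{\beta\phi} = U_{\beta'\phi}$ and $\beta \prec \beta' \prec \gamma$ (and so $\beta \prec \beta' \prec \delta$). Then, $\beta\phi \sim \beta'\phi$ follows once again from $\beta \sim^c \delta$. ◀

From lemmas 18 and 19 we immediately obtain the following corollary.

▶ Corollary 32.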
Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language, and let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. Let $\sim$ be an equivalence relation on $\mathrm{Pref}(\mathcal{L})$ that respects $\mathrm{Pref}(\mathcal{L})$. Let $\sim_{\mathcal{P}}$ be the $\mathcal{P}$-consistent refinement of $\sim$ and let $\sim^c_{\mathcal{P}}$ be the $\mathcal{P}$-convex refinement of $\sim_{\mathcal{P}}$. Then, $\sim^c_{\mathcal{P}}$ is the coarsest refinement of $\sim$ being both $\mathcal{P}$-consistent and $\mathcal{P}$-convex.

Notice that both in the definition of $\mathcal{P}$-consistent refinement (lemma 18) and in the definition of $\mathcal{P}$-convex refinement (lemma 19) we have introduced a string $\phi \in \Sigma^*$ that "propagates" the considered properties. The intuition is that we want to ensure that right-invariance is preserved (basically, we are defining the right-invariant refinement). More precisely, we have the following corollary, which is easily proved.

▶ Corollary 33. The $\mathcal{P}$-consistent refinement of a right-invariant equivalence relation is right-invariant. The $\mathcal{P}$-convex refinement of a right-invariant equivalence relation is right-invariant.

The following definition interprets the requirement that automata must be input-consistent from an equivalence relation perspective.

▶ Definition 34.
Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language, and let $\sim$ be an equivalence relation on $\mathrm{Pref}(\mathcal{L})$. We say that $\sim$ is input-consistent if for every $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$, if $\alpha \sim \beta$, then $\mathrm{end}(\alpha) = \mathrm{end}(\beta)$. Note that $[\epsilon]_\sim = \{\epsilon\}$.

It is immediate to derive the following lemma.

▶ Lemma 35. Let $\mathcal{L} \subseteq \Sigma^*$ be a regular language, and let $\sim$ be an equivalence relation on $\mathrm{Pref}(\mathcal{L})$. For every $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$, define:
$$\alpha \sim^* \beta \iff (\alpha \sim \beta) \land (\mathrm{end}(\alpha) = \mathrm{end}(\beta)).$$
Then, $\sim^*$ is the coarsest input-consistent equivalence relation on $\mathrm{Pref}(\mathcal{L})$ refining $\sim$. We say that $\sim^*$ is the input-consistent refinement of $\sim$.

▶ Remark 36.
Assume that $\sim$ has some of the following properties: finite index, right-invariance, $\mathcal{P}$-convexity (for some partition $\mathcal{P}$ of $\mathrm{Pref}(\mathcal{L})$). Then, it is easy to check that these properties are inherited by $\sim^*$.

Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA, let $\le$ be a co-lexicographic order on $\mathcal{A}$ and let $\{Q_i\}_{i=1}^p$ be a $\le$-chain partition of $Q$. For every $\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ and for every $i \in \{1, \dots, p\}$, we define:
$$I^i_\alpha = I_\alpha \cap Q_i.$$
In particular, we can restate definition 21 as follows:
$$\mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i = \{\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A})) \mid I^i_\alpha \neq \emptyset\}.$$
Note that for every $\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ we have that $\{I^i_\alpha \mid i \in \{1, \dots, p\}, I^i_\alpha \neq \emptyset\}$ is a partition of $I_\alpha$.

▶ Remark 37. Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be a $\mathcal{P}$-sortable NFA, where $\mathcal{P} = \{U_1, \dots, U_p\}$ is a partition of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$. Let $\le$ be a co-lexicographic order on $\mathcal{A}$ and let $\{Q_i\}_{i=1}^p$ be a $\le$-chain partition witnessing that $\mathcal{A}$ is $\mathcal{P}$-sortable. Then, for every $\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ there exists exactly one $i \in \{1, \dots, p\}$ such that $I^i_\alpha \neq \emptyset$ (otherwise $\mathcal{P}$ would not be a partition).

▶ Lemma 38.
Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA, let $\le$ be a co-lexicographic order on $\mathcal{A}$, and let $\{Q_i\}_{i=1}^p$ be a $\le$-chain partition of $Q$. Fix $i \in \{1, \dots, p\}$ and fix $u \in Q_i$. Then, $I_u$ is a convex set in $(\mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i, \preceq)$.

Proof. Fix $\alpha, \beta, \gamma \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i$ such that $\alpha \prec \beta \prec \gamma$ and $\alpha, \gamma \in I_u$. We must prove that $\beta \in I_u$. Assume by contradiction that $\beta \notin I_u$. Since $\beta \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i$, there exists $v \in Q_i$ such that $\beta \in I_v$. Notice that we have $\alpha \in I_u$, $\beta \in I_v$, $\{\alpha, \beta\} \not\subseteq I_u \cap I_v$ and $\alpha \prec \beta$, so by lemma 2 we conclude $u < v$ or $u \parallel v$. Since $u, v \in Q_i$ and $Q_i$ is a $\le$-chain, we conclude $u < v$. By using $\beta \prec \gamma$, one analogously obtains $v < u$, a contradiction. ◀

▶ Lemma 39. Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA, let $\le$ be a co-lexicographic order on $\mathcal{A}$, and let $\{Q_i\}_{i=1}^p$ be a $\le$-chain partition of $Q$. Fix $i \in \{1, \dots, p\}$ and fix $\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$. Then, the set $\{\gamma \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i \mid I^i_\gamma = I^i_\alpha\}$ is a convex set in $(\mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i, \preceq)$.

Proof. We have to prove that if $\alpha, \beta, \gamma \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i$ satisfy $\alpha \prec \beta \prec \gamma$ and $I^i_\alpha = I^i_\gamma$, then $I^i_\alpha = I^i_\beta$.

First, let us show that $I^i_\alpha \subseteq I^i_\beta$. Pick $u \in I^i_\alpha$. From $I^i_\alpha = I^i_\gamma$ it follows $\alpha, \gamma \in I_u$, so since $\alpha \prec \beta \prec \gamma$ and $\beta \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i$ we conclude $\beta \in I_u$ (or equivalently, $u \in I^i_\beta$) by lemma 38.

Now suppose by contradiction that $I^i_\beta \not\subseteq I^i_\alpha$. Let $v \in I^i_\beta \setminus I^i_\alpha$. Since $\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))_i$, we can pick $u \in I^i_\alpha$. Notice that $u \in I_\alpha$, $v \in I_\beta$, $\{u, v\} \not\subseteq I_\alpha \cap I_\beta$ and $\alpha \prec \beta$, so by lemma 2 we conclude $(u < v) \lor (u \parallel v)$. Since $u, v \in Q_i$ and $Q_i$ is a $\le$-chain, we conclude $u < v$. By using $\beta \prec \gamma$, one analogously obtains $v < u$, a contradiction. ◀

▶ Definition 40.
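Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be an NFA. Define the equivalence relation $\sim_{\mathcal{A}}$ on $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ as follows. For every $\alpha, \beta \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$, define:
$$\alpha \sim_{\mathcal{A}} \beta \iff I_\alpha = I_\beta.$$

▶ Lemma 41. Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be a $\mathcal{P}$-sortable NFA, where $\mathcal{P} = \{U_1, \dots, U_p\}$ is a partition of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$. Then, $\sim_{\mathcal{A}}$ respects $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$, and it is $\mathcal{P}$-consistent and $\mathcal{P}$-convex. Moreover, $\sim_{\mathcal{A}}$ is right-invariant, input-consistent and its index is finite.

Proof. Assume that $\alpha \sim_{\mathcal{A}} \beta$. This means that if we read $\alpha$ and $\beta$ on $\mathcal{A}$, then we reach the same set of states. As a consequence, if $\phi \in \Sigma^*$ satisfies $\alpha\phi \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$, then it must be $\beta\phi \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ and $I_{\alpha\phi} = I_{\beta\phi}$, which proves that $\sim_{\mathcal{A}}$ respects $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ and is $\mathcal{P}$-consistent. Finally, $\sim_{\mathcal{A}}$ is $\mathcal{P}$-convex because for every $\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}))$ we have that $\{\gamma \in \mathrm{Pref}(\mathcal{L}(\mathcal{A})) \mid I_\gamma = I_\alpha\}$ is a convex set in $(U_\alpha, \preceq)$ by Lemma 39 (see Remark 37). The remaining properties are straightforward to prove. ◀

Considering now a $\mathcal{P}$-sortable NFA $\mathcal{A}$, for a given partition $\mathcal{P}$ of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$, we compare the equivalence $\sim_{\mathcal{A}}$ with the $\mathcal{P}$-consistent, $\mathcal{P}$-convex refinement $\equiv^c_{\mathcal{L}(\mathcal{A}),\mathcal{P}}$ of the Myhill-Nerode equivalence $\equiv_{\mathcal{L}(\mathcal{A})}$.

▶ Lemma 42. Let $\mathcal{A} = (Q, s, \Sigma, \delta, F)$ be a $\mathcal{P}$-sortable NFA, where $\mathcal{P} = \{U_1, \dots, U_p\}$ is a partition of $\mathrm{Pref}(\mathcal{L}(\mathcal{A}))$. Then, $\sim_{\mathcal{A}}$ is a refinement of $\equiv^c_{\mathcal{L}(\mathcal{A}),\mathcal{P}}$.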
Proof.
It is immediate to realize that ∼ A is a refinement of ≡ L . Moreover ∼ A is P -consistentand P -convex by lemma 41. On the other hand, by corollary 32 we have that ≡ c L ( A ) , P isthe coarsest refinement of ≡ L ( A ) being both P -consistent and P -convex, so the conclusionfollows. ◀▶ Lemma 43.
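Let $\mathcal{L} \subseteq \Sigma^*$ be a language, and let $\mathcal{P} = \{U_1, \dots, U_p\}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. Assume that $\mathcal{L}$ is the union of some classes of a $\mathcal{P}$-convex, input-consistent, right-invariant equivalence relation $\sim$ on $\mathrm{Pref}(\mathcal{L})$ of finite index. Then, $\mathcal{L}$ is recognized by a $\mathcal{P}$-sortable DFA $\mathcal{A}_\sim = (Q_\sim, s_\sim, \Sigma, E_\sim, F_\sim)$ such that:
- $|Q_\sim|$ is equal to the index of $\sim$;
- $\sim_{\mathcal{A}_\sim}$ and $\sim$ are the same equivalence relation (in particular, $|Q_\sim|$ is equal to the index of $\sim_{\mathcal{A}_\sim}$).
Moreover, if $\mathcal{B}$ is a $\mathcal{P}$-sortable DFA that recognizes $\mathcal{L}$, then $\mathcal{A}_{\sim_{\mathcal{B}}}$ is isomorphic to $\mathcal{B}$.
Proof.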
Define the DFA $\mathcal{A}_\sim = (Q_\sim, s_\sim, \Sigma, E_\sim, F_\sim)$ as follows.
- $Q_\sim = \{[\alpha]_\sim \mid \alpha \in \mathrm{Pref}(\mathcal{L})\}$;
- $s_\sim = [\epsilon]_\sim$, where $\epsilon$ is the empty string;
- $E_\sim = \{([\alpha]_\sim, [\alpha a]_\sim, a) \mid \alpha \in \Sigma^*, a \in \Sigma, \alpha a \in \mathrm{Pref}(\mathcal{L})\}$;
- $F_\sim = \{[\alpha]_\sim \mid \alpha \in \mathcal{L}\}$.
It is easy to show (see e.g. [2, Thm 2.16]) that $\mathcal{A}_\sim$ is a well-defined DFA that satisfies the properties assumed throughout this paper (input-consistency...) and such that $\mathcal{L}(\mathcal{A}_\sim) = \mathcal{L}$. In particular, for all $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$, it holds:
$\alpha \in [\beta]_\sim \iff \delta_\sim(s_\sim, \alpha) = [\beta]_\sim$. (4)
Note that so far we have not used $\mathcal{P}$-convexity.
Let $\le$ be the maximal co-lexicographic order on $\mathcal{A}_\sim$ (see Lemma 4). Notice that from equation 4 it follows that for every $\alpha, \beta \in \mathrm{Pref}(\mathcal{L})$:
$[\alpha]_\sim < [\beta]_\sim \iff (\forall \alpha' \in [\alpha]_\sim)(\forall \beta' \in [\beta]_\sim)(\alpha' \prec \beta')$.
For every $i \in \{1, \dots, p\}$, define:
$Q_i = \{[\alpha]_\sim \mid U_\alpha = U_i\}$.
Since $\sim$ is $\mathcal{P}$-convex, $Q_i$ is well-defined and it is a $\le$-chain, so $\{Q_i\}_{i=1}^{p}$ is a $\le$-chain partition of $Q_\sim$.
Finally, from equation 4 we obtain:
$\mathrm{Pref}(\mathcal{L}(\mathcal{A}_\sim))^i = \{\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}_\sim)) \mid \delta_\sim(s_\sim, \alpha) \in Q_i\}$
$= \{\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}_\sim)) \mid (\exists [\beta]_\sim \in Q_i)(\alpha \in [\beta]_\sim)\}$
$= \{\alpha \in \mathrm{Pref}(\mathcal{L}(\mathcal{A}_\sim)) \mid U_\alpha = U_i\} = U_i$.
In other words, $\mathcal{A}_\sim$ witnesses that $\mathcal{L}$ is recognized by a $\mathcal{P}$-sortable DFA. Moreover:
- One immediately notices that the number of states of $\mathcal{A}_\sim$ is equal to the index of $\sim$.
- By equation 4: $\alpha \sim_{\mathcal{A}_\sim} \beta \iff \delta_\sim(s_\sim, \alpha) = \delta_\sim(s_\sim, \beta) \iff [\alpha]_\sim = [\beta]_\sim \iff \alpha \sim \beta$, so $\sim_{\mathcal{A}_\sim}$ and $\sim$ are the same equivalence relation.
Finally, suppose $\mathcal{B}$ is a $\mathcal{P}$-sortable DFA that recognizes $\mathcal{L}$. Notice that by Lemma 41, $\sim_{\mathcal{B}}$ is a $\mathcal{P}$-convex, input-consistent, right-invariant equivalence relation on $\mathrm{Pref}(\mathcal{L})$ of finite index such that $\mathcal{L}$ is the union of some $\sim_{\mathcal{B}}$-classes, so $\mathcal{A}_{\sim_{\mathcal{B}}}$ is well-defined. Call $Q_{\mathcal{B}}$ the set of states of $\mathcal{B}$, and let $\phi : Q_{\sim_{\mathcal{B}}} \to Q_{\mathcal{B}}$ be the function sending $[\alpha]_{\sim_{\mathcal{B}}}$ to the state in $Q_{\mathcal{B}}$ reached by reading $\alpha$.
Notice that $\phi$ is well-defined because, by the definition of $\sim_{\mathcal{B}}$, all strings in $[\alpha]_{\sim_{\mathcal{B}}}$ reach the same state of $\mathcal{B}$. It is easy to check that $\phi$ determines an isomorphism between $\mathcal{A}_{\sim_{\mathcal{B}}}$ and $\mathcal{B}$. ◀
▶ Theorem 24 (Co-lexicographic Myhill-Nerode theorem).
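Let $\mathcal{L}$ be a language. Let $\mathcal{P}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. The following are equivalent:
1. $\mathcal{L}$ is recognized by a $\mathcal{P}$-sortable NFA.
2. $\equiv^c_{\mathcal{L},\mathcal{P}}$ has finite index.
3. $\mathcal{L}$ is the union of some classes of a $\mathcal{P}$-convex, right-invariant equivalence relation on $\mathrm{Pref}(\mathcal{L})$ of finite index.
4. $\mathcal{L}$ is recognized by a $\mathcal{P}$-sortable DFA.
Proof.
(1) → (2) Let $\mathcal{A}$ be a $\mathcal{P}$-sortable NFA recognizing $\mathcal{L}$. Since $\sim_{\mathcal{A}}$ has finite index (Lemma 41) and is a refinement of $\equiv^c_{\mathcal{L},\mathcal{P}}$ (Lemma 42), we conclude that $\equiv^c_{\mathcal{L},\mathcal{P}}$ has finite index.
(2) → (3) The desired equivalence relation is $\equiv^c_{\mathcal{L},\mathcal{P}}$. Indeed, by definition $\equiv^c_{\mathcal{L},\mathcal{P}}$ is $\mathcal{P}$-convex. Since $\equiv_{\mathcal{L}}$ is right-invariant, $\equiv^c_{\mathcal{L},\mathcal{P}}$ is right-invariant by Corollary 33. Similarly one obtains that $\mathcal{L}$ is the union of some $\equiv^c_{\mathcal{L},\mathcal{P}}$-classes.
(3) → (4) Let $\sim$ be a $\mathcal{P}$-convex, right-invariant equivalence relation on $\mathrm{Pref}(\mathcal{L})$ of finite index such that $\mathcal{L}$ is the union of some $\sim$-classes. By Remark 36 we can also assume that $\sim$ is input-consistent. The conclusion follows from Lemma 43.
(4) → (1) Trivial. ◀
▶ Corollary 25.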
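Let $\mathcal{L}$ be a language. Let $\mathcal{P}$ be a partition of $\mathrm{Pref}(\mathcal{L})$. If $\mathcal{L}$ is recognized by some $\mathcal{P}$-sortable DFA, then there exists a $\mathcal{P}$-sortable DFA $\mathcal{A}$ recognizing $\mathcal{L}$ such that all $\mathcal{P}$-sortable DFAs recognizing $\mathcal{L}$ and non-isomorphic to $\mathcal{A}$ have a larger number of states. In other words, $\mathcal{A}$ is the minimum $\mathcal{P}$-sortable DFA recognizing $\mathcal{L}$.
Proof.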
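The minimum automaton is $\mathcal{A}_{\sim^*}$, where $\sim^*$ is the input-consistent refinement of $\equiv^c_{\mathcal{L},\mathcal{P}}$ and where $\mathcal{A}_{\sim^*}$ is built as in Lemma 43. Indeed, the number of states of $\mathcal{A}_{\sim^*}$ is equal to the index of $\sim^*$, or equivalently, of $\sim_{\mathcal{A}_{\sim^*}}$. On the other hand, if $\mathcal{B}$ is any $\mathcal{P}$-sortable DFA recognizing $\mathcal{L}$ that is not isomorphic to $\mathcal{A}_{\sim^*}$, then $\sim_{\mathcal{B}}$ is $\mathcal{P}$-convex, right-invariant and input-consistent by Lemma 41, so it must be a (strict) refinement of $\sim^*$ and we conclude that $\mathcal{B}$ has more states than $\mathcal{A}_{\sim^*}$. ◀◀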
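The construction of $\mathcal{A}_\sim$ in Lemma 43 can be sketched in code for the special case $\sim \, = \, \sim_{\mathcal{A}}$, where $\mathcal{A}$ is a given NFA. In this case a class $[\alpha]_\sim$ is determined by the set $I_\alpha$, so the sketch keys each class by that set; under this identification the procedure amounts to the reachable part of the powerset construction. The toy NFA and the names `DELTA`, `FINAL`, `quotient_dfa`, and `accepts` are assumptions made for illustration only, and the sketch neither checks $\mathcal{P}$-sortability nor computes the co-lexicographic order.

```python
from collections import deque

# Hypothetical toy NFA (an assumption, not taken from the paper).
# Every state can reach a final state, so alpha is in Pref(L) exactly
# when I_alpha is non-empty.
SIGMA = "ab"
START = 0
FINAL = {1, 3}
DELTA = {
    (0, "a"): {1, 2},
    (1, "a"): {1},
    (2, "b"): {3},
    (3, "b"): {3},
}


def step(states, symbol):
    """Set of states reached from `states` by reading one symbol."""
    return frozenset(v for u in states for v in DELTA.get((u, symbol), ()))


def quotient_dfa():
    """Build A_~ for ~ = ~_A, keying each class [alpha] by the set I_alpha."""
    start = frozenset({START})      # s_~ = [epsilon]
    rep = {start: ""}               # one representative string per class
    edges = {}                      # E_~ as a map (class, symbol) -> class
    queue = deque([start])
    while queue:
        cls = queue.popleft()
        for a in SIGMA:
            nxt = step(cls, a)
            if not nxt:             # alpha a is not in Pref(L): no edge
                continue
            if nxt not in rep:
                rep[nxt] = rep[cls] + a
                queue.append(nxt)
            edges[(cls, a)] = nxt   # edge ([alpha], [alpha a], a)
    finals = {cls for cls in rep if cls & FINAL}   # F_~ = {[alpha] : alpha in L}
    return rep, edges, start, finals


def accepts(dfa, word):
    """Run the DFA on `word`; a missing edge means rejection."""
    _, edges, start, finals = dfa
    current = start
    for a in word:
        current = edges.get((current, a))
        if current is None:
            return False
    return current in finals


if __name__ == "__main__":
    dfa = quotient_dfa()
    for cls, alpha in dfa[0].items():
        print(repr(alpha), sorted(cls))
```

The number of keys in `rep` is the index of $\sim_{\mathcal{A}}$ restricted to reachable classes, which matches the first property stated in Lemma 43 for this special case.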