Asymptotic Approximation by Regular Languages
Ryoma Sin’ya
Akita University, Akita, Japan
RIKEN AIP
Abstract
This paper investigates a new property of formal languages called REG-measurability, where REG is the class of regular languages. Intuitively, a language L is REG-measurable if there exists an infinite sequence of regular languages that “converges” to L. A language without REG-measurability has a complex shape in some sense, so that it cannot be (asymptotically) approximated by regular languages. We show that several context-free languages are REG-measurable (including languages with transcendental generating function and transcendental density, in particular), while a certain simple deterministic context-free language and the set of primitive words are REG-immeasurable in a strong sense.
2012 ACM Subject Classification
Theory of computation → Formal languages and automata theory
Keywords and phrases
Automata, context-free languages, density, primitive words
Funding
Ryoma Sin’ya : JSPS KAKENHI Grant Number JP19K14582
© Ryoma Sin’ya; licensed under Creative Commons License CC-BY

Approximating a complex object by simpler objects is a major concept in both computer science and mathematics. In the theory of formal languages, various types of approximation have been investigated (e.g., [13, 14, 9, 6, 5, 7]). For example, Kappes and Kintala [13] introduced convergent-reliability and slender-reliability, which measure how well a given deterministic automaton $\mathcal{A}$ approximates a given language $L$ over an alphabet $A$. Formally, $\mathcal{A}$ is said to accept $L$ with convergent-reliability if the ratio of the number of incorrectly accepted/rejected words of length $n$,

\[ \#((L(\mathcal{A}) \mathbin{\triangle} L) \cap A^n) \,/\, \#(A^n), \]

tends to 0 as $n$ tends to infinity, and is said to accept $L$ with slender-reliability if the number of incorrectly accepted/rejected words of length $n$ is always bounded above by some constant $c$: i.e., $\#((L(\mathcal{A}) \mathbin{\triangle} L) \cap A^n) \le c$ for any $n$. Here $L(\mathcal{A})$ denotes the language accepted by $\mathcal{A}$, $\#(S)$ denotes the cardinality of the set $S$, $\overline{L}$ denotes the complement of $L$ and $\triangle$ denotes the symmetric difference. A slightly modified version of approximation is bounded-$\epsilon$-approximation, which was introduced by Eisman and Ravikumar. They say that two languages $L_1$ and $L_2$ provide a bounded-$\epsilon$-approximation of a language $L$ if $L_1 \subseteq L \subseteq L_2$ holds and the ratio of their length-$n$ difference satisfies

\[ \#((L_2 \setminus L_1) \cap A^n) \,/\, \#(A^n) \le \epsilon \]

for every sufficiently large $n \in \mathbb{N}$. Perhaps surprisingly, they showed that no pair of regular languages can provide a bounded-$\epsilon$-approximation of the language $\{ w \in \{a,b\}^* \mid w \text{ has more } a\text{'s than } b\text{'s} \}$ for any $0 \le \epsilon < 1$. This gives an example of a non-regular language that is inapproximable (by regular languages). Also, there is a different framework of approximation, the so-called minimal-cover [5, 7].

The model of approximation introduced in this paper is rather close to the work of Eisman and Ravikumar [9]. Instead of approximating by a single regular language, we consider an approximation of some non-regular language $L$ by an infinite sequence of regular languages that “converges” to $L$. Intuitively, we say that $L$ is REG-measurable if there exists an infinite sequence of pairs of regular languages $(K_n, M_n)_{n \in \mathbb{N}}$ such that $K_n \subseteq L \subseteq M_n$ holds for all $n$ and the “size” of the difference $M_n \setminus K_n$ tends to 0 as $n$ tends to infinity. The formal definition of “size” is given in the next section: we use a notion called density (of languages) for measuring the “size” of a language.

Although we used the term “approximation” in the title and there is a variety of research on this topic in formal language theory, our work is rather influenced by the work of Buck [4], which investigates, as its title says, the measure theoretic approach to density. In [4] the concept of measure density $\mu$ of subsets of the natural numbers $\mathbb{N}$ was introduced. Roughly speaking, Buck considered an arithmetic progression $X = \{ cn + d \mid n \in \mathbb{N} \}$ (where $c, d \in \mathbb{N}$, and $c$ can be zero) as a “basic set” whose natural density is $\delta(X) = 1/c$ if $c \neq 0$ and $\delta(X) = 0$ otherwise, and then defined the outer measure density $\mu^*(S)$ of any subset $S \subseteq \mathbb{N}$ as

\[ \mu^*(S) = \inf \Big\{ \sum_i \delta(X_i) \;\Big|\; S \subseteq X \text{ and } X \text{ is a finite union of disjoint arithmetic progressions } X_1, \ldots, X_k \Big\}. \]

Then the measure density $\mu(S) = \mu^*(S)$ was introduced for the sets satisfying the condition

\[ \mu^*(S) + \mu^*(\overline{S}) = 1 \qquad (1) \]

where $\overline{S} = \mathbb{N} \setminus S$.
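As a small illustration of the quantity being minimised in the definition of $\mu^*$, the following Python sketch computes the density assigned to a finite union of pairwise disjoint arithmetic progressions and compares it with a direct count below a bound. The function names and the representation of a progression as a pair `(c, d)` are assumptions made for this example only; disjointness of the progressions is assumed, not checked.

```python
from fractions import Fraction

def progression_density(c: int) -> Fraction:
    """Natural density of {c*n + d | n in N}: 1/c if c != 0, and 0 otherwise."""
    return Fraction(1, c) if c != 0 else Fraction(0)

def union_density(progressions) -> Fraction:
    """Sum of the densities of progressions given as (c, d) pairs.

    The pairs are assumed to describe pairwise disjoint progressions.
    """
    return sum((progression_density(c) for c, _ in progressions), Fraction(0))

def empirical_density(progressions, bound: int) -> Fraction:
    """Fraction of the numbers 0, ..., bound - 1 lying in the union."""
    members = set()
    for c, d in progressions:
        if c == 0:
            if d < bound:
                members.add(d)          # the singleton {d}
        else:
            members.update(range(d, bound, c))
    return Fraction(len(members), bound)

# 4n + 1 is always odd and 6n is always even, so the two sets are disjoint.
cover = [(4, 1), (6, 0)]
print(union_density(cover))             # 5/12
print(empirical_density(cover, 1200))   # 5/12
```

The bound 1200 is a common multiple of 4 and 6, which is why the empirical ratio agrees exactly with the sum of densities here; for other bounds the two values only agree in the limit.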
Technically speaking, the class $\mathcal{D}_\mu$ of all subsets of natural numbers satisfying Condition (1) is the Carathéodory extension of the class

\[ \mathcal{D}_0 \stackrel{\mathrm{def}}{=} \{ X \subseteq \mathbb{N} \mid X \text{ is a finite union of arithmetic progressions} \}, \]

see Section 2 of [4] for more details. Notice that, since we regard a singleton $\{d\}$ as an arithmetic progression (the case $c = 0$ for $\{ cn + d \mid n \in \mathbb{N} \}$), any finite set belongs to $\mathcal{D}_0$. Buck investigated several properties of $\mu$ and $\mathcal{D}_\mu$, and showed that $\mathcal{D}_\mu$ properly contains $\mathcal{D}_0$.

In the setting of formal languages, it is very natural to consider the class REG of regular languages as “basic sets”, since it has various types of representation, good closure properties and rich decidable properties. Moreover, if we consider the regular languages $\mathrm{REG}_A$ over a unary alphabet $A = \{a\}$, then $\mathrm{REG}_A$ is isomorphic to the class $\mathcal{D}_0$; it is well known that the Parikh image $\{ |w| \mid w \in L \} \subseteq \mathbb{N}$ (where $|w|$ denotes the length of $w$) of every regular language $L$ in $\mathrm{REG}_A$ is semilinear, and hence it is just a finite union of arithmetic progressions. From this observation, investigating the densities of regular languages and the resulting measure densities (i.e., REG-measurability) of non-regular languages can naturally be considered as an adaptation of Buck’s study [4] to formal language theory.

Our contribution
In this paper we investigate REG-measurability (roughly, asymptotic approximability by regular languages) of non-regular, mainly context-free languages. The main results are of three kinds. We show that: (1) several context-free languages (including languages with transcendental generating function and transcendental density) are REG-measurable [Theorems 23–30]; (2) there are “very large/very small” (deterministic) context-free languages that are REG-immeasurable in a strong sense [Theorem 36]; (3) the set of primitive words is “very large” and REG-immeasurable in a strong sense [Theorems 37–38]. Open problems and a possible application of the notion of measurability to classifying formal languages will be stated in Section 6.

The paper is organised as follows. Section 2 provides the mathematical background on densities of formal languages. The formal definitions of REG-approximability and REG-measurability are introduced in Section 3. The scenario of Section 3 mostly follows that of the measure density introduced by Buck [4], which was described above. In Section 4, we give several examples of REG-inapproximable but REG-measurable context-free languages. These examples include, perhaps somewhat surprisingly, a language with a transcendental density, which has been considered a very complex context-free language from a combinatorial viewpoint. In Section 5, we consider the set of so-called primitive words and its REG-measurability. Section 6 ends this paper with concluding remarks, some future work and open problems. We assume that the reader has a basic knowledge of formal language theory.

For a set $S$, we write $\#(S)$ for the cardinality of $S$. The set of natural numbers including 0 is denoted by $\mathbb{N}$. For an alphabet $A$, we denote the set of all words (resp. all non-empty words) over $A$ by $A^*$ (resp. $A^+$). We write $\varepsilon$ for the empty word and write $A^n$ (resp. $A^{<n}$) for the set of all words of length $n$ (resp. of length less than $n$) over $A$.

▶ Definition 1. Let $L \subseteq A^*$ be a language.
The natural density $\delta_A(L)$ of $L$ is defined as

\[ \delta_A(L) \stackrel{\mathrm{def}}{=} \lim_{n \to \infty} \frac{\#(L \cap A^n)}{\#(A^n)} \]

if the limit exists; otherwise we write $\delta_A(L) = \bot$ and say that $L$ does not have a natural density. The density $\delta^*_A(L)$ of $L$ is defined as

\[ \delta^*_A(L) \stackrel{\mathrm{def}}{=} \lim_{n \to \infty} \frac{1}{n} \sum_{k=0}^{n-1} \frac{\#(L \cap A^k)}{\#(A^k)} \]

if it exists; otherwise we write $\delta^*_A(L) = \bot$ and say that $L$ does not have a density. A language $L \subseteq A^*$ is called null if $\delta^*_A(L) = 0$, and conversely $L$ is called co-null if $\delta^*_A(L) = 1$.

▶ Remark 2. Notice that if $L$ has a natural density (i.e., $\delta_A(L) \neq \bot$), then it also has a density and $\delta^*_A(L) = \delta_A(L)$ holds. But the converse is not true in general, e.g., the case $L = (AA)^*$ (see Example 4 below).

The following observation is basic.

▷ Claim 3. Let $K, L \subseteq A^*$ with $\delta^*_A(K) = \alpha$, $\delta^*_A(L) = \beta$. Then we have:
1. $\alpha \le \beta$ if $K \subseteq L$.
2. $\delta^*_A(L \setminus K) = \beta - \alpha$ if $K \subseteq L$.
3. $\delta^*_A(\overline{K}) = 1 - \alpha$.
4. $\delta^*_A(K \cup L) \le \alpha + \beta$ if $\delta^*_A(K \cup L) \neq \bot$.
5. $\delta^*_A(K \cup L) = \alpha + \beta$ if $K \cap L = \emptyset$.

For more properties of $\delta^*_A$, see Chapter 13 of [3].

▶ Example 4. Here we enumerate a few examples of densities of languages.
- The set of all words $A^*$ clearly satisfies $\delta_A(A^*) = 1$, and its complement $\emptyset$ satisfies $\delta_A(\emptyset) = 0$. It is also clear that every finite language is null.
- For the set $\{a\}A^*$ of all words starting with $a \in A$, we have $\#(\{a\}A^* \cap A^n)/\#(A^n) = \#(aA^{n-1})/\#(A^n) = 1/\#(A)$. Hence $\delta_A(\{a\}A^*) = 1/\#(A)$.
- Consider $(AA)^*$, the set of all words of even length. Because
  \[ \frac{\#((AA)^* \cap A^n)}{\#(A^n)} = \begin{cases} 1 & \text{if } n \text{ is even,} \\ 0 & \text{if } n \text{ is odd} \end{cases} \]
  holds, its limit does not exist and thus $(AA)^*$ does not have a natural density: $\delta_A((AA)^*) = \bot$. However, it has a density $\delta^*_A((AA)^*) = 1/2$.
- The semi-Dyck language $D \stackrel{\mathrm{def}}{=} \{ w \in \{a,b\}^* \mid |w|_a = |w|_b \text{ and } |u|_a \ge |u|_b \text{ for every prefix } u \text{ of } w \}$ is non-regular but context-free.
  It is well known that the number of words in $D$ of length $2n$ is equal to the $n$-th Catalan number, whose asymptotic approximation is $\Theta(4^n / n^{3/2})$. Thus
  \[ \frac{\#(D \cap A^n)}{\#(A^n)} = \begin{cases} \Theta(1/(n/2)^{3/2}) & \text{if } n \text{ is even,} \\ 0 & \text{if } n \text{ is odd} \end{cases} \]
  and we have $\delta_A(D) = 0$, i.e., $D$ is null.

Example 4 shows us that for some regular languages $L$ the natural density is either zero or one; for some, like $L = \{a\}A^*$ (for $\#(A) \ge 2$), $\delta_A(L)$ can be a real number strictly between zero and one; and for some, like $L = (AA)^*$, a natural density may not even exist. However, the following theorem tells us that all regular languages do have densities.

▶ Theorem 5 (cf. Theorem III.6.1 of [19]). Let $L \subseteq A^*$ be a regular language. Then there is a positive integer $c$ such that for all natural numbers $d < c$, the following limit exists

\[ \lim_{n \to \infty} \frac{\#(L \cap A^{cn+d})}{\#(A^{cn+d})} \]

and it is always rational, i.e., the sequence $(\#(L \cap A^n)/\#(A^n))_{n \in \mathbb{N}}$ has only finitely many accumulation points and these are rational and periodic.

▶ Corollary 6. Every regular language has a density and it is rational.

▶ Corollary 7. For any regular language $L \subseteq A^*$, $\delta_A(L) = 0$ if and only if $\delta^*_A(L) = 0$.

Furthermore, for unambiguous context-free languages, the following holds.

▶ Theorem 8 (Berstel [2]). For any unambiguous context-free language $L$ over $A$, its density $\delta^*_A(L)$, if it exists (i.e., $\delta^*_A(L) \neq \bot$), is always algebraic.

In the next section we will introduce a language with a transcendental density, which must be inherently ambiguous due to Theorem 8.

We conclude the section by introducing the notion called dense: a property about some topological “largeness” of a language (cf. Chapter 2.5 of [3]).

▶ Definition 9. A language $L \subseteq A^*$ is said to be dense if the set of all factors of $L$ is equal to $A^*$. We say that a word $w \in A^*$ is a forbidden word (resp. forbidden prefix) of $L$ if $L \cap A^* w A^* = \emptyset$ (resp. $L \cap wA^* = \emptyset$).

Observe that $L \subseteq A^*$ is dense if and only if no word is a forbidden word of $L$. The next theorem ties together two different notions of “largeness” of languages in the regular case.

▶ Theorem 10 (S. [20]). A regular language is non-null if and only if it is dense.

The “only if” part of Theorem 10 is nothing but the well-known so-called infinite monkey theorem (which states that if $L$ is not dense then $L$ is null), and this part is true for any (possibly non-regular) language. But we stress that the “if” part is not true beyond regular languages; for example, the semi-Dyck language $D$ is null but dense (which will be described in Proposition 12). We denote by $\mathrm{REG}^+$ the family of non-null regular languages, which is equivalent to the family of regular languages with positive densities thanks to Corollary 6.

Although we will mainly consider REG-measurability of non-regular languages in this paper, here we define the two notions of approximability and measurability in a general setting, with a few concrete examples.

▶ Definition 11. Let $\mathcal{C}, \mathcal{D}$ be classes of languages. A language $L$ is said to be $(\mathcal{C}, \epsilon)$-lower-approximable if there exists $K \in \mathcal{C}$ such that $K \subseteq L$ and $\delta^*_{\mathrm{Alph}(L)}(L \setminus K) \le \epsilon$. A language $L$ is said to be $(\mathcal{C}, \epsilon)$-upper-approximable if there exists $M \in \mathcal{C}$ such that $L \subseteq M$ and $\delta^*_{\mathrm{Alph}(M)}(M \setminus L) \le \epsilon$. A language $L$ is said to be $\mathcal{C}$-approximable ($\mathcal{C}$-approx. for short) if $L$ is both $(\mathcal{C}, 0)$-lower-approximable and $(\mathcal{C}, 0)$-upper-approximable. $\mathcal{D}$ is said to be $\mathcal{C}$-approx. if every language in $\mathcal{D}$ is $\mathcal{C}$-approx.

The following proposition gives a simple REG-inapproximable example.

▶ Proposition 12. The semi-Dyck language $D$ is REG-inapprox.

Proof. We already mentioned that $D$ is null in Example 4, and thus $D$ is $(\mathrm{REG}, 0)$-lower-approximable since $\emptyset \subseteq D$. One can easily observe that $D$ has no forbidden word, since for any $w \in A^*$ there exists a pair of natural numbers $(n, m) \in \mathbb{N}^2$ such that $a^n w b^m \in D$. Hence if a regular language $L$ satisfies $D \subseteq L$, then $L$ has no forbidden word either, and thus $L$ is non-null by Theorem 10.
Thus by Claim 3, $\delta^*_A(L \setminus D) = \delta^*_A(L) - \delta^*_A(D) = \delta^*_A(L) > 0$, which means that $D$ cannot be $(\mathrm{REG}, 0)$-upper-approximable. ◀

The proof of Proposition 12 only depends on the non-existence of forbidden words, hence we can apply the same proof to the next theorem.

▶ Theorem 13. Any null language having no forbidden word is $(\mathrm{REG}, 0)$-upper-inapprox.

Because $D$ is deterministic context-free, in our terms we have:

▶ Corollary 14. DetCFL is REG-inapprox.

Furthermore, by combining Theorem 8 with the next theorem, we will see that there exists a context-free language which cannot be approximated by any unambiguous context-free language.

▶ Theorem 15 (Kemp [15]). Let $A = \{a, b, c\}$. Define

\[ S_1 \stackrel{\mathrm{def}}{=} \{a\}\{ b^i a^i \mid i \ge 1 \}^*, \qquad S_2 \stackrel{\mathrm{def}}{=} \{ a^i b^{2i} \mid i \ge 1 \}^* \{a\}^+, \]

and

\[ L_1 \stackrel{\mathrm{def}}{=} S_1 \{c\} A^*, \qquad L_2 \stackrel{\mathrm{def}}{=} S_2 \{c\} A^*. \]

Then $K \stackrel{\mathrm{def}}{=} L_1 \cup L_2$ is a context-free language with a transcendental natural density $\delta_A(K)$.

▶ Corollary 16. CFL is UnCFL-inapprox.

We then introduce the notion of $\mathcal{C}$-measurability, which is a formal language theoretic analogue of Buck’s measure density [4].

▶ Definition 17. Let $\mathcal{C}, \mathcal{D}$ be classes of languages. For a language $L$, we define its $\mathcal{C}$-lower-density as

\[ \underline{\mu}_{\mathcal{C}}(L) \stackrel{\mathrm{def}}{=} \sup \{ \delta^*_A(K) \mid A = \mathrm{Alph}(L),\ K \subseteq L,\ K \in \mathcal{C}_A,\ \delta^*_A(K) \neq \bot \} \]

and its $\mathcal{C}$-upper-density as

\[ \overline{\mu}_{\mathcal{C}}(L) \stackrel{\mathrm{def}}{=} \inf \{ \delta^*_A(K) \mid A = \mathrm{Alph}(L),\ L \subseteq K,\ K \in \mathcal{C}_A,\ \delta^*_A(K) \neq \bot \}. \]

A language $L$ is said to be $\mathcal{C}$-measurable if $\underline{\mu}_{\mathcal{C}}(L) = \overline{\mu}_{\mathcal{C}}(L)$ holds, and in this case we simply write $\mu_{\mathcal{C}}(L)$ for $\underline{\mu}_{\mathcal{C}}(L)$. $\mathcal{D}$ is said to be $\mathcal{C}$-measurable if every language in $\mathcal{D}$ is $\mathcal{C}$-measurable.

▶ Definition 18. We call $\overline{\mu}_{\mathcal{C}}(L) - \underline{\mu}_{\mathcal{C}}(L)$ the $\mathcal{C}$-gap of a language $L$. We say that a language $L$ has full $\mathcal{C}$-gap if its $\mathcal{C}$-gap equals 1, i.e., $\overline{\mu}_{\mathcal{C}}(L) - \underline{\mu}_{\mathcal{C}}(L) = 1$.

In the next section, we describe several examples of both REG-measurable and REG-immeasurable languages.
The REG-gap could be a good measure of how complex the shape of a given language is from the viewpoint of regular languages.

The following lemmata are basic.

▶ Lemma 19. Let $K, L$ be two languages.
1. $\overline{\mu}_{\mathcal{C}}(K) \le \overline{\mu}_{\mathcal{C}}(L)$ if $K \subseteq L$.
2. $\overline{\mu}_{\mathcal{C}}(K \cup L) \le \overline{\mu}_{\mathcal{C}}(K) + \overline{\mu}_{\mathcal{C}}(L)$ if $\mathcal{C}$ is closed under union.
3. $\mu_{\mathcal{C}}(K) = \delta^*_A(K)$ if $K \in \mathcal{C}$ and $\delta^*_A(K) \neq \bot$.

▶ Lemma 20. Let $\mathcal{C}$ be a language class such that $\mathcal{C}$ is closed under complement and every language in $\mathcal{C}$ has a density. A language $L \subseteq A^*$ is $\mathcal{C}$-measurable if and only if

\[ \overline{\mu}_{\mathcal{C}}(L) + \overline{\mu}_{\mathcal{C}}(\overline{L}) = 1. \qquad (2) \]

Proof. Let $L$ be a language and $A = \mathrm{Alph}(L)$. By definition, $L$ satisfies Condition (2) if and only if

\[ \inf \{ \delta^*_A(K) \mid L \subseteq K,\ K \in \mathcal{C} \} = 1 - \inf \{ \delta^*_A(K) \mid \overline{L} \subseteq K,\ K \in \mathcal{C} \} \qquad (3) \]

holds. On the other hand, $L$ is measurable if and only if

\[ \inf \{ \delta^*_A(K) \mid L \subseteq K,\ K \in \mathcal{C} \} = \sup \{ \delta^*_A(K) \mid K \subseteq L,\ K \in \mathcal{C} \}. \qquad (4) \]

For any language $K \in \mathcal{C}_A$ such that $K \subseteq L$ and $\delta^*_A(K) \neq \bot$, its complement $\overline{K}$ satisfies $\overline{L} \subseteq \overline{K}$ and $\delta^*_A(\overline{K}) = 1 - \delta^*_A(K)$. This means that if $\mathcal{C}_A$ is closed under complement then $\sup \{ \delta^*_A(K) \mid K \subseteq L,\ K \in \mathcal{C}_A \} = 1 - \inf \{ \delta^*_A(K) \mid \overline{L} \subseteq K,\ K \in \mathcal{C}_A \}$ holds, which immediately implies the equivalence of Condition (3) and Condition (4). ◀

REG-measurability on Context-free Languages

In this section we examine REG-measurability of several types of context-free languages. The first type of languages (Section 4.1) is null context-free languages. Although some null languages can have a full REG-gap, as stated in the next theorem, we will show that typical null context-free languages are REG-measurable.

▶ Theorem 21. There is a recursive language $L$ which is null but $\overline{\mu}_{\mathrm{REG}}(L) = 1$.

Proof. Let $A$ be an alphabet with $\#(A) \ge 2$, and let $(\mathcal{A}_i)_{i \in \mathbb{N}}$ be an enumeration of automata over $A$ such that $\mathrm{REG}_A = \{ L(\mathcal{A}_i) \mid i \in \mathbb{N} \}$; we can take such an enumeration by enumerating some binary representation of automata in shortlex order $<_{\mathrm{lex}}$.
We will construct a null language $L$ such that $\overline{\mu}_{\mathrm{REG}}(L) = 1$; in particular, $L$ intersects every infinite regular language.

Consider the following program $P$ which takes an input word $w$:
Step 1: set $i = 0$ and $\ell = 0$.
Step 2: check whether $L(\mathcal{A}_i)$ is infinite or not.
Step 3: if $L(\mathcal{A}_i)$ is finite, then set $i = i + 1$ and go back to Step 2.
Step 4: otherwise, pick $u$ such that $u$ is the smallest (with respect to $<_{\mathrm{lex}}$) word satisfying $|u| > \ell$ and $u \in L(\mathcal{A}_i)$ (such $u$ surely exists since $L(\mathcal{A}_i)$ is infinite).
Step 5: if $w = u$ then $P$ accepts $w$ and halts.
Step 6: if $w <_{\mathrm{lex}} u$ then $P$ rejects $w$ and halts.
Step 7: if $u <_{\mathrm{lex}} w$ then set $\ell = |u|$, $i = i + 1$ and go back to Step 2.

One can easily observe that all steps are effective and that $P$ ultimately halts for any input word $w$, because the length of the word $u$ in Step 4 is strictly increasing until $u = w$ or $w <_{\mathrm{lex}} u$. Thus the language $L \stackrel{\mathrm{def}}{=} \{ w \in A^* \mid P \text{ accepts } w \}$ is recursive, and it satisfies: (1) $L \cap R \neq \emptyset$ for any infinite regular language $R$, because by Steps 4–5 $P$ accepts some word $w \in R$; and (2) $\delta_A(L) = 0$, because by Steps 5–6 and the fact that the length of $u$ is strictly increasing, for each $n$, $P$ rejects every word in $A^n$ except for at most one word $u$. By (1), the complement of any regular superset of $L$ is a regular language disjoint from $L$ and hence finite, so every regular superset of $L$ is co-null. Thus $\delta_A(L) = 0$ and $\overline{\mu}_{\mathrm{REG}}(L) = 1$. ◀

The second type of languages (Section 4.2) is inherently ambiguous languages, and the third type of languages (Section 4.3) includes Kemp’s language $K$, whose density is transcendental. The last type of languages (Section 4.4) is languages with full REG-gap, i.e., strongly REG-immeasurable languages.

First we consider the following language with constraints on the number of occurrences of letters, which is a very typical example of a non-regular but context-free language.

▶ Definition 22. For an alphabet $A$ and letters $a, b \in A$ such that $a \neq b$, we define

\[ L_A(a, b) \stackrel{\mathrm{def}}{=} \{ w \in A^* \mid |w|_a = |w|_b \}. \]

▶ Theorem 23. $L_A(a, b)$ is REG-measurable where $A = \{a, b\}$.

Proof. It is enough to show that the complement $L = \overline{L_A(a, b)}$ satisfies $\underline{\mu}_{\mathrm{REG}}(L) = 1$. For each $k \ge 1$, we define

\[ L_k \stackrel{\mathrm{def}}{=} \{ w \in A^* \mid |w|_a \not\equiv |w|_b \pmod{k} \}. \]

Figure 1: The deterministic automaton $\mathcal{A}_3$ in the proof of Theorem 23. Here, the state $q_0$ having an unlabelled incoming arrow is initial and the states $q_1, q_2$ having unlabelled outgoing arrows are final.

Clearly, $L_k \subseteq L$ holds. Each $L_k$ is recognised by a $k$-state deterministic automaton

\[ \mathcal{A}_k = (Q_k = \{ q_0, \ldots, q_{k-1} \},\ \Delta_k : Q_k \times A \to Q_k,\ q_0,\ Q_k \setminus \{q_0\}) \]

where

\[ \Delta_k(q_i, a) = q_{i+1 \bmod k}, \qquad \Delta_k(q_i, b) = q_{i-1 \bmod k} \qquad (\text{for each } i \in \{0, \ldots, k-1\}), \]

$q_0$ is the initial state, and any other state $q \in Q_k \setminus \{q_0\}$ is a final state (the case $k = 3$ is depicted in Fig. 1). The adjacency matrix of $\mathcal{A}_k$ is

\[ M_k = \begin{pmatrix}
0 & 1 & 0 & \cdots & \cdots & 1 \\
1 & 0 & 1 & \ddots & & \vdots \\
0 & 1 & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 1 & 0 \\
\vdots & & \ddots & 1 & 0 & 1 \\
1 & \cdots & \cdots & 0 & 1 & 0
\end{pmatrix} = E_k + E_k^{k-1}
\quad\text{where}\quad
E_k = \begin{pmatrix}
0 & 0 & 0 & \cdots & \cdots & 1 \\
1 & 0 & 0 & \ddots & & \vdots \\
0 & 1 & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 0 & 0 \\
\vdots & & \ddots & 1 & 0 & 0 \\
0 & \cdots & \cdots & 0 & 1 & 0
\end{pmatrix}. \]

$M_k$ is a special case of a circulant matrix. A $k$-dimensional circulant matrix $C_k$ is a matrix that can be represented by a polynomial in $E_k$:

\[ C_k = p(E_k) = \sum_{n=0}^{k-1} c_n E_k^n, \]

and it is well known that $C_k$ can be diagonalised as follows, for the $k$-th root of unity $\xi_k = e^{-2\pi i / k}$ (where $i$ is the imaginary unit):

\[ \frac{1}{\sqrt{k}} F_k^H \cdot C_k \cdot \frac{1}{\sqrt{k}} F_k = \mathrm{diag}\big(p(1),\, p(\xi_k^{-1}),\, p(\xi_k^{-2}),\, \ldots,\, p(\xi_k^{-(k-1)})\big) \]

where $F_k = (f_{n,m})$ with $f_{n,m} = \xi_k^{(n-1)(m-1)}$ (for $1 \le n, m \le k$) is the $k$-dimensional Fourier matrix, $F_k^H$ is its Hermitian transpose and $\mathrm{diag}(\lambda_1, \ldots, \lambda_k)$ is the diagonal matrix whose $n$-th diagonal element is $\lambda_n$ (for $1 \le n \le k$) (cf. Section 5.2.1 of [16]). Hence, in the case of $M_k = p_{\mathcal{A}_k}(E_k) = E_k + E_k^{k-1}$, we have

\[ \frac{1}{\sqrt{k}} F_k^H \cdot M_k \cdot \frac{1}{\sqrt{k}} F_k = \mathrm{diag}\big(2,\, \xi_k^{-1} + \xi_k,\, \xi_k^{-2} + \xi_k^{2},\, \ldots,\, \xi_k^{-(k-1)} + \xi_k^{k-1}\big) \qquad (5) \]

because, for any $n \ge 0$, $p_{\mathcal{A}_k}(\xi_k^{-n}) = \xi_k^{-n} + \xi_k^{-n(k-1)} = \xi_k^{-n} + \xi_k^{n}$ holds.

Let $\Lambda_k = \mathrm{diag}(2,\, \xi_k^{-1} + \xi_k,\, \xi_k^{-2} + \xi_k^{2},\, \ldots,\, \xi_k^{-(k-1)} + \xi_k^{k-1})$.
Because $\mathcal{A}_k$ is deterministic and the final states are all states but $q_0$, the number of words of length $n$ in $L_k$ is exactly the number of paths of length $n$ from $q_0$ to any other state in $\mathcal{A}_k$. For the $k$-dimensional vectors $e_1 = (1, 0, 0, \ldots, 0)$ and $\mathbf{1} = (1, 1, 1, \ldots, 1)$, we have

\[
\begin{aligned}
\#(L_k \cap A^n) &= e_1 \cdot M_k^n \cdot (\mathbf{1} - e_1)^T = \frac{1}{k}\, e_1 \cdot F_k \cdot \Lambda_k^n \cdot F_k^H (\mathbf{1} - e_1)^T \\
&= \frac{1}{k}\, \mathbf{1} \cdot \Lambda_k^n \cdot \Big( k-1,\ \sum_{j=1}^{k-1} \xi_k^{-j},\ \sum_{j=1}^{k-1} \xi_k^{-2j},\ \ldots,\ \sum_{j=1}^{k-1} \xi_k^{-(k-1)j} \Big)^T \\
&= \frac{1}{k} \Big( 2^n (k-1) + (\xi_k^{-1} + \xi_k)^n \sum_{j=1}^{k-1} \xi_k^{-j} + \cdots + (\xi_k^{-(k-1)} + \xi_k^{k-1})^n \sum_{j=1}^{k-1} \xi_k^{-(k-1)j} \Big). \qquad (6)
\end{aligned}
\]

If $k$ is odd, say $k = 2m + 1$, then for any $1 \le j \le k - 1$, $\xi_k^{-j} + \xi_k^{j}$ is a real number whose absolute value is strictly smaller than 2, because $\xi_k^{-j}$ is the complex conjugate of $\xi_k^{j}$ and hence $|\xi_k^{-j} + \xi_k^{j}| = |2\,\mathrm{Re}(\xi_k^{j})| < 2$. Hence from Equation (6) we can deduce that

\[ \#(L_k \cap A^n) = \frac{k-1}{k}\, 2^n + o(2^n) \]

where $o(2^n)$ means some function such that $\lim_{n \to \infty} o(2^n)/2^n = 0$. Thus we have $\delta_A(L_k) = \frac{k-1}{k}$ for odd $k = 2m + 1$, which tends to 1 as $k$ tends to infinity, i.e., $\underline{\mu}_{\mathrm{REG}}(L) = 1$. This completes the proof. ◀

By Theorem 23, it is also true that any subset of $L_{\{a,b\}}(a, b)$ is REG-measurable. In particular, we have:

▶ Corollary 24. The semi-Dyck language $D \subseteq L_{\{a,b\}}(a, b)$ is REG-measurable.

The next example is the set of all palindromes.

▶ Theorem 25. $P_A \stackrel{\mathrm{def}}{=} \{ w \in A^* \mid w = \mathrm{rev}(w) \}$ is REG-measurable.

Proof. Because the case $\#(A) = 1$ is trivial ($P_A = A^*$), we assume that $\#(A) \ge 2$. It is enough to show that the complement $\overline{P_A}$ is REG-measurable. For each $k \ge 1$, we define

\[ L_k \stackrel{\mathrm{def}}{=} \{ w_1 A^* w_2 \mid w_1, w_2 \in A^k,\ w_1 \neq \mathrm{rev}(w_2) \}. \]

One can easily observe that $L_k \subseteq \overline{P_A}$ for each $k \ge 1$. Moreover, for any $n > 2k$, the number of words in $L_k$ of length $n$ is

\[ \#(L_k \cap A^n) = \#(A)^k \cdot \#(A)^{n-2k} \cdot (\#(A)^k - 1) = \#(A)^n - \#(A)^{n-k}. \]

From this we can conclude that $\delta_A(L_k) = 1 - \#(A)^{-k}$, and it tends to 1 as $k$ tends to infinity. Thus we have $\underline{\mu}_{\mathrm{REG}}(\overline{P_A}) = 1$. ◀
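The two approximating sequences used in the proofs of Theorems 23 and 25 can be checked numerically. The Python sketch below is only an illustration: the function names are ours, the first function counts words with a simple dynamic programme over the states $q_0, \ldots, q_{k-1}$ of $\mathcal{A}_k$ instead of the circulant diagonalisation used in the proof, and the second one enumerates words by brute force over an alphabet encoded as `range(size)`.

```python
from fractions import Fraction
from itertools import product

def mod_k_ratio(k: int, n: int) -> Fraction:
    """#(L_k ∩ {a,b}^n) / 2^n for L_k = { w : |w|_a is not congruent to |w|_b mod k }.

    counts[i] is the number of words read so far that lead from q_0 to q_i,
    where reading a moves to q_{i+1 mod k} and reading b moves to q_{i-1 mod k}.
    """
    counts = [0] * k
    counts[0] = 1                                   # the empty word stays in q_0
    for _ in range(n):
        counts = [counts[(i - 1) % k] + counts[(i + 1) % k] for i in range(k)]
    return Fraction(sum(counts[1:]), 2 ** n)        # every state except q_0 is final

def non_palindrome_ratio(size: int, k: int, n: int) -> Fraction:
    """#(L_k ∩ A^n) / #(A)^n for L_k = { w1 A* w2 : |w1| = |w2| = k, w1 != rev(w2) }."""
    if n < 2 * k:
        raise ValueError("the counting argument needs n >= 2k")
    hits = 0
    for w in product(range(size), repeat=n):
        if w[:k] != w[n - k:][::-1]:                # prefix differs from reversed suffix
            hits += 1
    return Fraction(hits, size ** n)

print(mod_k_ratio(3, 4))                 # 5/8
print(float(mod_k_ratio(5, 200)))        # close to (k - 1)/k = 0.8 for odd k = 5
print(non_palindrome_ratio(2, 2, 6))     # 3/4, i.e. 1 - 2**(-2)
```

For even $k$ the ratio returned by `mod_k_ratio` oscillates with the parity of $n$ (for instance, it alternates between 0 and 1 when $k = 2$), which is consistent with the proof restricting attention to odd $k$.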
There are REG-measurable inherently ambiguous context-free languages. Since every bounded language $L \subseteq w_1^* \cdots w_k^*$ is trivially REG-measurable ($\mu_{\mathrm{REG}}(L) = 0$), a typical example of an inherently ambiguous context-free language, $\{ a^i b^j c^k \mid i = j \text{ or } i = k \}$, is REG-measurable.

Some more complex examples of inherently ambiguous languages are the following languages with constraints on the number of occurrences of letters, investigated by Flajolet [11]:

\[ O_3 \stackrel{\mathrm{def}}{=} \{ w \in \{a, b, c\}^* \mid |w|_a = |w|_b \text{ or } |w|_a = |w|_c \}, \]
\[ O_4 \stackrel{\mathrm{def}}{=} \{ w \in \{x, \bar{x}, y, \bar{y}\}^* \mid |w|_x = |w|_{\bar{x}} \text{ or } |w|_y = |w|_{\bar{y}} \}. \]

▶ Theorem 26. $O_3$ and $O_4$ are REG-measurable.

Proof. Let $A = \{a, b, c\}$. For the case of $O_3$, in a very similar way to Theorem 23, we can construct a sequence of automata $(\mathcal{A}^{ab}_k)_{k \in \mathbb{N}}$ such that each automaton $\mathcal{A}^{ab}_k$ satisfies $L(\mathcal{A}^{ab}_k) \subseteq \overline{L_A(a, b)}$ and its adjacency matrix is of the form

\[ M^{ab}_k = M_k + I_k = \begin{pmatrix}
1 & 1 & 0 & \cdots & \cdots & 1 \\
1 & 1 & 1 & \ddots & & \vdots \\
0 & 1 & \ddots & \ddots & \ddots & \vdots \\
\vdots & \ddots & \ddots & \ddots & 1 & 0 \\
\vdots & & \ddots & 1 & 1 & 1 \\
1 & \cdots & \cdots & 0 & 1 & 1
\end{pmatrix} \]

where $M_k$ is the adjacency matrix stated in Theorem 23 and $I_k$ is the $k$-dimensional identity matrix. The automaton $\mathcal{A}^{ab}_k$ is obtained by just adding a self-loop labelled by $c$ to each state $q \in Q_k$ of $\mathcal{A}_k$ in Theorem 23. This sequence of automata ensures that the language $L_A(a, b)$ is REG-measurable ($\mu_{\mathrm{REG}}(L_A(a, b)) = 0$, in particular). The same argument is applicable to the language $L_A(a, c)$, thus their union $O_3 = L_A(a, b) \cup L_A(a, c)$ is also REG-measurable by Lemma 19. The case of $O_4$ can be handled in the same manner. ◀

Next we consider the so-called Goldstine language

\[ G \stackrel{\mathrm{def}}{=} \{ a^{n_1} b a^{n_2} b \cdots a^{n_p} b \mid p \ge 1,\ n_i \neq i \text{ for some } i \}. \]

While $G$ can be accepted by a non-deterministic pushdown automaton, its generating function is not algebraic [12] and thus it is an inherently ambiguous context-free language due to the Chomsky–Schützenberger theorem.

▶ Theorem 27. $G$ is REG-measurable.

Proof.
Let $A = \{a, b\}$. Observe that $G \subseteq A^* b$ and $\overline{\mu}_{\mathrm{REG}}(G) \le \delta_A(A^* b) = 1/2$. Let $L_G = \{ u \in A^* \mid uA^*\{b\} \cap \overline{G} = \emptyset \}$ be the set of all forbidden prefixes of the complement $\overline{G}$. For each $k \ge 1$, we define

\[ L_k \stackrel{\mathrm{def}}{=} \bigcup \{ uA^*\{b\} \mid u \in L_G \cap A^k \}. \]

If a word $u$ is in $L_G$, then by the definition of $L_G$, $uvb$ is always in $G$ for any word $v$, thus $L_k \subseteq G$ holds for each $k$. Any word in $\overline{L_G} = A^* \setminus L_G$ is a prefix of the infinite word $a^{n_1} b a^{n_2} b a^{n_3} b \cdots$ ($n_i = i$ for each $i \in \mathbb{N}$), thus $\#(L_G \cap A^n) = \#(A^n) - 1$ holds for every $n$, and

\[ \delta_A(L_k) = \lim_{n \to \infty} \frac{\#(L_k \cap A^n)}{\#(A^n)} = \lim_{n \to \infty} \frac{(\#(A^k) - 1) \cdot \#(A^{n-k-1})}{\#(A^n)} = (\#(A)^k - 1) \cdot \#(A)^{-k-1} = 2^{-1} - 2^{-k-1}. \]

This implies that $\delta_A(L_k)$ tends to $1/2$ as $k$ tends to infinity. Thus $\mu_{\mathrm{REG}}(G) = 1/2$. ◀

In general, for an infinite word $w \in A^\omega$, the set

\[ \mathrm{Copref}(w) \stackrel{\mathrm{def}}{=} A^* \setminus \{ u \in A^* \mid u \text{ is a prefix of } w \} \]

is called the coprefix language of $w$. The proof of Theorem 27 uses the key property that $G$ can be characterised by using the coprefix language of the infinite word $w = a^{n_1} b a^{n_2} b a^{n_3} b \cdots$ as $G = \mathrm{Copref}(w) \cap \{a, b\}^*\{b\}$, which was pointed out in [1]. Thus by the same argument, we can say that any coprefix language $L$ is REG-measurable ($\mu_{\mathrm{REG}}(L) = 1$, in particular).

For coprefix languages, the following nice “gap theorem” holds.

▶ Theorem 28 (Autebert–Flajolet–Gabarro [1]). Let $w \in A^\omega$ be an infinite word generated by an iterated morphism, i.e., $w = h(w) = h^\omega(a)$ for some monoid morphism $h : A^* \to A^*$ and letter $a \in A$. Then for the coprefix language $L = \mathrm{Copref}(w)$ there are only two possibilities:
1. $L$ is a regular language.
2. $L$ is an inherently ambiguous context-free language.

This means that we can construct, by finding some suitable morphism $h$, many examples of inherently ambiguous context-free languages.

K: A Language with Transcendental Density

We now show that the language $K$ defined by Kemp [15] (recall that the definition of $K$ appeared in Theorem 15) is REG-measurable.
We will actually show a more general result regarding the following type of languages.

▶ Definition 29. Let $L \subseteq A^*$ be a language and $c \notin A$ be a letter. We call the language $L\{c\}(A \cup \{c\})^*$ over $A \cup \{c\}$ the suffix extension of $L$ by $c$.

▶ Theorem 30. The suffix extension $L' \subseteq (A \cup \{c\})^*$ of any language $L \subseteq A^*$ by $c \notin A$ is REG-measurable.

Proof. Let $B = A \cup \{c\}$ and $k = \#(B)$. We first show that $L'$ has a natural density. For any words $u, v \in L$ with $u \neq v$, the two languages $u\{c\}B^*$ and $v\{c\}B^*$ are disjoint, and clearly

\[ \#(u\{c\}B^* \cap B^n)/\#(B^n) = \#(u\{c\}B^{n-|u|-1})/\#(B^n) = k^{n-|u|-1}/k^n = k^{-(|u|+1)} \]

holds for $n > |u|$, thus $\delta_B(u\{c\}B^*) = k^{-(|u|+1)}$. The natural density of $L'$ is

\[
\begin{aligned}
\delta_B(L') &= \lim_{n \to \infty} \frac{\#(L' \cap B^n)}{\#(B^n)} = \lim_{n \to \infty} \frac{\#\big(\bigcup_{w \in L} (w\{c\}B^* \cap B^n)\big)}{\#(B^n)} \\
&= \lim_{n \to \infty} \frac{\sum_{w \in L} \#(w\{c\}B^* \cap B^n)}{\#(B^n)} = \lim_{n \to \infty} \sum_{w \in L \cap A^{<n}} k^{-(|w|+1)} \quad [\ldots]
\end{aligned}
\]

For each $n \in \mathbb{N}$, the language $L_n \stackrel{\mathrm{def}}{=} \bigcup_{w \in L \cap A^{<n}} w\{c\}B^*$ […]

▶ Remark 32. Theorem 30 indicates that REG-measurability is a quite relaxed property in some sense: even for a non-recursively-enumerable language, its suffix extension is still non-recursively-enumerable but REG-measurable.

The same proof method works for the prefix extension, and the infix extension is also REG-measurable.

▶ Theorem 33. Let $c \notin A$ and $A' = A \cup \{c\}$. The prefix extension $L' = A'^*\{c\}L$ of any language $L \subseteq A^*$ is REG-measurable. Also, the infix extension $L'' = A'^*\{c\}L\{c\}A'^*$ of any language $L \subseteq A^*$ is REG-measurable; in particular, $\mu_{\mathrm{REG}}(L'') = 0$ if $L = \emptyset$, and $\mu_{\mathrm{REG}}(L'') = 1$ otherwise.

Proof. The prefix extension of $L$ is just the reverse of the suffix extension of the reverse of $L$, so the same proof method trivially works. For the infix extension $L'' = A'^*\{c\}L\{c\}A'^*$, if $L = \emptyset$ then $L''$ is also empty and thus $\mu_{\mathrm{REG}}(L'') = 0$. Further, if $L \neq \emptyset$ then there is a word $w \in L$ and thus $A'^* cwc A'^* \subseteq L''$ holds, which means that $\delta_{A'}(A'^* cwc A'^*) = 1$ by the infinite monkey theorem, and we have $\mu_{\mathrm{REG}}(L'') = 1$. ◀
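The identity $\delta_B(u\{c\}B^*) = k^{-(|u|+1)}$ from the proof of Theorem 30 can be tested on a small instance. The following Python sketch is a hedged illustration only: the sample language $\{ a^i b^i \mid i \ge 0 \}$, the alphabets and all function names are choices made for this example and do not come from the paper. It compares a brute-force count of the words of length $n$ in the suffix extension with the finite sum of $k^{-(|w|+1)}$ over the words $w$ of the sample language with $|w| < n$.

```python
from fractions import Fraction
from itertools import product

A = "ab"        # base alphabet (illustrative choice)
C = "c"         # fresh letter, not in A
B = A + C       # extended alphabet, so k = #(B) = 3

def in_sample_language(w: str) -> bool:
    """Membership in the sample language { a^i b^i | i >= 0 } over A."""
    half, rem = divmod(len(w), 2)
    return rem == 0 and w == "a" * half + "b" * half

def in_suffix_extension(w: str) -> bool:
    """Membership in L{c}B*: the part before the first c must lie in L."""
    pos = w.find(C)
    return pos >= 0 and in_sample_language(w[:pos])

def ratio_brute_force(n: int) -> Fraction:
    """#(L' ∩ B^n) / #(B^n), by enumerating all words of length n over B."""
    hits = sum(in_suffix_extension("".join(t)) for t in product(B, repeat=n))
    return Fraction(hits, len(B) ** n)

def ratio_formula(n: int) -> Fraction:
    """Sum of k^-(|w|+1) over all words w of the sample language with |w| < n."""
    k = len(B)
    total = Fraction(0)
    for length in range(n):
        for t in product(A, repeat=length):
            if in_sample_language("".join(t)):
                total += Fraction(1, k ** (length + 1))
    return total

print(ratio_brute_force(7), ratio_formula(7))   # 820/2187 820/2187
```

For this sample language the partial sums form a geometric series, $\sum_{i \ge 0} 3^{-(2i+1)} = 3/8$, so the ratios above approach $3/8$ as $n$ grows.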
REG-Gap

In Section 4.1, we showed that the language $L_{\{a,b\}}(a, b)$ is REG-measurable. On the other hand, by the result of Eisman–Ravikumar [9], we will see that the closely related language

\[ M \stackrel{\mathrm{def}}{=} \{ w \in \{a, b\}^* \mid |w|_a > |w|_b \}, \]

sometimes called the majority language, is not REG-measurable. This contrast is interesting.

▶ Theorem 34 (Eisman–Ravikumar [9, 10]). Let $A = \{a, b\}$ and $L \subseteq A^*$ be a regular language. Then $M \subseteq L$ implies

\[ \limsup_{n \to \infty} \#(\overline{L} \cap A^n) / \#(A^n) = 0. \]

One can easily observe that $\limsup_{n \to \infty} \#(\overline{L} \cap A^n)/\#(A^n) = 0$ if and only if $\delta_A(\overline{L}) = 0$, which means that any regular superset of $M$ is co-null. Thus the above theorem implies that both $M$ and $\overline{M}$ are $\mathrm{REG}^+$-immune, hence we have:

▶ Corollary 35. $M$ has full REG-gap.

By using the infinite monkey theorem and some probabilistic arguments, we can generalise the previous theorem as follows.

▶ Theorem 36. For any $m \ge 1$, the following language over $A = \{a, b\}$,

\[ M_m \stackrel{\mathrm{def}}{=} \{ w \in A^* \mid |w|_a > m \cdot |w|_b \}, \]

has full REG-gap, and $\delta_A(M_m) = 1/2$ if $m = 1$, otherwise $\delta_A(M_m) = 0$.

Proof. First we prove that no non-null regular language $L$ can be a subset of $M_m$. Let $\eta : A^* \to M$ be the syntactic morphism of $L$, where $M$ is the syntactic monoid of $L$, and let

\[ c = \max_{m \in M} \min_{w \in \eta^{-1}(m)} |w| \]

(this is a well-defined natural number since $M$ is finite). By the infinite monkey theorem, $L$ being non-null implies that $L$ has no forbidden word, and thus for the word $b^{2c}$ there exist words $x, y$ such that $x b^{2c} y$ is in $L$. We can assume that $|x|, |y| \le c$ without loss of generality by the definition of $c$, which implies $|x b^{2c} y|_a \le |x| + |y| \le 2c \le |x b^{2c} y|_b$, hence $x b^{2c} y \notin M_m$. Thus $L \not\subseteq M_m$ and $\underline{\mu}_{\mathrm{REG}}(M_m) = 0$.
By using the same argument, we can prove that $\underline{\mu}_{\mathrm{REG}}(\overline{M_m}) = 0$, and hence $M_m$ has full REG-gap.

In the case $m = 1$, $\delta_A(M_1) = \delta_A(\overline{M_1}) = 1/2$ holds. For $m \ge 2$, it is enough to show that $\delta_A(M_2) = 0$ holds (since $M_m \subseteq M_2$ for any $m \ge 2$):

\[ \delta_A(M_2) = \lim_{n \to \infty} \frac{\#(\{ w \in A^n \mid |w|_a > 2|w|_b \})}{2^n} = \lim_{n \to \infty} \frac{\#(\{ w \in A^n \mid |w|_a > 2n/3 \})}{2^n} \le \lim_{n \to \infty} \Pr(|X_n - n/2| > n/6) = 0 \]

where $\Pr(|X_n - n/2| > n/6)$ means the probability that the absolute value of the difference between the number $X_n$ of occurrences of $a$ in a randomly chosen word of length $n$ and its mean value $n/2$ is larger than $n/6$; it tends to zero by the weak law of large numbers. ◀

REG-Immeasurability of Primitive Words

A non-empty word $w \in A^+$ is said to be primitive if $u^n = w$ implies $u = w$ for any $u \in A^+$ and $n \in \mathbb{N}$. The set of all primitive words over $A$ is denoted by $Q_A$. Because the case $\#(A) = 1$ is meaningless ($Q_A = A$ in this case), hereafter we always assume $\#(A) \ge 2$. Whether $Q_A$ is context-free or not is a well-known long-standing open problem posed by Dömösi, Horváth and Ito [8]. Reis [18] proved $Q_A^2 = A^+ \setminus \{ a^n \mid a \in A,\ n \neq 2 \}$, which intuitively means that every non-empty word $w$ that is not a power of a letter is a product of two primitive words. From this result one may think that $Q_A$ is “very large” in some sense. Actually, $Q_A$ is somewhat “large” (it is dense in the sense of Definition 9), but we can show a stronger property as follows.

▶ Theorem 37. $\delta_A(Q_A) = 1$.

Proof. It is enough to show that $\delta_A(\overline{Q_A}) = 0$ holds. One can easily observe that any natural number $n \in \mathbb{N}$ has at most $2\sqrt{n}$ divisors. In addition, any non-primitive word $w = v^m$ of length $n$ is uniquely determined by $v$ (since $m = n/|v|$), and $|v| \le n/2$. Hence the number of non-primitive words of length $n$ satisfies

\[ \#(\overline{Q_A} \cap A^n) \le 2\sqrt{n} \sum_{i=0}^{\lfloor n/2 \rfloor} \#(A^i) \le 2\sqrt{n} \cdot \#(A)^{\lfloor n/2 \rfloor + 1}. \]
By using the above estimation, we can deduce that
\[ \frac{\#(\overline{Q_A} \cap A^n)}{\#(A^n)} \le \frac{2\sqrt{n} \cdot \#(A)^{\lfloor n/2 \rfloor + 1}}{\#(A)^n} \le \frac{2\sqrt{n}}{\#(A)^{n/2 - 1}}, \]
and it tends to 0 if $n$ tends to infinity (since we assume $\#(A) \ge 2$). Hence $\delta_A(\overline{Q_A}) = 0$. ◀

While $Q_A$ is "very large" (co-null) as stated above, we can also prove that $Q_A$ is REG$^{+}$-immune. The proof relies on an analysis of the structure of the syntactic monoid of a non-null regular language. We assume that the reader has a basic knowledge of semigroup theory (cf. [17]), in particular Green's relations $\mathcal{J}, \mathcal{R}, \mathcal{L}, \mathcal{H}$ and Green's theorem (an $\mathcal{H}$-class $H$ in a semigroup $S$ is a subgroup of $S$ if and only if $H$ contains an idempotent).

▶ Theorem 38. $\mu_{\mathrm{REG}}(Q_A) = 0$.

Proof. Let $L$ be a regular language over $A$ with a positive density $\delta_A(L) > 0$. It is enough to show that $L$ must contain a non-primitive word. We consider the syntactic morphism $\eta : A^* \to M$ and the syntactic monoid $M$ of $L$, and let $S$ be a subset of $M$ satisfying $\eta^{-1}(S) = L$.

We first show that $S$ contains a $\le_{\mathcal{J}}$-minimal element $t$. This is rather clear because, for any non-$\le_{\mathcal{J}}$-minimal element $s$, its language $\eta^{-1}(s) \subseteq A^*$ is null: that $s$ is non-$\le_{\mathcal{J}}$-minimal means that there is another element $t$ such that $t <_{\mathcal{J}} s$ (i.e., $MtM \subsetneq MsM$), whence $s \notin MtM$, which implies that any word $w \in \eta^{-1}(t)$ is a forbidden word of $\eta^{-1}(s)$. Thus by the infinite monkey theorem $\eta^{-1}(s)$ is null. Consequently, if $S$ contained no $\le_{\mathcal{J}}$-minimal element, $L = \eta^{-1}(S)$ would be a finite union of null languages and hence null.

Clearly, we have $t^n \le_{\mathcal{J}} t$ and thus $t \mathrel{\mathcal{J}} t^n$ holds for any $n > 0$ by the $\le_{\mathcal{J}}$-minimality of $t$. $t \mathrel{\mathcal{J}} t^n$ implies that there are $x, y$ such that $x t^n y = t$. Since $M$ is finite, $x^m$ is an idempotent for some $m > 0$ (i.e., $x^m = x^{2m}$). Thus we obtain
\[ t = x t^n y = x\, t\, (t^{n-1} y) = x^2\, t\, (t^{n-1} y)^2 = \cdots = x^m t (t^{n-1} y)^m = x^m x^m t (t^{n-1} y)^m = x^m t, \]
whence $t = x^m t (t^{n-1} y)^m = t (t^{n-1} y)^m = t^n \bigl( y (t^{n-1} y)^{m-1} \bigr)$. It follows that $t \mathrel{\mathcal{R}} t^n$. Dually, we also obtain $t \mathrel{\mathcal{L}} t^n$ and hence we can deduce that $t \mathrel{\mathcal{H}} t^n$ holds. By the finiteness of $M$, there exists some $n > 0$ such that $t^n$ is an idempotent.
Thanks to Green's theorem, the $\mathcal{H}$-class $H_t$ of $t$ is a subgroup of $M$ with the identity element $t^n$. Then for any $\varepsilon \ne w \in \eta^{-1}(t)$ (such a word always exists since $t$ is $\le_{\mathcal{J}}$-minimal), we have
\[ \eta(w^{n+1}) = t^{n+1} = t^n \cdot t = t \in S, \]
which means that $L \supseteq \eta^{-1}(t)$ contains a non-primitive word $w^{n+1}$. ◀

▶ Corollary 39 (of Theorems 37 and 38). $Q_A$ has full REG-gap.

In this paper we proposed REG-measurability and showed that several context-free languages are REG-measurable, with the exception of $M_m$. Interestingly, languages such as $G$ and $K$, which have been considered complex from a combinatorial viewpoint, turn out to be easy to approximate asymptotically by regular languages. It is also interesting that a modified majority language $M_2$ is merely deterministic context-free, yet it is complex from a measure-theoretic viewpoint. Its complement $\overline{M_2}$ is also deterministic context-free, and it is co-null but REG$^{+}$-immune (i.e., has full REG-gap). This means that $\overline{M_2}$ is as complex as $Q_A$ from the viewpoint of REG-measurability.

The following fundamental problems are still open and we consider these to be future work.

▶ Problem 40. Can we give an alternative characterisation of the null (resp. co-null) context-free languages (like Theorem 10)?

▶ Problem 41. Can we give an alternative characterisation of the REG-measurable context-free languages?

▶ Problem 42. Can we find a language class that can "separate" $Q_A$ and CFL? That is, is there a class $\mathcal{C}$ such that $Q_A$ has full $\mathcal{C}$-gap but no co-null context-free language has full $\mathcal{C}$-gap, or such that $Q_A$ is $\mathcal{C}$-immeasurable but any co-null context-free language is $\mathcal{C}$-measurable?

Our results (Theorems 36, 37 and 38) tell us that the class REG of regular languages cannot separate $Q_A$ and CFL. However, it is still open whether the situation is the same when $\mathcal{C}$ = DetCFL, UnCFL or another extension of the regular languages. Notice that if the answer to Problem 42 is "yes", then $Q_A$ is not context-free.
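The density statements of Theorems 36 and 37 can be checked numerically for small lengths. The following Python sketch is only an illustration and not part of the proofs above; the function names (`is_primitive`, `count_primitive`, `primitive_ratio`, `majority_ratio`) are hypothetical. It counts primitive words through the identity $k^n = \sum_{d \mid n} \#(Q_A \cap A^d)$ for $\#(A) = k$ (every non-empty word is a power of a unique primitive word), and counts the words of $M_m$ of length $n$ by summing binomial coefficients.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product
from math import comb


def is_primitive(w: str) -> bool:
    """Direct check of the definition: w is non-empty and not a power u^r with r >= 2."""
    n = len(w)
    if n == 0:
        return False
    # w is non-primitive iff w = u^(n/d) for some proper divisor d of n, with u = w[:d].
    return not any(n % d == 0 and w == w[:d] * (n // d) for d in range(1, n))


@lru_cache(maxsize=None)
def count_primitive(k: int, n: int) -> int:
    """Number of primitive words of length n >= 1 over a k-letter alphabet.

    Subtracts, for each proper divisor d of n, the words whose primitive root has length d.
    """
    return k ** n - sum(count_primitive(k, d) for d in range(1, n) if n % d == 0)


def primitive_ratio(k: int, n: int) -> Fraction:
    """Exact ratio #(Q_A ∩ A^n) / #(A^n) for #(A) = k."""
    return Fraction(count_primitive(k, n), k ** n)


def majority_ratio(m: int, n: int) -> Fraction:
    """Exact ratio #(M_m ∩ A^n) / 2^n for A = {a, b}.

    A word with j occurrences of a has n - j occurrences of b, so it lies in M_m iff j > m * (n - j).
    """
    return Fraction(sum(comb(n, j) for j in range(n + 1) if j > m * (n - j)), 2 ** n)


if __name__ == "__main__":
    # Cross-check the counting identity against brute-force enumeration for one small case.
    assert count_primitive(2, 6) == sum(is_primitive("".join(w)) for w in product("ab", repeat=6))
    for n in (4, 8, 16, 32, 64):
        print(n, float(primitive_ratio(2, n)), float(majority_ratio(1, n)), float(majority_ratio(2, n)))
```

For growing $n$ the first ratio approaches 1, the second approaches 1/2 and the third approaches 0, in line with $\delta_A(Q_A) = 1$, $\delta_A(M_1) = 1/2$ and $\delta_A(M_2) = 0$. Such a computation says nothing about REG-gaps, which concern regular subsets and supersets rather than densities.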
Acknowledgement: The author would like to thank Takanori Maehara (RIKEN AIP), whose helpful discussions were an enormous help. The author also thanks the anonymous reviewers for many valuable comments. This work was supported by JSPS KAKENHI Grant Number JP19K14582.

References

[1] Jean-Michel Autebert, Philippe Flajolet, and Joaquim Gabarro. Prefixes of infinite words and ambiguous context-free languages. Information Processing Letters, 25(4):211–216, 1987.
[2] Jean Berstel. Sur la densité asymptotique de langages formels. In International Colloquium on Automata, Languages and Programming (ICALP, 1972), pages 345–358, France, 1973. North-Holland.
[3] Jean Berstel, Dominique Perrin, and Christophe Reutenauer. Codes and Automata. Encyclopedia of Mathematics and its Applications. Cambridge University Press, 2009.
[4] Robert C. Buck. The measure theoretic approach to density. American Journal of Mathematics, 68(4):560–580, 1946.
[5] Cezar Câmpeanu, Nicolae Sântean, and Sheng Yu. Minimal cover-automata for finite languages. Theoretical Computer Science, 267(1):3–16, 2001.
[6] Brendan Cordy and Kai Salomaa. On the existence of regular approximations. Theoretical Computer Science, 387(2):125–135, 2007.
[7] Michael Domaratzki. Minimal covers of formal languages. Master's thesis, University of Waterloo, 2001.
[8] Pál Dömösi, Sándor Horváth, and Masami Ito. On the connection between formal languages and primitive words. pages 59–67, 1991.
[9] Gerry Eisman and Bala Ravikumar. Approximate recognition of non-regular languages by finite automata. In Twenty-Eighth Australasian Computer Science Conference (ACSC 2005), volume 38 of CRPIT, Newcastle, Australia, 2005. ACS.
[10] Gerry Eisman and Bala Ravikumar. On approximating non-regular languages by regular languages. Fundamenta Informaticae, 110:125–142, 2011.
[11] Philippe Flajolet. Ambiguity and transcendence. In Automata, Languages and Programming, pages 179–188, Berlin, Heidelberg, 1985. Springer Berlin Heidelberg.
[12] Philippe Flajolet. Analytic models and ambiguity of context-free languages. Theoretical Computer Science, 49(2):283–309, 1987.
[13] Martin Kappes and Chandra M. R. Kintala. Tradeoffs between reliability and conciseness of deterministic finite automata. Journal of Automata, Languages and Combinatorics, 9(2–3):281–292, 2004.
[14] Martin Kappes and Frank Nießner. Succinct representations of languages by DFA with different levels of reliability. Theoretical Computer Science, 330(2):299–310, 2005.
[15] Rainer Kemp. A note on the density of inherently ambiguous context-free languages. Acta Informatica, 14(3):295–298, 1980.
[16] Piet van Mieghem. Graph Spectra for Complex Networks. Cambridge University Press, 2010.
[17] Jean-Éric Pin. Mathematical foundations of automata theory, 2012.
[18] C.M. Reis and H.J. Shyr. Some properties of disjunctive languages on a free monoid. Information and Control, 37(3):334–344, 1978.
[19] Arto Salomaa and Matti Soittola. Automata Theoretic Aspects of Formal Power Series. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1978.
[20] Ryoma Sin'ya. An automata theoretic approach to the zero-one law for regular languages: Algorithmic and logical aspects. In Proceedings Sixth International Symposium on Games, Automata, Logics and Formal Verification, GandALF 2015, pages 172–185, 2015.