Approximate Learning of Limit-Average Automata
Jakub Michaliszyn
University of Wrocław, [email protected]
Jan Otop
University of Wrocław, [email protected]
Abstract
Limit-average automata are weighted automata on infinite words that use the average to aggregate the weights seen in infinite runs. We study approximate learning problems for limit-average automata in two settings: passive and active. In the passive learning case, we show that limit-average automata are not PAC-learnable, as samples must be of exponential size to provide (with good probability) enough details to learn an automaton. We also show that the problem of finding an automaton that fits a given sample is NP-complete. In the active learning case, we show that limit-average automata can be learned almost-exactly, i.e., we can learn in polynomial time an automaton that is consistent with the target automaton on almost all words. On the other hand, we show that the problem of learning an automaton that approximates the target automaton (with perhaps fewer states) is NP-complete. The abovementioned results are shown for the uniform distribution on words. We briefly discuss learning over different distributions.

2012 ACM Subject Classification Theory of computation → Automata over infinite objects; Theory of computation → Quantitative automata
Keywords and phrases weighted automata, learning, expected value
Funding
The National Science Centre (NCN), Poland under grant 2017/27/B/ST6/00299.
1 Introduction

Quantitative verification has been proposed to verify non-Boolean system properties such as performance, energy consumption, fault-tolerance, etc. There are two main challenges in applying quantitative verification in practice. First, formalization of quantitative properties is difficult, as specifications are given directly as weighted automata, which extend finite automata with weights on transitions [18, 15]. Quantitative logics, which could facilitate the specification task, have either limited expressive power or an undecidable model-checking problem [11, 13]. Second, there is little research on abstraction in the quantitative setting [14], which would allow us to reduce the size of quantitative models presented as weighted automata.

We approach both problems using learning of weighted automata. We can apply the learning framework to facilitate writing quantitative specifications and make it more accessible to non-experts. For abstraction purposes, we study approximate learning, having in mind that approximation of a system can significantly reduce its size.

We focus on weighted automata over infinite words which compute the long-run average of weights [15]. Such automata express interesting quantitative properties [15, 23] and admit polynomial-time algorithms for basic decision questions [15]. Some of the interesting properties can only be expressed by non-deterministic automata [15], but every non-deterministic weighted automaton is approximated by some deterministic weighted automaton on almost all words [29]. This means that, by allowing some small margin of error, we can focus on deterministic automata while still being able to model important system properties, such as minimal response time, minimal number of errors, the edit distance problem [23], and the specification repair framework from [10].

We therefore focus on approximate learning under probabilistic semantics, which corresponds to the average-case analysis in quantitative verification [16]. We treat words as random events and functions defined by weighted automata as random variables. In this setting, an automaton A' ε-approximates A if the expected difference between A and A' (over all words) is bounded by ε. We consider two learning approaches: passive and active.

In passive learning, we think of an automaton as a black-box. This might be a program, a specification, a working correct system to be replaced or abstracted, or even only some examples of how the automaton should work. Our goal is to construct a weighted automaton based on this black-box model by observing only inputs and outputs of the model.

In active learning, we assume the presence of an interactive teacher that reveals the values of given examples and verifies whether a provided automaton is as intended; if not, the teacher responds with a witness, which is a word showing the difference between the constructed automaton and the intended automaton. This does not necessarily mean that the teacher is familiar with the weighted automata formalism: to provide a witness, they may simply run the constructed automaton in parallel with the black-box and, if at any point their behaviors differ, provide the appropriate input to the learning algorithm.
Our contributions. We start with a discussion of the setting for learning problems. The first step is to find a suitable representation for samples; not all infinite words have finite representations. A natural idea is to use ultimately periodic words in samples; we study such samples. However, the probability distribution over ultimately periodic words is very different from the uniform distribution over infinite words. Therefore, we also consider samples consisting of a finite word u labeled with the expected value over all extensions of u. The probability distribution over such samples is closer to the uniform distribution over infinite words.

Then we study the passive learning problem. We show that for a unique characterization of an automaton, we need a sample of exponential size. We study the complexity of the problem of finding an automaton that fits the whole sample. The problem, without additional restrictions, has a trivial and overfitting solution. To mitigate this, we impose bounds on the automaton size, and show that then the problem is NP-complete.

For active learning, we show that the problem of learning an almost-exact automaton can be solved in polynomial time, while finding an automaton of bounded size that approximates the target one cannot be done in polynomial time unless P = NP. We conclude with a discussion of different probability distributions.
Related work. Probably approximately correct (PAC) learning, introduced in [35], is a general passive learning framework applied to various objects (DNF/CNF formulas, decision trees, automata, etc.) [26]. PAC learning of deterministic finite automata (DFA) has been extensively studied despite negative indicators. First, the sample fitting problem for DFA, where the task is to construct a minimal-size DFA consistent with a given sample, has been shown NP-complete [21]. Even approximate sample fitting, where we ask for a DFA at most polynomially greater than a minimal-size DFA, remains NP-complete [34]. Second, it has been shown that the existence of a polynomial-time PAC learning algorithm for DFA would break certain cryptographic systems (such as RSA) and hence it is unlikely [25]. Despite these negative results, it has been empirically shown that DFA can be efficiently learned [27]. In particular, if we assume structural completeness of a sample, then it determines a minimal DFA [32]. Pitt posed the question whether DFA are PAC-learnable under the uniform distribution [33], which remains open [2].

Angluin showed that DFA can be learned in polynomial time if the learning algorithm can ask membership and equivalence queries [1]. This approach proved to be very fruitful and versatile. Angluin's algorithm has been adapted to learn NFA [12], automata over infinite words [3, 20], nominal automata over infinite alphabets [31], weighted automata over words [9] and weighted automata over trees [22, 28].

Recently, there has been a renewed interest in learning weighted automata [8, 28, 6, 5, 7]. These results apply to weighted automata over fields [18], which work over finite words. We, however, consider limit-average automata, which work over infinite words and cannot be defined using a field or even a semiring. Furthermore, we consider weighted (limit-average) automata under probabilistic semantics [16, 29], i.e., we consider the functions represented by automata as random variables.

2 Preliminaries

Given a finite alphabet Σ of letters, a word w is a finite or infinite sequence of letters. We denote the set of all finite words over Σ by Σ*, and the set of all infinite words over Σ by Σ^ω. For a word w, we define w[i] as the i-th letter of w, and we define w[i, j] as the subword w[i] w[i+1] ... w[j] of w. We use the same notation for vectors and sequences; we assume that sequences start with index 0.

A deterministic finite automaton (DFA) is a tuple (Σ, Q, q₀, F, δ) consisting of the alphabet Σ, a finite set of states Q, the initial state q₀ ∈ Q, a set of final states F, and a transition function δ : Q × Σ → Q.

A (deterministic) LimAvg-automaton extends a DFA with a function C : δ → ℚ that defines rational weights of transitions. The size of such an automaton A, denoted by |A|, is the sum of the number of states of A and the lengths of the binary encodings of all the weights.

A run π of a LimAvg-automaton A on a word w is a sequence of states π[0] π[1] ... such that π[0] is the initial state and for every i > 0 we have δ(π[i−1], w[i]) = π[i]. We do not consider ω-accepting conditions and assume that all infinite runs are accepting. Every run π of A on an infinite word w defines a sequence of weights C(π) of successive transitions of A, i.e., C(π)[i] = C(π[i−1], w[i], π[i]). The value of the run π is then defined as LimAvg(π) = lim sup_{k→∞} Avg(C(π)[0, k]), where for finite runs π we have Avg(C(π)) = Sum(C(π)) / |C(π)|.
The value of a word w assigned by the automaton A, denoted by L_A(w), is the value of the run of A on w.

We consider three classes of probability measures on words over the alphabet Σ.

- U_n, for n ∈ ℕ, is the uniform probability distribution on Σⁿ, assigning each word the probability |Σ|^{-n}.
- G(λ), for a termination probability λ ∈ ℚ⁺ ∩ (0, 1), is the distribution on Σ* such that for u ∈ Σ* we have G(λ)(u) = |Σ|^{-|u|} · (1 − λ)^{|u|} · λ. Observe that the probability of words of the same length is equal, and the probability of generating a word of length k is (1 − λ)^k · λ. We can consider this as a process generating finite words, which stops after each step with probability λ.
- U_∞ is the uniform probability measure on Σ^ω. Formally, we define U_∞ on basic sets uΣ^ω = { uw | w ∈ Σ^ω } as follows: U_∞(uΣ^ω) = |Σ|^{-|u|}. Then, U_∞ is the unique extension to all Borel sets in Σ^ω considered with the product topology [19, 4].
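To make the semantics concrete, the following minimal sketch (ours, not from the paper) evaluates a deterministic LimAvg-automaton on an ultimately periodic word u v^ω; the dictionary encoding of transitions is an illustrative assumption. Since the run is eventually periodic, the lim sup of the partial averages equals the average weight over the cycle of states entered by repeatedly reading v.

```python
from fractions import Fraction

# Transitions encoded as delta[(state, letter)] = (next_state, weight);
# this encoding is our own assumption, chosen for the sketch.
def run(delta, state, word):
    """Read a finite word; return the end state and the weights seen."""
    weights = []
    for letter in word:
        state, w = delta[(state, letter)]
        weights.append(Fraction(w))
    return state, weights

def limavg_value(delta, initial, u, v):
    """Value of u v^omega: the average weight over the cycle of states
    that repeatedly reading v eventually enters (v must be nonempty)."""
    state, _ = run(delta, initial, u)
    seen, trace = {}, []
    while state not in seen:            # iterate v until a state repeats
        seen[state] = len(trace)
        trace.append(state)
        state, _ = run(delta, state, v)
    total, count = Fraction(0), 0
    for s in trace[seen[state]:]:       # states on the cycle
        _, ws = run(delta, s, v)
        total += sum(ws)
        count += len(ws)
    return total / count

# toy automaton: a single state with weight 1 on 'a' and 0 on 'b'
delta = {("q", "a"): ("q", 1), ("q", "b"): ("q", 0)}
print(limavg_value(delta, "q", "b", "ab"))   # 1/2
```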
Automata as random variables. A weighted automaton A defines the measurable function L_A : Σ^ω → ℝ that assigns values to words. We interpret such functions as random variables w.r.t. the probability measure U_∞. Hence, for a given automaton A, we consider the following quantities:

- E(A): the expected value of the random variable L_A defined by A w.r.t. the uniform distribution U_∞ on Σ^ω, and
- E(A | uΣ^ω): the conditional expected value, defined for U_∞ as the expected value of L_u such that L_u(w) = L_A(uw).

We consider automata as generators of random variables: two automata are almost equivalent if they define almost equal random variables. Formally, we say that A₁ and A₂ are almost equivalent if and only if for almost all words w we have L_{A₁}(w) = L_{A₂}(w). Note that almost all words means all except for words from some set Y of probability 0.

LimAvg-automata considered over probability distributions are equivalent to Markov chains with long-run average objectives, presented in [4, Section 10.5.2].
Theorem 1 ([4]). Let A be a LimAvg-automaton. (1) If A is strongly connected, then almost all words have the same value, equal to E(A). (2) For almost all words w, the run on w eventually reaches some bottom strongly connected component (SCC) of A.

Theorem 1 has important consequences. First, we can contract each bottom SCC to a single state with self-loops of the same weight x, being the expected value of that SCC. We refer to x as the value of that SCC. Such an operation does not affect almost equivalence. Second, while reasoning about LimAvg-automata, we can neglect all weights except for the values of bottom SCCs. We will omit weights other than the values of bottom SCCs.
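These consequences suggest a simple procedure for E(A). The sketch below is our own illustration (not the paper's algorithm): it assumes the bottom SCCs have already been contracted to absorbing states whose self-loops carry the SCC values, and computes E(A) by value iteration over the transient states.

```python
def expected_value(delta, alphabet, initial, rounds=10_000):
    """E(A) for an automaton whose bottom SCCs are contracted to absorbing
    states; delta[(state, letter)] = (next_state, weight) as before."""
    states = {q for (q, _) in delta}
    absorbing = {q for q in states
                 if all(delta[(q, a)][0] == q for a in alphabet)}
    # an absorbing state's value is the weight of its self-loops
    E = {q: (delta[(q, alphabet[0])][1] if q in absorbing else 0.0)
         for q in states}
    for _ in range(rounds):   # E(s) = average of E(delta(s, a)) over letters
        for q in states - absorbing:
            E[q] = sum(E[delta[(q, a)][0]] for a in alphabet) / len(alphabet)
    return E[initial]

# a word starting with 'a' is worth 1, a word starting with 'b' is worth 0
delta = {("s", "a"): ("p", 0), ("s", "b"): ("r", 0),
         ("p", "a"): ("p", 1), ("p", "b"): ("p", 1),
         ("r", "a"): ("r", 0), ("r", "b"): ("r", 0)}
print(expected_value(delta, "ab", "s"))   # 0.5
```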
Samples.
A sample is a set of labeled examples of some (hidden) function L : Σ^ω → ℝ. In the classical automata-learning approach, words are finite, and hence they can be presented as examples along with the information whether the word belongs to the hidden language or not. In the infinite-word case, however, words cannot be given directly to a learning algorithm. We present two alternative solutions to this problem. One is to restrict examples to ultimately periodic words, which have a finite presentation, and the other is to consider finite words and ask for conditional expected values. We discuss both approaches below. To distinguish samples with different types of labeled examples, we call them U-samples, E-samples and (E, n)-samples.

Ultimately periodic words. Consider an example being an ultimately periodic word uv^ω. It can be presented as a pair of finite words (u, v), and we consider labeled examples (u, v, x), where u, v ∈ Σ* and x = L(uv^ω). A set of labeled examples (u, v, x) is called a U-sample. To draw a random U-sample, we consider finite words u, v to be selected independently at random according to distributions G(λ₁) and G(λ₂) for some λ₁, λ₂. For such a set of pairs of words, we label them according to the function L.

Conditional expected values. We consider examples which are finite words u ∈ Σ*. A labeled example is a pair (u, x), where x = E(L | uΣ^ω) is the conditional expected value of L under the condition that random words start with u. For such labeled examples we consider E-samples, consisting of labeled words of various lengths, and (E, n)-samples, consisting of words of length n. We assume that finite words for random E-samples are drawn according to a distribution G(λ) for some λ, and finite words for random (E, n)-samples are drawn according to the uniform distribution U_n.

We only consider minimal consistent samples, i.e., samples that do not contain examples whose value can be computed from other examples in the sample. For instance, {(a, a, 0), (a, aa, 1)} is an inconsistent U-sample (both examples describe the word a^ω), and {(aa, 0), (ab, 0), (a, 1)} is an inconsistent E-sample over {a, b} (the label of a must be the average of the labels of aa and ab).
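A small sketch of how such random samples could be drawn; the label function is an assumption standing in for the hidden L:

```python
import random

def random_word(sigma, lam, rng=random):
    """Draw a finite word from G(lam): after each step, stop w.p. lam."""
    w = []
    while rng.random() >= lam:
        w.append(rng.choice(sigma))
    return "".join(w)

def random_u_sample(sigma, lam1, lam2, size, label):
    """U-sample of triples (u, v, label(u, v)); label(u, v) is assumed
    to return L(u v^omega). The period v is redrawn until nonempty."""
    sample = []
    for _ in range(size):
        u = random_word(sigma, lam1)
        v = random_word(sigma, lam2)
        while not v:
            v = random_word(sigma, lam2)
        sample.append((u, v, label(u, v)))
    return sample
```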
Remark 2 (Incompatible distributions). Note that the distribution on ultimately periodic words differs from the uniform distribution on infinite words. The set of ultimately periodic words is countable and hence it has probability 0 (according to the distribution U_∞). Moreover, almost all infinite words contain all finite words as infixes, whereas this is not the case for ultimately periodic words under any probability distribution.

Remark 3 (Feasibility of conditional expectation). Consider a
LimAvg-automaton A computing L : Σ^ω → ℝ. For a finite prefix u, we can compute E(L | uΣ^ω) in time polynomial in |A| [4]. If we consider A to be a black-box which can be controlled, then E(L | uΣ^ω) can be approximated in the following way. We pick random words v₁, ..., v_k of length k, compute the partial averages of A on uv₁, ..., uv_k, and then take the average of these values. The probability that this process returns a value ε-close to E(L | uΣ^ω) converges to 1 at an exponential rate in k.
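The procedure of Remark 3 admits a direct Monte Carlo reading. The sketch below is ours, and `black_box` (returning the partial average of the run on a finite word) is an assumed interface to the controlled system:

```python
import random

def estimate_conditional_expectation(black_box, sigma, u, k, rng=random):
    """Estimate E(L | u Sigma^omega): average the partial averages of the
    runs on u extended by k random words of length k."""
    total = 0.0
    for _ in range(k):
        v = "".join(rng.choice(sigma) for _ in range(k))
        total += black_box(u + v)
    return total / k
```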
3 Passive learning

Passive learning corresponds to a scenario with an uncontrolled working black-box system. The learner can only observe the system's output, and its goal is to create an approximate model of the system. This task comprises two problems. The first problem, characterization, is to assess whether the observations cover most, if not all, behaviors of the system. The second one, called sample fitting, is to create a reasonable automaton consistent with the observations. In this section we discuss both problems.

Characterization. A sample can cover only a small part of the system. It is sometimes argued [27, 32], however, that if a sample is large enough, then it is likely to cover most, if not all, important behaviors. We show that for some
LimAvg-automata, randomly drawn samples of size less than exponential are unlikely to demonstrate any probable values.

Let ||S|| denote the sum of the lengths of all the examples in a sample S. A sample distinguishes two automata if it is consistent with exactly one of them. We show the following.

Theorem 4.
For any n there are two automata A^1_n, A^2_n of size n + 4 such that for almost all words w we have |L_{A^1_n}(w) − L_{A^2_n}(w)| = 2, but a random U-sample, E-sample or (E, k)-sample (for any k) S distinguishes A^1_n and A^2_n with probability at most ||S||/2ⁿ.

Proof.
Consider the alphabet {a, b} and n ∈ ℕ. We construct a LimAvg-automaton A^1_n with n + 4 states. We use n + 2 states to find the first occurrence of the infix aⁿb in the standard manner. When such an infix is found, the automaton moves to a state q_a if the following letter is a, and to a state q_b otherwise. In q_a it loops with weight 1 and in q_b it loops with weight −1. All other weights are 0. So A^1_n returns 1 if the first occurrence of aⁿb is followed by a, −1 if it is followed by b, and 0 if there is no aⁿb. The automaton A^2_n has the same structure as A^1_n, but the weights 1 and −1 are swapped.

The automata A^1_n and A^2_n differ over almost all infinite words. Indeed, an infinite word contains the infix aⁿb with probability 1, and so on almost all words one of the automata returns 1 and the other −1.

Consider a random sample S (it can be an E-sample, (E, k)-sample or U-sample). This sample distinguishes the automata A^1_n and A^2_n only if it contains an example with the infix aⁿb (in the case of U-samples, this means that this infix occurs in uvv for some example (u, v)); all other examples are the same for both automata. The probability that S contains aⁿb as an infix of one of its examples is bounded by ||S||/2^{n+1}. Indeed, the number of positions in all words is ||S||, and the probability that aⁿb occurs at some specific position is at most 1/2^{n+1} for all types of samples and any k > 0 (for k < n + 1, the probability is 0). ◀
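For a concrete feel of the bound, the following small dynamic program (ours, purely illustrative) computes the exact probability that a random word over {a, b} of length l contains the infix aⁿb, using the same pattern-tracking automaton as in the proof:

```python
def prob_contains_anb(n, l):
    """Probability that a uniform word in {a,b}^l contains the infix a^n b.
    States 0..n count the current run of a's; state n+1 is accepting."""
    dist = [1.0] + [0.0] * (n + 1)
    for _ in range(l):
        new = [0.0] * (n + 2)
        for s, p in enumerate(dist[:-1]):
            new[min(s + 1, n)] += p / 2              # read 'a'
            new[n + 1 if s == n else 0] += p / 2     # read 'b'
        new[n + 1] += dist[-1]                       # stay accepted
        dist = new
    return dist[-1]

print(prob_contains_anb(20, 10_000))   # ~0.005: even long samples miss a^20 b
```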
Therefore, to be able to distinguish just two automata with probability 1 − ε, we need a sample such that ||S||/2^{n+1} > 1 − ε, which for any fixed ε < 1 requires ||S|| to be exponential in n.

U-samples. If automata A₁, A₂ of size n recognize different languages, then there is a word uw^ω such that L_{A₁}(uw^ω) ≠ L_{A₂}(uw^ω) and the lengths of u and w are bounded by n². Assume for simplicity that λ₁ = λ₂ = λ. If the sample size is at least |Σ|^{2n²} · ln(|Σ|^{2n²}/ε) · (1 − λ)^{−2n²} · λ^{−2}, then with probability 1 − ε a random sample contains all such words, and so distinguishes all the automata of size n.

E-samples. E-samples do not distinguish almost-equivalent automata, hence we cannot learn automata exactly. However, exponential samples are enough to learn automata up to almost equivalence. To see that, consider two automata A₁ and A₂ of size n that are not almost equivalent. Due to Theorem 1, there is a word u ∈ Σ* such that u reaches bottom SCCs in both A₁ and A₂, and these bottom SCCs have different expected values. Using a standard pumping argument, we can reduce the length of u to at most n². So if the sample size is at least |Σ|^{n²} · ln(|Σ|^{n²}/ε) · (1 − λ)^{−n²} · λ^{−1}, then with probability 1 − ε it contains all such words u, and therefore distinguishes all automata of size n.

For (E, n)-samples, the reasoning is similar, assuming n is quadratic in the size of the automaton: the sufficient sample size is |Σ|ⁿ · ln(|Σ|ⁿ/ε).

PAC learning. We discuss the consequences of our results for the probably approximately correct (PAC) model of learning [35]. In the PAC framework, the learning algorithm should work independently of the probability distribution on samples. However, variants of the PAC framework have been considered where the distribution on samples is uniform [24]. In particular, PAC learning of DFA under the uniform distribution over words is a long-standing open problem [33, 2]. We restrict the classical PAC model and assume that observations are drawn according to the distributions U_n, G(λ) (as discussed in Section 2), and the quality of the learned automaton is assessed using the uniform distribution over infinite words U_∞.

Problem 5 (PAC learning under fixed distributions).
Given ε, δ ∈ ℚ⁺, n ∈ ℕ and an oracle returning random labeled examples consistent with some automaton A_T of size n, construct an automaton A such that with probability 1 − ε we have E(|L_A − L_{A_T}|) < δ.

As a consequence of Theorem 4, there is no PAC-learning algorithm for LimAvg-automata with U-samples (resp., E-samples or (E, k)-samples) that uses samples of polynomial size; in particular, there is no such algorithm working in polynomial time.
Theorem 6.
The class of
LimAvg-automata is not PAC-learnable with U-samples, E-samples or (E, k)-samples.

Sample fitting. Once we have a sample, the problem of finding an automaton fitting the sample can be solved in polynomial time in a trivial way: we create an automaton that is a tree such that every word of a given sample leads to a different leaf in this tree, and then we add loops with appropriate values in the leaves (similarly to a prefix tree acceptor [17] for finite automata). This solution leads to an automaton that overfits the sample, as it works well only for the sample and is unlikely to work well on words not included in the sample. Besides, the automaton is linear in the size of the sample, not in the size of the black-box system. For a fixed automaton we can construct arbitrarily large U-samples (or E-samples) consistent with it, and hence the gap between the size of such an automaton and the black-box system is arbitrarily large. To exclude such solutions, we restrict the size of the automaton to be constructed. We study the following problem.

[Figure 1: The canonical automaton A_φ from the proof of Theorem 8 (transition diagram omitted).]

Problem 7 (Sample fitting).
Given a sample S and n ∈ ℕ, construct a LimAvg-automaton with at most n states which is consistent with S.

The decision version of this problem only asks whether such an automaton exists. We show that this problem is NP-complete, regardless of the sample representation. For hardness, we reduce from a restricted variant of SAT, in which each clause of the CNF formula contains only positive literals or only negative literals; this variant remains NP-complete [17].

Theorem 8.
The sample fitting problem is NP-complete for U-samples.

Proof.
Membership in NP follows from the following observation: if n is greater than the total length of the sample, then we return yes, as the tree-like solution works. Otherwise, we non-deterministically pick an automaton of size n and check whether it fits the sample.

The NP-hardness proof is inspired by the construction from [17, Theorem 6.2.1]. For a given instance φ = C₁ ∧ ... ∧ C_n of the restricted SAT problem over variables x₁, ..., x_n (not all variables need to occur in φ), we construct a U-sample S_φ such that there is an automaton with n states fitting S_φ if and only if φ is satisfiable.

We fix the alphabet {a_i, c_i, d_i | i = 1, ..., n} ∪ {b, t}. The sample S_φ consists of:
S1: (c_i, a_j, x) for each i, j ∈ {1, ..., n}, where x is 1 if i = j, and 0 otherwise.
S2: (c_i, d_j, x) for each i, j ∈ {1, ..., n}, where x is 1 if x_i is in C_j, and 0 otherwise.
S3: (c_i b, d_i, 1) for each i ∈ {1, ..., n}.
S4: (c_i b, t, x) for each i ∈ {1, ..., n}, where x is 1 if the clause C_i contains only positive literals and 0 if it contains only negative literals.

Assume that φ is satisfiable and let σ : {x₁, ..., x_n} → {0, 1} be a satisfying valuation. Then, we construct an automaton A_φ consistent with the sample S_φ, starting from the structure presented in Figure 1. Then, we add the following transitions:
- for each i, a loop in q_i on the letter t with the value σ(x_i),
- for each i, j, a loop in q_i on the letter d_j with the value 1 if x_i is in C_j and 0 otherwise,
- for each clause C_i, if C_i is satisfied because of a variable x_j, then we add a transition from q_i to q_j over b (if there are multiple possible variables, we choose any).
The remaining transitions can be set in an arbitrary way. The obtained automaton A_φ is consistent with the sample S_φ.

Now assume that there is an automaton A with n states which is consistent with the sample. We show that the valuation σ such that σ(x_i) = L_A(c_i t^ω) satisfies φ. Let q_i be the state where the automaton A is after reading the word c_i. By S1, all the states q₁, ..., q_n are pairwise different. Since there are only n states, q₁, ..., q_n are all the states of A. Now consider any clause C_i. Let q_j be the state of A after reading c_i b. Notice that by S3, the value of d_i^ω in q_j is 1, and by S2, this means that x_j is in C_i. If C_i contains only positive literals, then the value of t^ω in q_j is 1 by S4, which means that σ(x_j) = 1 and that C_i is satisfied. The other case is symmetric. ◀

[Figure 2: A picture of the canonical automaton from the proof of Theorem 9. All the weights are 0 except for the transitions from the state q_T.]
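For illustration, here is a sketch (our own rendering, with clauses given as pairs of a variable set and a positivity flag, and words as tuples of letter names) of how the sample S_φ of the proof above could be assembled:

```python
def build_sample(n, clauses):
    """U-sample S_phi for a restricted CNF with n variables and n clauses;
    clauses[j] = (set_of_variable_indices, all_literals_positive)."""
    S = []
    for i in range(1, n + 1):                                     # S1
        for j in range(1, n + 1):
            S.append(((f"c{i}",), (f"a{j}",), 1 if i == j else 0))
    for i in range(1, n + 1):                                     # S2
        for j, (vs, _) in enumerate(clauses, start=1):
            S.append(((f"c{i}",), (f"d{j}",), 1 if i in vs else 0))
    for i, (vs, positive) in enumerate(clauses, start=1):
        S.append(((f"c{i}", "b"), (f"d{i}",), 1))                 # S3
        S.append(((f"c{i}", "b"), ("t",), 1 if positive else 0))  # S4
    return S

# (x1 or x2) and (not x1 or not x3) and (x3), over x1, x2, x3
print(len(build_sample(3, [({1, 2}, True), ({1, 3}, False), ({3}, True)])))
```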
Theorem 9. The sample fitting problem is NP-complete for E-samples.

Proof.
The proof is similar to the proof of Theorem 8. For the NP-hardness, the sample now is obtained from the sample in the proof of Theorem 8 by replacing every triple (u, v, x) by the pair (uv, x). However, now we ask for an automaton of size n + 2. If there is a valuation that satisfies a given set of clauses, then one can construct an automaton based on the one presented in Figure 2. On the other hand, if there is an automaton fitting the sample, then it has to have a state where the expected value of any word is 0, a state where the expected value of any word is 1, and n different states reachable by each of c₁, ..., c_n. The rest of the proof is virtually the same as in Theorem 8, except that now we define σ such that σ(x_i) is the expected value of words with the prefix c_i t. ◀

The above proofs also work with some natural relaxations of the sample fitting problem. For example, if we only require the automaton to fit the sample up to some ε < 1/2, then the proofs still hold, since we use only weights 0 and 1. Another relaxation, for the E-samples case, is to allow the automaton to give wrong values for some examples as long as the summarized probability of the examples with wrong values is less than some ε. However, since all the words are of length at most three, the probability of uΣ^ω for each example u is greater than 1/(3n + 2)³ (recall that |Σ| = 3n + 2), which means that for any ε < 1/(3n + 2)³, for every example some extension of it must fit and hence the whole sample must fit.

4 Active learning

In the active case, the learning algorithm can ask queries to an oracle, called the teacher, which has a (hidden) function L : Σ^ω → ℝ and answers two types of queries:
- expectation queries: given a finite word u, the teacher returns E(L | uΣ^ω);
- ε-consistency queries: given an automaton A, if L_A ε-approximates L (i.e., E(|L − L_A|) ≤ ε), the teacher returns YES; otherwise the teacher returns a word u such that |E(L | uΣ^ω) − E(L_A | uΣ^ω)| > ε.

Remark. Consider functions L₁, L₂ defined by LimAvg-automata. If E(|L₁ − L₂|) > ε, then due to Theorem 1 there is a word u such that E(|L₁ − L₂| | uΣ^ω) > ε and L₁ (resp., L₂) returns E(L₁ | uΣ^ω) (resp., E(L₂ | uΣ^ω)) on almost all words from uΣ^ω. Therefore, |E(L₁ | uΣ^ω) − E(L₂ | uΣ^ω)| = E(|L₁ − L₂| | uΣ^ω) > ε.
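The query interface can be pictured as follows. This is our own illustrative sketch, not the paper's construction; in particular, the randomized search for a counterexample is only one possible approximate realization of a consistency query:

```python
import random

class Teacher:
    """Answers expectation and epsilon-consistency queries for a hidden
    function given as cond_exp(u) = E(L | u Sigma^omega)."""

    def __init__(self, cond_exp, epsilon, sigma, trials=10_000, depth=50):
        self.cond_exp, self.epsilon = cond_exp, epsilon
        self.sigma, self.trials, self.depth = sigma, trials, depth

    def expectation(self, u):
        return self.cond_exp(u)

    def consistency(self, candidate_exp):
        """candidate_exp(u) is the candidate automaton's conditional
        expectation; returns a counterexample word, or None for YES."""
        for _ in range(self.trials):
            n = random.randrange(self.depth)
            u = "".join(random.choice(self.sigma) for _ in range(n))
            if abs(self.cond_exp(u) - candidate_exp(u)) > self.epsilon:
                return u
        return None
```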
In the active learning case we consider two problems: approximate learning and rigid approximate learning. We first define approximate learning.

Problem 10 (Approximate learning). Given ε ∈ ℚ⁺ ∪ {0} and a teacher with a (hidden) function L, construct a LimAvg-automaton A such that L_A ε-approximates L and A has the minimal number of states among such automata.

We define a decision problem, called approximate minimization, which can be solved in polynomial time given a polynomial-time approximate learning algorithm.
Problem 11 (Approximate minimization).
Given a
LimAvg-automaton A, n ∈ ℕ and ε ∈ ℚ⁺, the approximate minimization problem asks whether there exists a LimAvg-automaton A' with at most n states such that E(|L_A − L_{A'}|) ≤ ε.

An efficient learning algorithm can be used to efficiently compute an approximate minimization of a given LimAvg-automaton A: we can run it and compute the answers to the queries of the learning algorithm in time polynomial in |A| [4]. We show that the approximate minimization problem is NP-complete, which means that approximate learning cannot be done in polynomial time unless P = NP.
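Deciding whether a candidate ε-approximates A reduces to computing E(|L_A − L_{A'}|). Appendix A describes a bottom-SCC pairing argument for this; the following is our own value-iteration rendering of it, again assuming both automata have their bottom SCCs contracted to absorbing states:

```python
def expected_difference(d1, init1, d2, init2, alphabet, rounds=10_000):
    """E(|L_A - L_A'|) for two deterministic automata in contracted form:
    value iteration on the product automaton, where an absorbing pair
    contributes the absolute difference of its two values."""
    def absorbing(delta, q):
        return all(delta[(q, a)][0] == q for a in alphabet)
    def value(delta, q):
        return delta[(q, alphabet[0])][1]
    pairs = {(p, q) for (p, _) in d1 for (q, _) in d2}
    fixed = {(p, q) for (p, q) in pairs
             if absorbing(d1, p) and absorbing(d2, q)}
    D = {pq: (abs(value(d1, pq[0]) - value(d2, pq[1]))
              if pq in fixed else 0.0)
         for pq in pairs}
    for _ in range(rounds):
        for (p, q) in pairs - fixed:
            D[(p, q)] = sum(D[(d1[(p, a)][0], d2[(q, a)][0])]
                            for a in alphabet) / len(alphabet)
    return D[(init1, init2)]
```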
Theorem 12. The approximate minimization problem is NP-complete.

Proof sketch.
The problem is contained in NP, as we can non-deterministically pick an automaton with at most n states and check whether it ε-approximates A.

For a vector v ∈ ℝ^m we define ‖v‖ = Σ_{i=1}^m |v[i]|. For NP-hardness, consider the following problem. Binary k-Median Problem (BKMP): given numbers n, m, k, t ∈ ℕ and a set of Boolean vectors C = {v₁, ..., v_n} ⊆ {0, 1}^m, decide whether there are vectors u₁, ..., u_k ∈ {0, 1}^m and a partition D₁, ..., D_k of C such that Σ_{j=1}^k Σ_{v ∈ D_j} ‖v − u_j‖ ≤ t. BKMP has been shown NP-complete in [30]. To ease the reduction, we consider a variant of BKMP, called Modified BKMP, where we assume that the all-zeros vector 0 and the all-ones vector 1 belong to C; Modified BKMP is also NP-complete.

For C = {v₁, ..., v_n} ⊆ {0, 1}^m we define L_C over the alphabet Σ = {a₁, ..., a_m} as follows: for all a_i, a_j ∈ Σ and w ∈ Σ^ω,
- if i ≤ n, we set L_C(a_i a_j w) = v_i[j],
- if i ∈ {n + 1, ..., m}, we set L_C(a_i a_j w) = v_n[j].
Intuitively, we select a vector with the first letter and the vector's component with the second letter. This language can be defined with a tree-like LimAvg-automaton A_C.

We show that if an instance (C, k, t) of Modified BKMP has a solution u₁, ..., u_k and D₁, ..., D_k, then there exists an automaton A' with k + 1 states that t/(nm)-approximates L_C. Let v₁ = u₁ = 0 and v₂ = u₂ = 1. The automaton A' consists of the initial state q₀ and the successors of the initial state, q₁, ..., q_k, which correspond to the vectors u₁, ..., u_k, i.e., q₁ is a bottom SCC of value 0, q₂ is a bottom SCC of value 1, and for i > 2, the state q_i encodes u_i via δ(q_i, a_j) = q₁ if u_i[j] = 0 and δ(q_i, a_j) = q₂ otherwise. The successors of q₀ are defined based on the partition D₁, ..., D_k, i.e., if v_i belongs to D_j, then δ(q₀, a_i) = q_j. Observe that A' ε-approximates A_C.

Conversely, consider an automaton A' with k + 1 states that t/(nm)-approximates L_C. We define vectors p₁, ..., p_m ∈ ℝ^m such that p_i[j] = E(A' | a_i a_j Σ^ω). The structure of Modified BKMP implies that the initial state of A' has no self-loops and hence has at most k different successor states. Therefore, there are at most k different vectors among p₁, ..., p_m. Finally, we observe that, since we consider ‖·‖, w.l.o.g. we can assume that p₁, ..., p_m ∈ {0, 1}^m. Therefore, these vectors give us a solution to the instance (C, k, t) of Modified BKMP. ◀

One of the drawbacks of the standard approximation is that the counterexamples may be dubious, if not useless. We illustrate this with an example.
Example 13 (Dubious counterexamples). Consider a minimal DFA B with 2ⁿ states whose language consists of words of length m, for some m > n, and a word v ∈ Σⁿ. We define a function L_{B,v} : {a, b}^ω → ℝ such that for all u, w:
- L_{B,v}(auw) = 0 if |u| = m and B accepts u,
- L_{B,v}(auw) = 0.3 if |u| = m and B rejects u,
- L_{B,v}(bw) is 0.4 if the first occurrence of v in w is followed by a, and 1 otherwise.

Fix ε = 0.1. Observe that L_{B,v} can be ε-approximated by an automaton A which is faithful to L_{B,v} on bΣ^ω and returns 0.15 for all other words. A has n + O(1) states.

Assume that the teacher gives only counterexamples starting with a, and hence ε-consistency queries do not give any information about the values of words starting with b. The teacher can do this as long as the algorithm does not know the whole B, which takes Ω(|B|) queries to learn. Yet even if the algorithm learns the whole B and returns the best single constant on words starting with b, the expected difference is 0.5 · 0.3 = 0.15 > ε. It follows that to learn the approximation, the algorithm needs to learn something about v.

Suppose that the algorithm did not learn the whole B. Then, to learn something non-trivial about words starting with b, it has to ask an expectation query containing v. Since the learning algorithm is deterministic, it asks the same expectation queries for different words v. Therefore, for every learning algorithm there are words v that can be learned only after asking queries of total length 2^{Ω(|v|)}.

It follows that any learning algorithm has to ask queries of total length Ω(|B|) or 2^{Ω(|v|)}, which totals to 2^{Ω(n)}.

In Example 13 we assumed an antagonistic teacher, which misleads the algorithm on purpose. But even with a stochastic teacher, it is not known whether "fixing" a given random counterexample is a step towards a better approximation. To resolve this issue, we consider a stronger notion of approximation, called rigid approximation, where we require all conditional expected values to be ε-close, i.e., for all words u ∈ Σ* we have E(|L − L'| | uΣ^ω) ≤ ε. In this framework counterexamples are certain, i.e., if for some u ∈ Σ* the expectation over uΣ^ω is more than ε off the intended value, it has to be modified. Formally, we define the problem as follows.

Problem 14 (Rigid approximate learning).
Given ε ∈ ℚ⁺ and a teacher with a (hidden) function L, construct an automaton A such that L_A is a rigid ε-approximation of L and A has the minimal number of states among such automata.

Even though the counterexamples are certain in this framework, we observe that this does not eliminate ambiguity. For instance, there can be multiple automata with the minimal number of states.

[Figure 3: The automaton A (values 0, 0.5, 1 on a, b, c) with two minimal non-equivalent rigid approximations A₁ and A₂, and the automaton A_exp approximated by exponentially many non-equivalent automata.]

Example 15 (Non-unique minimization). Consider the automaton A depicted in Figure 3 and ε = 1/4. Any automaton A' which is a rigid ε-approximation of L_A has at least two bottom SCCs and hence requires at least 3 states. Therefore the automata A₁, A₂ depicted in Figure 3, which ε-approximate L_A, have the minimal number of states. This shows that there can be multiple correct answers in the rigid approximate learning problem. Based on this example, we construct the automaton A_exp, parametrized by n ∈ ℕ and depicted in Figure 3, which has O(n) states and for which there are exponentially many (in n) minimal non-equivalent automata that are rigid 1/n-approximations of L_{A_exp}.

As in the approximate learning case, efficient rigid approximate learning enables us to solve efficiently the following rigid approximate minimization problem.

Problem 16 (Rigid approximate minimization).
Given a
LimAvg-automaton A, n ∈ ℕ and ε ∈ ℚ⁺, construct a LimAvg-automaton A' with at most n states such that for all words u ∈ Σ* we have E(|L_A − L_{A'}| | uΣ^ω) ≤ ε.

First, we consider a naive approach to the rigid approximate minimization problem based on state merging. We start with an input automaton A and merge its states, maintaining the property that the automaton with merged states A' rigidly ε-approximates L_A. We terminate if the automaton is minimal, i.e., merging any two states of A' violates the property. However, it may happen that a state q can be merged either with q₁ or with q₂, but not with both. We show in the following example that the choice of merged states can have a profound impact on the size of a minimal automaton.

Example 17 (Minimal automata of different size). Assume n ∈ ℕ and the function L that returns 0 on words from aΣⁿaΣ^ω, 1 on words from bΣⁿbΣ^ω, and 0.5 on the remaining words. Let A be a minimal automaton defining such a language, as depicted in Figure 4.

Let ε = 1/4. To minimize A, we can merge it to the automaton A_S with 3 states or to A_L with n + 3 states. Observe that A_S and A_L are rigid ε-approximations of L_A. For A_S, it is because for every word au we have E(A | auΣ^ω) ∈ {0, 0.5} and hence it is ε-close to 0.25, and symmetrically for words bu and 0.75. For A_L, it is because for every u of length at least n + 2 the expected value of A_L is 0.75 or 0.25, depending on the (n + 2)-th letter of u, whereas for A it is in {0.5, 1} or in {0, 0.5}, respectively. For shorter u, we simply observe that the difference of expected values w.r.t. a word does not exceed the maximal difference in its suffixes.

The automaton A_L is minimal, as there are no states that can be merged. Therefore, the difference, and even the ratio, between the sizes of both automata is unbounded.

We show that the rigid approximate minimization problem is NP-complete, which implies that there is no polynomial-time rigid approximate learning algorithm (unless P = NP).
[Figure 4: Minimal automata of different size.]
Theorem 18.
The rigid approximate minimization problem is NP-complete.

Proof sketch.
The rigid approximate minimization problem is in NP. We show that it is NP-hard. We define a problem which is an intermediate step in our reduction. Given n, k ∈ ℕ and vectors C = {v₁, ..., v_n} ⊆ {0, 1/2, 1}ⁿ, the 1/4-vector cover problem asks whether there exist u₁, ..., u_k ∈ ℝⁿ such that for every vector v ∈ C there is j such that ‖v − u_j‖_∞ ≤ 1/4.

The 1/4-vector cover problem is NP-complete; the NP-hardness is via a reduction from the dominating set problem. We show that the 1/4-vector cover problem reduces to the rigid approximate minimization problem.

Let C = {v₁, ..., v_n} ⊆ {0, 1/2, 1}ⁿ. We define L_C over the alphabet Σ = {a₁, ..., a_n} such that for all a_i, a_j ∈ Σ and w ∈ Σ^ω we have L_C(a_i a_j w) = v_i[j]. Such L_C can be defined with a tree-like LimAvg-automaton A_C.

Let ε = 1/4. We show that an instance (n, k, C) of the 1/4-vector cover problem has a solution if and only if A_C has a rigid ε-approximation with k + 3 states.

Assume that there is an automaton A' with k + 3 states such that for all words u ∈ Σ* we have E(|L_{A_C} − L_{A'}| | uΣ^ω) ≤ ε. Observe that the initial state of A' has at most k different successors. To see that, consider the functions L_u(w) = L_{A_C}(uw) defined for u ∈ Σ*. If for some u₁, u₂, u ∈ Σ* we have E(|L_{u₁u} − L_{u₂u}|) > 2ε, then the words u₁ and u₂ lead (from the initial state) to different states in A'. It follows that A' has at least 3 states that are not successors of the initial state: the initial state and (at least two) states that correspond to bottom SCCs. We define the vectors u₁, ..., u_k from the successors of the initial state of A'. Formally, for i ∈ {1, ..., n} we define a vector y_i as y_i[j] = E(L_{A'} | a_i a_j Σ^ω). There are at most k different vectors among y₁, ..., y_n, and we take these distinct vectors as u₁, ..., u_k. The condition E(|L_{A_C} − L_{A'}| | uΣ^ω) ≤ ε implies that u₁, ..., u_k form a solution to (n, k, C).

Conversely, assume that the instance (n, k, C) has a solution. Observe that w.l.o.g. we can assume that the solution vectors u₁, ..., u_k belong to {0.25, 0.75}ⁿ. Based on this solution we define A' with k + 3 states q₀, q₁, ..., q_k, s₁, s₂ such that q₀ is initial, q₁, ..., q_k are the successors of q₀, and s₁, s₂ are single-state bottom SCCs of values 0.25 and 0.75. The states q₁, ..., q_k correspond to the vectors u₁, ..., u_k, i.e., the successors of q_i encode u_i via δ(q_i, a_j) = s₁ if u_i[j] = 0.25 and δ(q_i, a_j) = s₂ otherwise. The successors of q₀ are defined based on the matching from the vector cover, i.e., if δ(q₀, a_i) = q_j, then u_j is 1/4-close to v_i. Observe that A' is a rigid 1/4-approximation of A_C. ◀

Almost-exact learning. The almost-exact learning problem is defined as follows.
Problem 19 (Almost-exact learning).
Given a teacher with a (hidden) function L, construct a LimAvg-automaton A such that L_A is a 0-approximation of L.

Notice that for functions L and L_A, the following conditions are equivalent:
- L_A is a 0-approximation of L,
- L_A is a rigid 0-approximation of L,
- P({w : L_A(w) ≠ L(w)}) = 0,
- P({w : L_A(uw) ≠ L(uw)}) = 0 for each u.

We show that there is a polynomial-time algorithm for almost-exact learning.

Theorem 20.
The almost-exact learning problem for
LimAvg-automata can be solved in polynomial time in the size of the minimal automaton that is almost equivalent to the target function L and in the maximal length of the counterexamples returned by the teacher.

Proof.
We define a relation ≡_L on finite words u, v ∈ Σ* as follows: u ≡_L v if and only if P({w : L(uw) ≠ L(vw)}) = 0. A quick check shows that ≡_L is an equivalence relation. We show that ≡_L is a right congruence, i.e., if u ≡_L v, then for all a we have ua ≡_L va. Indeed, consider X₁ = {w : L(uaw) ≠ L(vaw)} and X₂ = {w : L(uw) ≠ L(vw)}. Note that u ≡_L v implies P(X₂) = 0. For all w, if w ∈ X₁, then aw ∈ X₂. It follows that (under the uniform distribution) P(X₁) ≤ |Σ| · P(X₂) = 0. Thus, P(X₁) = 0 and ua ≡_L va.

We now show a counterpart of the Myhill-Nerode theorem: there is a LimAvg-automaton with n states defining almost-exactly L if and only if the index of ≡_L is at most n.

If A defines almost-exactly L, then the index of ≡_L is bounded by the number of states of A. Indeed, if for words u, v the automaton A, starting from the initial state, ends up in the same state, then for all words w we have L_A(uw) = L_A(vw) and hence u ≡_L v. Conversely, assume that ≡_L has a finite index. Then, we construct a LimAvg-automaton A_{≡_L} corresponding to ≡_L. The states of A_{≡_L} are the equivalence classes [u] of ≡_L, and A_{≡_L} has a transition from [u] to [v] over a if and only if [ua] = [v]. Observe that, due to Theorem 1, if [u] and [v] are in a common bottom SCC, then [u] = [v]. Then, for every bottom SCC [u] we assign to all its outgoing transitions (which are self-loops) the value E(L | uΣ^ω). For the remaining transitions we set the weights to 0 (due to Theorem 1 these weights are irrelevant, as they do not change any of the expected values). Observe that A_{≡_L} computes L.

A classical result of [1] states that DFA can be learned in polynomial time using membership and equivalence queries. We adapt this result here. The learning algorithm for LimAvg-automata maintains a pair (Q, T), where Q, the set of access words, contains representatives of different classes of the right congruence relation, and T, the set of test words, contains words that approximate the right congruence relation. T defines the relation ≡_T such that u₁ ≡_T u₂ if and only if for all v ∈ T we have P({w | L(u₁vw) ≠ L(u₂vw)}) = 0.

The algorithm maintains two properties: separability, i.e., all distinct words u₁, u₂ ∈ Q belong to different equivalence classes of ≡_T; and closedness, i.e., for all u₁ ∈ Q and a ∈ Σ there is u₂ ∈ Q with u₁a ≡_T u₂. Each separable and closed pair (Q, T) defines a LimAvg-automaton A_{Q,T}, which can be tested for 0-consistency against the teacher's function. If the teacher provides a counterexample u, then it is used to extend Q and T. To do so, we split u into v₁a·v₂, where v₁a is the minimal prefix of u on which the expected values diverge, i.e., E(L | v₁av₂Σ^ω) ≠ E(A_{Q,T} | v₁av₂Σ^ω). Let v̂₁ be a word from Q that is ≡_T-equivalent to v₁. We take Q' = Q ∪ {v̂₁a}, T' = T ∪ {v₂}, close (Q', T') using expectation queries, and test the equivalence again. We repeat this until we obtain a LimAvg-automaton defining a function almost equivalent to L.

The proof of correctness of this algorithm is a straightforward modification of the proof of correctness of the algorithm from [1], and thus it is omitted. ◀
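To illustrate the loop, here is a condensed sketch in the style of Angluin's algorithm. It is our own rendering: equality of teacher expectations stands in for the measure-zero test defining ≡_T, the hypothesis is kept as a plain transition dictionary, and the teacher object with `expectation` and `consistency` methods is an assumed interface (see the earlier Teacher sketch).

```python
def learn(sigma, teacher):
    Q, T = [""], [""]                    # access words and test words

    def row(u):                          # signature of u on the tests in T
        return tuple(teacher.expectation(u + t) for t in T)

    while True:
        # close (Q, T): every one-letter extension of an access word must
        # share its signature with some access word
        rows = {row(u) for u in Q}
        fresh = [u + a for u in Q for a in sigma if row(u + a) not in rows]
        if fresh:
            Q.append(fresh[0])
            continue
        hypothesis = {                   # automaton A_{Q,T}; states = rows
            "init": row(""),
            "delta": {(row(u), a): row(u + a) for u in Q for a in sigma},
            "out": {row(u): teacher.expectation(u) for u in Q},
        }
        cex = teacher.consistency(hypothesis)
        if cex is None:
            return hypothesis
        # counterexample processing: find the minimal prefix v1*a whose
        # replacement by an equivalent access word changes the answer
        def surrogate(i):                # access word equivalent to cex[:i]
            r = row(cex[:i])
            return next(u for u in Q if row(u) == r)
        for i in range(1, len(cex) + 1):
            if (teacher.expectation(surrogate(i) + cex[i:]) !=
                    teacher.expectation(surrogate(i - 1) + cex[i - 1:])):
                Q.append(surrogate(i - 1) + cex[i - 1])   # v1^ followed by a
                T.append(cex[i:])                          # v2
                break
```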
5 Learning over different distributions

So far we have only discussed the uniform distribution on words. Here we briefly discuss whether Theorem 20 can be generalized to arbitrary distributions, represented by Markov chains. We assume that the Markov chain is given to the learning algorithm in the input.

Markov chains.
A (finite-state discrete-time)
Markov chain is a tuple ⟨Σ, S, s₀, E⟩, where Σ is the alphabet of letters, S is a finite set of states, s₀ is an initial state, and E : S × Σ × S → [0, 1] is a probabilistic transition function: every s ∈ S satisfies Σ_{a ∈ Σ, s' ∈ S} E(s, a, s') = 1.

The probability of a finite word u w.r.t. a Markov chain M, denoted by P_M(u), is the sum of the probabilities of the paths from s₀ labeled by u, where the probability of a path is the product of the probabilities of its edges. For basic open sets u·Σ^ω = { uw | w ∈ Σ^ω }, we have P_M(u·Σ^ω) = P_M(u), and the probability measure over infinite words defined by M is the unique extension of the above measure [19]. We denote the unique probability measure defined by M by P_M.

Observe that the uniform distribution can be expressed with a (single-state) Markov chain, and hence all the lower bounds from Section 3 and Section 4 still hold.
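For instance, P_M(u) can be evaluated by the standard forward dynamic program (a sketch under our own dictionary encoding of E):

```python
def word_probability(E, s0, u):
    """P_M(u): sum over all paths from s0 labeled by u of the product of
    edge probabilities; E[(s, a, t)] is the probability of that edge."""
    dist = {s0: 1.0}
    for a in u:
        new = {}
        for s, p in dist.items():
            for (s1, a1, t), q in E.items():
                if s1 == s and a1 == a:
                    new[t] = new.get(t, 0.0) + p * q
        dist = new
    return sum(dist.values())

# a single-state chain expressing the uniform distribution over {a, b}
E = {("s", "a", "s"): 0.5, ("s", "b", "s"): 0.5}
print(word_probability(E, "s", "ab"))   # 0.25 = |Sigma|^{-2}
```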
Non-vanishing Markov chains. A Markov chain M is non-vanishing if for all words u ∈ Σ* we have P_M(uΣ^ω) > 0. The almost-exact active learning over distributions given by non-vanishing Markov chains can be solved in polynomial time using Theorem 20: for every measurable set X and non-vanishing Markov chain M, we have P_M(X) > 0 if and only if P(X) > 0. Thus, almost-exact learning over M and over the uniform distribution coincide.

General Markov chains.
The proof for the uniform distribution does not extend to vanishing Markov chains, because the relation ≡_L is then not a right congruence. This cannot be simply fixed, as we show that learning cannot be done in polynomial time unless P = NP. We define the almost-exact minimization problem as an instance of the approximate minimization problem with ε = 0. Having a polynomial-time algorithm for almost-exact learning, we could solve the almost-exact minimization problem in polynomial time.

Theorem 21.
The almost-exact minimization problem for
LimAvg-automata under distributions given by Markov chains is NP-complete.

Proof.
The problem is in NP, as we can non-deterministically pick an automaton A' and check in polynomial time whether the LimAvg-automata A and A' are almost equivalent w.r.t. a given Markov chain.

We reduce the sample fitting problem, which is NP-complete for E-samples (Theorem 9), to the almost-exact minimization problem. Consider a finite E-sample S based on words u₁, ..., u_k. Let M_S be a Markov chain which assigns probability 1/k to u_iΣ^ω, for each i, and 0 to words not starting with any of the u_i. On each u_iΣ^ω, M_S defines the uniform distribution. Let A_S be a tree-like LimAvg-automaton consistent with S. Both M_S and A_S are of polynomial size in S. Observe that every automaton A which is consistent with S (according to the uniform distribution over infinite words) is almost equivalent to A_S (over P_{M_S}), and vice versa. Therefore, there is an automaton with n states almost equivalent to A_S (under the distribution P_{M_S}) if and only if the sample fitting problem with S and n has a solution. ◀

References
[1] Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, 1987.
[2] Dana Angluin and Dongqu Chen. Learning a random DFA from uniform strings and state information. In ALT 2015, pages 119–133, Cham, 2015. Springer International Publishing.
[3] Dana Angluin and Dana Fisman. Learning regular omega languages. Theor. Comput. Sci., 650:57–72, 2016.
[4] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, 2008.
[5] Borja Balle and Mehryar Mohri. Learning weighted automata. In CAI 2015, pages 1–21, 2015.
[6] Borja Balle and Mehryar Mohri. On the Rademacher complexity of weighted automata. In ALT 2015, pages 179–193, 2015.
[7] Borja Balle and Mehryar Mohri. Generalization bounds for learning weighted automata. Theor. Comput. Sci., 716:89–106, 2018.
[8] Borja Balle, Prakash Panangaden, and Doina Precup. A canonical form for weighted automata and applications to approximate minimization. In LICS 2015, pages 701–712, 2015.
[9] Amos Beimel, Francesco Bergadano, Nader Bshouty, Eyal Kushilevitz, and Stefano Varricchio. Learning functions represented as multiplicity automata. Journal of the ACM, 47, 1999.
[10] Michael Benedikt, Gabriele Puppis, and Cristian Riveros. Regular repair of specifications. In LICS 2011, pages 335–344, 2011.
[11] Udi Boker, Krishnendu Chatterjee, Thomas A. Henzinger, and Orna Kupferman. Temporal specifications with accumulative values. ACM Trans. Comput. Log., 15(4):27:1–27:25, 2014.
[12] Benedikt Bollig, Peter Habermehl, Carsten Kern, and Martin Leucker. Angluin-style learning of NFA. In IJCAI 2009, pages 1004–1009, 2009.
[13] Patricia Bouyer, Nicolas Markey, and Raj Mohan Matteplackel. Averaging in LTL. In CONCUR 2014, pages 266–280, 2014.
[14] Pavol Cerný, Thomas A. Henzinger, and Arjun Radhakrishna. Quantitative abstraction refinement. In POPL 2013, pages 115–128, 2013.
[15] Krishnendu Chatterjee, Laurent Doyen, and Thomas A. Henzinger. Quantitative languages. ACM TOCL, 11(4):23, 2010.
[16] Krishnendu Chatterjee, Thomas A. Henzinger, and Jan Otop. Quantitative automata under probabilistic semantics. In LICS 2016, pages 76–85, 2016.
[17] Colin de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, New York, NY, USA, 2010.
[18] Manfred Droste, Werner Kuich, and Heiko Vogler. Handbook of Weighted Automata. Springer, 1st edition, 2009.
[19] W. Feller. An Introduction to Probability Theory and Its Applications. Wiley, 1971.
[20] Dana Fisman. Inferring regular languages and ω-languages. J. Log. Algebr. Meth. Program., 98:27–49, 2018.
[21] E. Mark Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.
[22] Amaury Habrard and José Oncina. Learning multiplicity tree automata. In ICGI 2006, pages 268–280, 2006.
[23] Thomas A. Henzinger and Jan Otop. Model measuring for discrete and hybrid systems. Nonlinear Analysis: Hybrid Systems, 23:166–190, 2017.
[24] Jeffrey C. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. Journal of Computer and System Sciences, 55(3):414–440, 1997.
[25] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67–95, 1994.
[26] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[27] Kevin J. Lang. Random DFA's can be approximately learned from sparse uniform examples. In COLT 1992, pages 45–52, 1992.
[28] Ines Marusic and James Worrell. Complexity of equivalence and learning for multiplicity tree automata. Journal of Machine Learning Research, 16:2465–2500, 2015.
[29] Jakub Michaliszyn and Jan Otop. Non-deterministic weighted automata on random words. In CONCUR 2018, volume 118 of LIPIcs, pages 10:1–10:16, 2018.
[30] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 335–346. Springer, 2006.
[31] Joshua Moerman, Matteo Sammartino, Alexandra Silva, Bartek Klin, and Michal Szynwelski. Learning nominal automata. In POPL 2017, pages 613–625, 2017.
[32] José Oncina and Pedro Garcia. Inferring regular languages in polynomial updated time. In Pattern Recognition and Image Analysis: Selected Papers from the IVth Spanish Symposium, pages 49–61. World Scientific, 1992.
[33] Leonard Pitt. Inductive inference, DFAs, and computational complexity. In International Workshop on Analogical and Inductive Inference, pages 18–44. Springer, 1989.
[34] Leonard Pitt and Manfred K. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the ACM (JACM), 40(1):95–142, 1993.
[35] Leslie G. Valiant. A theory of the learnable. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 436–445. ACM, 1984.
Appendix

A NP-completeness of approximate minimization
Theorem 12.
The approximate minimization problem is NP-complete.

Proof.
The problem is contained in NP, as we can non-deterministically pick an automaton A' with n states and check whether it ε-approximates A. This can be done in polynomial time in the following way. For all bottom SCCs C of A and D of A', we compute the probability p_{C,D} of the set of infinite words w such that A reaches C over w and A' reaches D over w. Let x_C be the expected value of C in A, which is attained by almost all words that reach C, and let y_D be the corresponding value for D in A'. The values p_{C,D}, x_C, y_D can be computed in polynomial time [4]. Then, the expected difference E(|L_A − L_{A'}|) is the sum over all bottom SCCs C of A and D of A' of p_{C,D} · |x_C − y_D|.

For NP-hardness, consider the following problem. Given a vector v ∈ ℝ^m, we define ‖v‖ = Σ_{i=1}^m |v[i]|.

Binary k-Median Problem (BKMP): given numbers n, m, k, t ∈ ℕ and a set of Boolean vectors C = {v₁, ..., v_n} ⊆ {0, 1}^m, decide whether there are vectors u₁, ..., u_k ∈ {0, 1}^m and a partition D₁, ..., D_k of C such that Σ_{j=1}^k Σ_{v ∈ D_j} ‖v − u_j‖ ≤ t.

BKMP has been shown NP-complete in [30]. To ease the reduction, we consider a variant of BKMP.

Modified Binary k-Median Problem (Modified BKMP): given numbers n, m, h, k, t ∈ ℕ, with h > max(2t, m, n), and a set of Boolean vectors C = {v₁, ..., v_n} ⊆ {0, 1}^{2m+2h} which satisfies the following conditions: (C1) C contains the all-zeros vector 0 and the all-ones vector 1, (C2) each vector v ∈ C \ {0, 1} has as many 0's as 1's, and (C3) for every vector v ∈ C \ {0, 1} we have v[2m + 1] = ... = v[2m + h] = 1 and v[2m + h + 1] = ... = v[2m + 2h] = 0; decide whether there are vectors u₁, ..., u_k ∈ {0, 1}^{2m+2h}, including 0 and 1, and a partition D₁, ..., D_k of C such that Σ_{j=1}^k Σ_{v ∈ D_j} ‖v − u_j‖ ≤ t.

We reduce BKMP to Modified BKMP. Consider an instance (C, k, t) of BKMP. For every vector v ∈ {0, 1}^m we define v* ∈ {0, 1}^{2m+2h} such that v*[1, m] = v, v*[m + 1, 2m] = 1 − v, v*[2m + 1, 2m + h] = 1 and v*[2m + h + 1, 2m + 2h] = 0. Then, we define C* = {v* | v ∈ C} ∪ {0, 1} and an instance (C*, k + 2, t) of Modified BKMP. Observe that if (C, k, t) has a solution, then (C*, k + 2, t) has a solution as well, which satisfies the additional assumptions above. Conversely, if (C*, k + 2, t) has a solution, then this solution contains 0 and 1, and for every v* ∈ C* we have ‖v* − 0‖ = ‖v* − 1‖ = m + h > t. It also follows that in the partition D₁, ..., D_{k+2}, the two sets that correspond to 0 and 1 contain only these vectors. Therefore, a solution of (C*, k + 2, t) can be transformed into a solution of (C, k, t).

Let C = {v₁, ..., v_n} ⊆ {0, 1}^{2m+2h}. Let M = 2m + 2h and N = (M − n)/2 (so that M = n + 2N), and assume that n is even. We know that h > t and hence N > t.

We define L_C over the alphabet Σ = {a₁, ..., a_M} as follows: for all a_i, a_j ∈ Σ, w ∈ Σ^ω,
- if i ≤ n, we set L_C(a_i a_j w) = v_i[j],
- if i ∈ {n + 1, ..., n + N}, we set L_C(a_i a_j w) = 0,
- if i ∈ {n + N + 1, ..., M}, we set L_C(a_i a_j w) = 1.
This language can be defined with a tree-like LimAvg-automaton A_C. In this automaton, the states reached over a₁, ..., a_n, i.e., the states δ(q₀, a₁), ..., δ(q₀, a_n), correspond to the vectors v₁, ..., v_n via E(A_C | a_i a_j Σ^ω) = v_i[j]. The vectors 0 and 1 are each represented N times.

First, consider a solution u₁, ..., u_k and D₁, ..., D_k for (C, k, t), with u₁ = 0 and u₂ = 1.
We construct an automaton A_D with an initial state q₀ and k states q₁, ..., q_k. First, for all i ∈ {1, ..., M}, we set δ(q₀, a_i) = q_j if the vector represented by a_i (that is, v_i for i ≤ n, the vector 0 for i ∈ {n + 1, ..., n + N}, and the vector 1 for i > n + N) belongs to D_j. Next, q₁ and q₂ correspond to the vectors 0 and 1: for every a_i ∈ Σ we set δ(q₁, a_i) = q₁ with weight 0 and δ(q₂, a_i) = q₂ with weight 1. Finally, for all i > 2 and a_j ∈ Σ, we define δ(q_i, a_j) = q₁ if u_i[j] = 0 and δ(q_i, a_j) = q₂ otherwise. Observe that Σ_{j=1}^k Σ_{v ∈ D_j} ‖v − u_j‖ ≤ t implies that A_D is a t/(nM)-approximation of L_C.

Conversely, assume that A' is the optimal approximation of L_C among automata with k + 1 states, i.e., A' ε-approximates L_C, where ε ≤ t/(nM), and no automaton with at most k + 1 states approximates L_C with a smaller ε.

Consider vectors p₁, ..., p_M ∈ ℝ^M defined as follows: for all i, j ∈ {1, ..., M} we put p_i[j] = E(A' | a_i a_j Σ^ω). Since A' t/(nM)-approximates L_C, we know that

Σ_{i=1}^{n} ‖p_i − v_i‖ + Σ_{i=n+1}^{n+N} ‖p_i − 0‖ + Σ_{i=n+N+1}^{M} ‖p_i − 1‖ ≤ tM/n. (*)

We need to show that there are at most k different vectors among the p_i and that among these vectors there are 0 and 1. First, w.l.o.g. all states of A' are reachable from the initial state with some word of length at most 2. Indeed, we can change states which are in distance 2 from the initial state into bottom SCCs with values being the expected values from the old states. Second, the initial state cannot be a bottom SCC, and hence all bottom SCCs are in distance 1 or 2 from the initial state. We can change the values of all bottom SCCs to 0 or 1. To see that, observe that we can take the values of all bottom SCCs and consider these values as variables x₁, ..., x_l. Then, these values appear in convex combinations in (*). However, since all constants in (*) are 0 and 1, the left-hand side of (*) is minimized under a substitution mapping x₁, ..., x_l to {0, 1}.

Knowing that all bottom SCCs have values 0 and 1, we observe that we need precisely two bottom SCCs s₁, s₂ of values 0 and 1. It only improves the approximation if s₁ is the successor of the initial state over a_{n+1}, ..., a_{n+N} and s₂ is the successor of the initial state over a_{n+N+1}, ..., a_{n+2N}. Therefore, p_{n+1} = ... = p_{n+N} = 0 and p_{n+N+1} = ... = p_M = 1. (**)

Suppose that the initial state q₀ of A' has a self-loop over a_i. We know that i ≤ n, as otherwise we would have E(A') ∈ {0, 1} while E(L_C) = 1/2; a contradiction. Consider the vector y defined as y[j] = E(A' | a_i a_j Σ^ω). Observe that due to (**) we have y[n + 1] = ... = y[n + N] = 0 and y[n + N + 1] = ... = y[n + 2N] = 1. Condition (C3) of Modified BKMP implies that ‖y − v_i‖ ≥ h, while condition (C2) implies that ‖0 − v_i‖ = ‖1 − v_i‖ = M/2. Since h > max(2t, m, n), changing the self-loop (q₀, a_i, q₀) into a transition (q₀, a_i, q') to the state q' that corresponds to 0 improves the approximation of L_C; a contradiction. It follows that q₀ has at most k successors.

Then, p₁, ..., p_n give us a solution to C. First, observe that we can replace p₁, ..., p_n by vectors from {0, 1}^M without increasing the left-hand side of (*); hence, we assume that they belong to {0, 1}^M. Second, we select the distinct vectors among p₁, ..., p_n as u₁, ..., u_k and define D_i = {v_j | p_j = u_i}. Then, (*) implies that this is a solution to (C, k, t). ◀

B NP-completeness of rigid approximate minimization
▶ Theorem 18. The rigid approximate minimization problem is NP-complete.

Proof sketch.
The problem is in NP: we can non-deterministically pick an automaton $A_2$ with $n$ states and check whether it is a rigid $\epsilon$-approximation of $A_1$. Observe that given two automata $A_1, A_2$, we can check whether $A_2$ is a rigid $\epsilon$-approximation of $A_1$ in polynomial time. Indeed, let $P$ be the set of pairs of states defined as follows: $(q, s) \in P$ if and only if $q$ is a state of $A_1$, $s$ is a state of $A_2$, and both are reached in the respective automata from the (respective) initial states over a common word $u$. Observe that $A_2$ is a rigid $\epsilon$-approximation of $A_1$ if and only if for every pair $(q, s) \in P$ the automaton $A_2$ starting from $s$ $\epsilon$-approximates $A_1$ starting from the state $q$. We have shown in the proof of Theorem 12 (in Appendix A) that we can decide $\epsilon$-approximation in polynomial time. Therefore, we can decide rigid $\epsilon$-approximation in polynomial time as well.

We show that the problem is NP-hard. To ease the presentation, we define the following $\frac{1}{2}$-vector cover problem, which is an intermediate step in our reduction.

The $\frac{1}{2}$-vector cover problem: given $n, k \in \mathbb{N}$ and vectors $\mathcal{C} = \{\vec{v}_1, \ldots, \vec{v}_n\} \subseteq \{0, \frac{1}{2}, 1\}^n$, decide whether there exist $\vec{u}_1, \ldots, \vec{u}_k \in \mathbb{R}^n$ such that for every vector $\vec{v} \in \mathcal{C}$ there is $j$ such that $\|\vec{v} - \vec{u}_j\|_\infty \leq \frac{1}{4}$.

The $\frac{1}{2}$-vector cover problem is related to BKMP presented in the proof of Theorem 12. We show two reductions, which together establish NP-hardness.

The $\frac{1}{2}$-vector cover problem reduces to the rigid approximate minimization problem. Let $\mathcal{C} = \{\vec{v}_1, \ldots, \vec{v}_n\} \subseteq \{0, \frac{1}{2}, 1\}^n$. We define $L_\mathcal{C}$ over the alphabet $\Sigma = \{a_1, \ldots, a_n\}$ such that for all $a_i, a_j \in \Sigma$ and $w \in \Sigma^\omega$ we have $L_\mathcal{C}(a_i a_j w) = \vec{v}_i[j]$. Such an $L_\mathcal{C}$ can be defined by a tree-like LimAvg-automaton $A_\mathcal{C}$, defined as follows. It has three single-state bottom SCCs $p_0, p_{0.5}, p_1$ with the expected values $0, \frac{1}{2}, 1$. From the initial state $q_0$, the automaton $A_\mathcal{C}$ moves over the letters $a_i$ to $n$ different states. Then, over any two-letter word $a_i a_j$, the automaton $A_\mathcal{C}$ ends up in the single-state bottom SCC of the value $\vec{v}_i[j]$. Therefore, this automaton has $n + 4$ states: the initial state $q_0$, the $n$ different successors $s_1, \ldots, s_n$ of $q_0$, and the states $p_0, p_{0.5}, p_1$.

Let $\epsilon = \frac{1}{4}$. We show that an instance $(n, k, \mathcal{C})$ of the $\frac{1}{2}$-vector cover problem has a solution if and only if $A_\mathcal{C}$ has a rigid $\epsilon$-approximation with $k + 3$ states.

First, assume that there is an automaton $A$ that has at most $k + 3$ states and for all words $u \in \Sigma^*$ we have $\mathbb{E}(|L_{A_\mathcal{C}} - L_A| \mid u\Sigma^\omega) \leq \epsilon$. Observe that the initial state of $A$ has at most $k$ different successors. To see this, consider the functions $L_u(w) = L_{A_\mathcal{C}}(uw)$ defined for $u \in \Sigma^*$. If for some $u_1, u_2, u \in \Sigma^*$ we have $\mathbb{E}(|L_{u_1 u} - L_{u_2 u}|) > 2\epsilon$, then the words $u_1$ and $u_2$ lead (from the initial state) to different states in $A$. Using this observation, we can show that $A$ has at least 3 states that cannot be successors of the initial state: the initial state itself and (at least two) states that correspond to bottom SCCs.

Finally, we define the vectors $\vec{u}_1, \ldots, \vec{u}_k$ from the successors of the initial state of $A$. Formally, for $i \in \{1, \ldots, n\}$ we define a vector $\vec{y}_i$ by $\vec{y}_i[j] = \mathbb{E}(L_A \mid a_i a_j \Sigma^\omega)$. Note that there are at most $k$ different vectors among $\vec{y}_1, \ldots, \vec{y}_n$, and we take these distinct vectors as $\vec{u}_1, \ldots, \vec{u}_k$. The condition $\mathbb{E}(|L_{A_\mathcal{C}} - L_A| \mid u\Sigma^\omega) \leq \epsilon$ implies that $\vec{u}_1, \ldots, \vec{u}_k$ satisfy the $\frac{1}{2}$-vector cover problem.

Conversely, assume that the instance $(n, k, \mathcal{C})$ has a solution. Observe that w.l.o.g. we can assume that the solution vectors $\vec{u}_1, \ldots, \vec{u}_k$ belong to $\{\frac{1}{4}, \frac{3}{4}\}^n$.
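The w.l.o.g. step follows from a short case analysis: for $v \in \{0, \frac{1}{2}, 1\}$ and $|u - v| \leq \frac{1}{4}$, replacing $u$ by the nearer of $\frac{1}{4}$ and $\frac{3}{4}$ keeps the distance at most $\frac{1}{4}$. The following Python sketch is ours, not from the paper, with illustrative names; it makes the rounding and the cover condition explicit.

```python
# Sketch of the 1/2-vector cover condition and the w.l.o.g. rounding.
def round_to_quarters(u):
    # Rounding each coefficient to the nearer of 1/4 and 3/4 never increases
    # the L_inf distance to a vector over {0, 1/2, 1}.
    return [0.25 if x < 0.5 else 0.75 for x in u]

def covers(u, v):
    # u serves v when their L_inf distance is at most 1/4
    return max(abs(a - b) for a, b in zip(u, v)) <= 0.25

def is_cover(us, vectors):
    us = [round_to_quarters(u) for u in us]
    return all(any(covers(u, v) for u in us) for v in vectors)
```

Such a check can serve as a sanity test for candidate solutions before the automaton described next is built from them.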
Based on this solution, we define an automaton $A$ with $k + 3$ states such that the successors of the initial state correspond to the vectors $\vec{u}_1, \ldots, \vec{u}_k$ and there are two bottom SCCs of the values $\frac{1}{4}$ and $\frac{3}{4}$. We define $A$ as follows. Let $f \colon \{1, \ldots, n\} \to \{1, \ldots, k\}$ be a mapping of the vectors $\vec{v}_i \in \mathcal{C}$ to vectors $\vec{u}_j$ such that $\|\vec{v}_i - \vec{u}_j\|_\infty \leq \frac{1}{4}$ (and then $f(i) = j$). We define $A$ with states $q_0, q_1, \ldots, q_k, s_1, s_2$ such that $q_0$ is the initial state, $q_1, \ldots, q_k$ are the successors of the initial state, and $s_1, s_2$ are single-state bottom SCCs of the values $\frac{1}{4}$ and $\frac{3}{4}$, respectively. We define the transition function as follows. For all $a_i \in \Sigma$ we set $\delta(q_0, a_i) = q_{f(i)}$. Then, for every $i \in \{1, \ldots, k\}$ and $j \in \{1, \ldots, n\}$ we set $\delta(q_i, a_j) = s_1$ if $\vec{u}_i[j] = \frac{1}{4}$ and $\delta(q_i, a_j) = s_2$ otherwise (i.e., if $\vec{u}_i[j] = \frac{3}{4}$). The fact that $\vec{u}_1, \ldots, \vec{u}_k$ is a solution to the $\frac{1}{2}$-vector cover problem implies that $A$ is a rigid $\frac{1}{4}$-approximation of $A_\mathcal{C}$.

The dominating set problem reduces to the $\frac{1}{2}$-vector cover problem. Consider a graph $G = (V, E)$ with $V = \{b_1, \ldots, b_n\}$ and $k \in \mathbb{N}$. We assume that $G$ has no cycles of length less than 5; this restriction does not influence the NP-hardness of the problem, since each edge can be substituted with a path of odd length greater than 5. Denote by $N_G^k(b)$ the set of all nodes of $G$ connected to $b$ with a path of length at most $k$; every $b$ is connected with itself by a path of length 0 and hence $b \in N_G^k(b)$. We define vectors $\vec{v}_1, \ldots, \vec{v}_n \in \{0, \frac{1}{2}, 1\}^n$ as follows: for $i, j \in \{1, \ldots, n\}$ we set $\vec{v}_j[i] = 1$ if $i = j$, $\vec{v}_j[i] = \frac{1}{2}$ if $i \neq j$ but $b_i \in N_G^2(b_j)$, and $\vec{v}_j[i] = 0$ otherwise.
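For concreteness, the instance construction can be written down directly. The sketch below is ours, not from the paper; it assumes a 0-indexed vertex set and an edge list, and computes the vectors $\vec{v}_1, \ldots, \vec{v}_n$ from the distance-2 balls of $G$.

```python
# Sketch: build the 1/2-vector cover instance from a graph G with n nodes.
def graph_to_vectors(n, edges):
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    # ball2[j]: nodes within distance at most 2 of node j (the set N_G^2(b_j))
    ball2 = [{j} | adj[j] | {l for i in adj[j] for l in adj[i]}
             for j in range(n)]
    return [[1.0 if i == j else (0.5 if i in ball2[j] else 0.0)
             for i in range(n)]
            for j in range(n)]
```

Combined with the cover check sketched earlier, this allows a quick test of the forward direction: the vector $\vec{u}_i$ obtained from a dominating-set node $d_i$ has the coefficient $\frac{3}{4}$ exactly on the closed neighbourhood $N_G^1(d_i)$ and $\frac{1}{4}$ elsewhere.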
We claim that there exist $\vec{u}_1, \ldots, \vec{u}_k \in \mathbb{R}^n$ as in the problem statement if and only if $G$ has a dominating set of size $k$.

Assume that $G$ has a dominating set $d_1, \ldots, d_k$. Consider $\vec{u}_1, \ldots, \vec{u}_k$ such that for all $i \in \{1, \ldots, k\}$ we have $\vec{u}_i[j] = \frac{3}{4}$ if $d_i = b_j$ or $(d_i, b_j) \in E$, and $\vec{u}_i[j] = \frac{1}{4}$ otherwise. Observe that for all $i \in \{1, \ldots, k\}$ and $j \in \{1, \ldots, n\}$, if $i$ satisfies $d_i = b_j$ or $(b_j, d_i) \in E$, then $\|\vec{v}_j - \vec{u}_i\|_\infty \leq \frac{1}{4}$. Therefore, the vectors $\vec{u}_1, \ldots, \vec{u}_k$ solve the $\frac{1}{2}$-vector cover problem.

Conversely, assume that there exist $\vec{u}_1, \ldots, \vec{u}_k \in \mathbb{R}^n$ that solve the $\frac{1}{2}$-vector cover problem. Let $\vec{u}_j$ be a vector such that for $\vec{v}_{m[1]}, \ldots, \vec{v}_{m[l]}$ we have $\|\vec{u}_j - \vec{v}_{m[i]}\|_\infty \leq \frac{1}{4}$. Note that we can assume that all coefficients of $\vec{u}_j$ are $\frac{1}{4}$ or $\frac{3}{4}$. We claim that the nodes $b_{m[1]}, \ldots, b_{m[l]}$ of $G$ all belong to $N_G^1(b_{m[i]})$ for some $i$. To see this, observe that the distance between any two nodes among $b_{m[1]}, \ldots, b_{m[l]}$ is at most 2. Indeed, for every $k'$ the component $m[k']$ of $\vec{v}_{m[k']}$ is 1 and hence this component of $\vec{u}_j$ is $\frac{3}{4}$. That in turn implies that the component $m[k']$ of each of $\vec{v}_{m[1]}, \ldots, \vec{v}_{m[l]}$ is 1 or $\frac{1}{2}$, and hence $b_{m[k']}$ belongs to each of $N_G^2(b_{m[1]}), \ldots, N_G^2(b_{m[l]})$. Since there are no short cycles in $G$ and all these distances are bounded by 2, there has to be an $i$ such that $b_{m[1]}, \ldots, b_{m[l]}$ all belong to $N_G^1(b_{m[i]})$. Then, we define $d_j$ as $b_{m[i]}$.

Note that $d_j$ dominates all the nodes $b_{m[1]}, \ldots, b_{m[l]}$, which correspond to the vectors $\vec{v}_{m[1]}, \ldots, \vec{v}_{m[l]}$. Therefore, the nodes $d_1, \ldots, d_k$ picked as above form a dominating set in $G$. ◀

C Estimating minimal sample size in passive learning
The probability that a single word of length $l$ is not generated by a random sample $S$ with $s$ examples, generated w.r.t. a distribution $\mathcal{G}(\lambda)$, can be bounded by

$$\left(1 - \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}\right)^s$$

We want to compute a sample size $s$ such that the probability that there is a word of length $l$ not in this sample is at most $\epsilon$. This can be, very roughly, represented by the following inequality:

$$|\Sigma|^l \left(1 - \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}\right)^s < \epsilon$$

which we can conveniently rewrite as

$$\left(\left(1 - \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}\right)^{\frac{|\Sigma|^l}{(1-\lambda)^l \lambda}}\right)^{s \cdot \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}} < \frac{\epsilon}{|\Sigma|^l}$$

By the fact that $(1-x)^{1/x} < e^{-1}$, the above inequality is a consequence of the following one:

$$e^{-s \cdot \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}} < \frac{\epsilon}{|\Sigma|^l}$$

which is equivalent to

$$e^{s \cdot \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}} > \frac{|\Sigma|^l}{\epsilon}$$

Now we apply the natural logarithm:

$$s \cdot \frac{(1-\lambda)^l \lambda}{|\Sigma|^l} > \ln \frac{|\Sigma|^l}{\epsilon}$$

and so

$$s > \frac{|\Sigma|^l}{(1-\lambda)^l \lambda} \cdot \ln \frac{|\Sigma|^l}{\epsilon}$$

For (