Approximate Learning of Limit-Average Automata
Jakub Michaliszyn
University of Wrocław, [email protected]
Jan Otop
University of Wrocław, [email protected]
Abstract
Limit-average automata are weighted automata on infinite words that use the average to aggregate the weights seen in infinite runs. We study approximate learning problems for limit-average automata in two settings: passive and active. In the passive learning case, we show that limit-average automata are not PAC-learnable, as samples must be of exponential size to provide (with good probability) enough details to learn an automaton. We also show that the problem of finding an automaton that fits a given sample is NP-complete. In the active learning case, we show that limit-average automata can be learned almost-exactly, i.e., we can learn in polynomial time an automaton that is consistent with the target automaton on almost all words. On the other hand, we show that the problem of learning an automaton that approximates the target automaton (with perhaps fewer states) is NP-complete. The abovementioned results are shown for the uniform distribution on words. We briefly discuss learning over different distributions.

2012 ACM Subject Classification Theory of computation → Automata over infinite objects; Theory of computation → Quantitative automata
Keywords and phrases weighted automata, learning, expected value
Funding
The National Science Centre (NCN), Poland under grant 2017/27/B/ST6/00299.
1 Introduction

Quantitative verification has been proposed to verify non-Boolean system properties such as performance, energy consumption, fault-tolerance, etc. There are two main challenges in applying quantitative verification in practice. First, formalization of quantitative properties is difficult, as specifications are given directly as weighted automata, which extend finite automata with weights on transitions [18, 15]. Quantitative logics, which could facilitate the specification task, have either limited expressive power or an undecidable model-checking problem [11, 13]. Second, there is little research on abstraction in the quantitative setting [14], which would allow us to reduce the size of quantitative models presented as weighted automata.

We approach both problems using learning of weighted automata. We can apply the learning framework to facilitate writing quantitative specifications and make it more accessible to non-experts. For abstraction purposes, we study approximate learning, having in mind that approximation of a system can significantly reduce its size.

We focus on weighted automata over infinite words which compute the long-run average of weights [15]. Such automata express interesting quantitative properties [15, 23] and admit polynomial-time algorithms for basic decision questions [15]. Some of the interesting properties can only be expressed by non-deterministic automata [15], but every non-deterministic weighted automaton is approximated by some deterministic weighted automaton on almost all words [29]. This means that, by allowing some small margin of error, we can focus on deterministic automata while still being able to model important system properties, such as minimal response time, minimal number of errors, the edit distance problem [23], and the specification repair framework from [10].

We therefore focus on approximate learning under probabilistic semantics, which corresponds to the average-case analysis in quantitative verification [16]. We treat words as random events and functions defined by weighted automata as random variables. In this setting, an automaton A' ε-approximates A if the expected difference between A and A' (over all words) is bounded by ε. We consider two learning approaches: passive and active.

In passive learning, we think of an automaton as a black-box. This might be a program, a specification, a working correct system to be replaced or abstracted, or even only some examples of how the automaton should work. Our goal is to construct a weighted automaton based on this black-box model by observing only inputs and outputs of the model.

In active learning, we assume the presence of an interactive teacher that reveals the values of given examples and verifies whether a provided automaton is as intended; if not, the teacher responds with a witness, which is a word showing the difference between the constructed automaton and the intended automaton. This does not necessarily mean that the teacher is familiar with the weighted automata formalism: to provide a witness, they may simply run the constructed automaton in parallel with the black-box and, if at any point their behaviors differ, provide the appropriate input to the learning algorithm.
Our contributions. We start with a discussion of the setting for learning problems. The first step is to find a suitable representation for samples; not all infinite words have finite representations. A natural idea is to use ultimately periodic words in samples; we study such samples. However, the probability distribution over ultimately periodic words is very different from the uniform distribution over infinite words. Therefore, we also consider samples consisting of a finite word u labeled with the expected value over all extensions of u. The probability distribution over such samples is closer to the uniform distribution over infinite words.

Then we study the passive learning problem. We show that for a unique characterization of an automaton, we need a sample of exponential size. We study the complexity of the problem of finding an automaton that fits the whole sample. The problem, without additional restrictions, has a trivial and overfitting solution. To mitigate this, we impose bounds on the automaton size, and show that then the problem is NP-complete.

For active learning, we show that the problem of learning an almost-exact automaton can be solved in polynomial time, while finding an automaton of bounded size that approximates the target one cannot be done in polynomial time unless P = NP. We conclude with a discussion of different probability distributions.
Related work. Probably approximately correct (PAC) learning, introduced in [35], is a general passive learning framework applied to various objects (DNF/CNF formulas, decision trees, automata, etc.) [26]. PAC learning of deterministic finite automata (DFA) has been extensively studied despite negative indicators. First, the sample fitting problem for DFA, where the task is to construct a minimal-size DFA consistent with a given sample, has been shown NP-complete [21]. Even approximate sample fitting, where we ask for a DFA at most polynomially greater than a minimal-size DFA, remains NP-complete [34]. Second, it has been shown that the existence of a polynomial-time PAC learning algorithm for DFA would break certain cryptographic systems (such as RSA) and hence it is unlikely [25]. Despite these negative results, it has been empirically shown that DFA can be efficiently learned [27]. In particular, if we assume structural completeness of a sample, then it determines a minimal DFA [32]. Pitt posed the question whether DFA are PAC-learnable under the uniform distribution [33], which remains open [2].

Angluin showed that DFA can be learned in polynomial time if the learning algorithm can ask membership and equivalence queries [1]. This approach proved to be very fruitful and versatile. Angluin's algorithm has been adapted to learn NFA [12], automata over infinite words [3, 20], nominal automata over infinite alphabets [31], weighted automata over words [9] and weighted automata over trees [22, 28].

Recently, there has been a renewed interest in learning weighted automata [8, 28, 6, 5, 7]. These results apply to weighted automata over fields [18], which work over finite words. We, however, consider limit-average automata, which work over infinite words and cannot be defined using a field or even a semiring. Furthermore, we consider weighted (limit-average) automata under probabilistic semantics [16, 29], i.e., we consider the functions represented by automata as random variables.

2 Preliminaries

Given a finite alphabet Σ of letters, a word w is a finite or infinite sequence of letters. We denote the set of all finite words over Σ by Σ*, and the set of all infinite words over Σ by Σ^ω. For a word w, we define w[i] as the i-th letter of w, and we define w[i, j] as the subword w[i] w[i+1] ... w[j] of w. We use the same notation for vectors and sequences; we assume that sequences start with index 0.

A deterministic finite automaton (DFA) is a tuple (Σ, Q, q₀, F, δ) consisting of the alphabet Σ, a finite set of states Q, the initial state q₀ ∈ Q, a set of final states F, and a transition function δ : Q × Σ → Q.

A (deterministic) LimAvg-automaton extends a DFA with a function C : δ → ℚ that defines rational weights of transitions. The size of such an automaton A, denoted by |A|, is the sum of the number of states of A and the lengths of the binary encodings of all the weights.

A run π of a LimAvg-automaton A on a word w is a sequence of states π[0] π[1] ... such that π[0] is the initial state and for every i > 0 we have δ(π[i−1], w[i]) = π[i]. We do not consider ω-accepting conditions and assume that all infinite runs are accepting. Every run π of A on an infinite word w defines a sequence of weights C(π) of successive transitions of A, i.e., C(π)[i] = C(π[i−1], w[i], π[i]). The value of the run π is then defined as LimAvg(π) = lim sup_{k→∞} Avg(C(π)[0, k]), where for finite runs π we have Avg(C(π)) = Sum(C(π)) / |C(π)|.
The value of a word w assigned by the automaton A, denoted by L_A(w), is the value of the run of A on w.

We consider three classes of probability measures on words over the alphabet Σ.

- U_n, for n ∈ ℕ, is the uniform probability distribution on Σⁿ, assigning each word the probability |Σ|^{-n}.
- G(λ), for a termination probability λ ∈ ℚ⁺ ∩ (0, 1), is the distribution on Σ* such that for u ∈ Σ* we have G(λ)(u) = |Σ|^{-|u|} · (1 − λ)^{|u|} · λ. Observe that the probability of words of the same length is equal, and the probability of generating a word of length k is (1 − λ)^k · λ. We can consider this as a process generating finite words, which stops after each step with probability λ.
- U_∞ is the uniform probability measure on Σ^ω. Formally, we define U_∞ on basic sets uΣ^ω = { uw | w ∈ Σ^ω } as follows: U_∞(uΣ^ω) = |Σ|^{-|u|}. Then, U_∞ is the unique extension to all Borel sets in Σ^ω considered with the product topology [19, 4].
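To make the semantics concrete, the following minimal sketch (ours, not from the paper) evaluates a deterministic LimAvg-automaton on an ultimately periodic word u v^ω; the dictionary encoding of transitions is an illustrative assumption. Since the run is eventually periodic, the lim sup of the partial averages equals the average weight over the cycle of states entered by repeatedly reading v.

```python
from fractions import Fraction

# Transitions encoded as delta[(state, letter)] = (next_state, weight);
# this encoding is our own assumption, chosen for the sketch.
def run(delta, state, word):
    """Read a finite word; return the end state and the weights seen."""
    weights = []
    for letter in word:
        state, w = delta[(state, letter)]
        weights.append(Fraction(w))
    return state, weights

def limavg_value(delta, initial, u, v):
    """Value of u v^omega: the average weight over the cycle of states
    that repeatedly reading v eventually enters (v must be nonempty)."""
    state, _ = run(delta, initial, u)
    seen, trace = {}, []
    while state not in seen:            # iterate v until a state repeats
        seen[state] = len(trace)
        trace.append(state)
        state, _ = run(delta, state, v)
    total, count = Fraction(0), 0
    for s in trace[seen[state]:]:       # states on the cycle
        _, ws = run(delta, s, v)
        total += sum(ws)
        count += len(ws)
    return total / count

# toy automaton: a single state with weight 1 on 'a' and 0 on 'b'
delta = {("q", "a"): ("q", 1), ("q", "b"): ("q", 0)}
print(limavg_value(delta, "q", "b", "ab"))   # 1/2
```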
Automata as random variables. A weighted automaton A defines the measurable function L_A : Σ^ω → ℝ that assigns values to words. We interpret such functions as random variables w.r.t. the probability measure U_∞. Hence, for a given automaton A, we consider the following quantities:

- E(A): the expected value of the random variable L_A defined by A w.r.t. the uniform distribution U_∞ on Σ^ω, and
- E(A | uΣ^ω): the conditional expected value, defined for U_∞ as the expected value of L_u such that L_u(w) = L_A(uw).

We consider automata as generators of random variables: two automata are almost equivalent if they define almost equal random variables. Formally, we say that A₁ and A₂ are almost equivalent if and only if for almost all words w we have L_{A₁}(w) = L_{A₂}(w). Note that almost all words means all except for words from some set Y of probability 0.

LimAvg-automata considered over probability distributions are equivalent to Markov chains with long-run average objectives, presented in [4, Section 10.5.2].
Theorem 1 ([4]). Let A be a LimAvg-automaton. (1) If A is strongly connected, then almost all words have the same value, equal to E(A). (2) For almost all words w, the run on w eventually reaches some bottom strongly connected component (SCC) of A.

Theorem 1 has important consequences. First, we can contract each bottom SCC to a single state with self-loops of the same weight x, being the expected value of that SCC. We refer to x as the value of that SCC. Such an operation does not affect almost equivalence. Second, while reasoning about LimAvg-automata, we can neglect all weights except for the values of bottom SCCs. We will omit weights other than the values of bottom SCCs.
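These consequences suggest a simple procedure for E(A). The sketch below is our own illustration (not the paper's algorithm): it assumes the bottom SCCs have already been contracted to absorbing states whose self-loops carry the SCC values, and computes E(A) by value iteration over the transient states.

```python
def expected_value(delta, alphabet, initial, rounds=10_000):
    """E(A) for an automaton whose bottom SCCs are contracted to absorbing
    states; delta[(state, letter)] = (next_state, weight) as before."""
    states = {q for (q, _) in delta}
    absorbing = {q for q in states
                 if all(delta[(q, a)][0] == q for a in alphabet)}
    # an absorbing state's value is the weight of its self-loops
    E = {q: (delta[(q, alphabet[0])][1] if q in absorbing else 0.0)
         for q in states}
    for _ in range(rounds):   # E(s) = average of E(delta(s, a)) over letters
        for q in states - absorbing:
            E[q] = sum(E[delta[(q, a)][0]] for a in alphabet) / len(alphabet)
    return E[initial]

# a word starting with 'a' is worth 1, a word starting with 'b' is worth 0
delta = {("s", "a"): ("p", 0), ("s", "b"): ("r", 0),
         ("p", "a"): ("p", 1), ("p", "b"): ("p", 1),
         ("r", "a"): ("r", 0), ("r", "b"): ("r", 0)}
print(expected_value(delta, "ab", "s"))   # 0.5
```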
Samples.
A sample is a set of labeled examples of some (hidden) function L : Σ^ω → ℝ. In the classical automata-learning approach, words are finite, and hence they can be presented as examples along with the information whether the word belongs to the hidden language or not. In the infinite-word case, however, words cannot be given directly to a learning algorithm. We present two alternative solutions to this problem. One is to restrict examples to ultimately periodic words, which have a finite presentation, and the other is to consider finite words and ask for conditional expected values. We discuss both approaches below. To distinguish samples with different types of labeled examples, we call them U-samples, E-samples and (E, n)-samples.

Ultimately periodic words. Consider an example being an ultimately periodic word uv^ω. It can be presented as a pair of finite words (u, v), and we consider labeled examples (u, v, x), where u, v ∈ Σ* and x = L(uv^ω). A set of labeled examples (u, v, x) is called a U-sample. To draw a random U-sample, we consider finite words u, v to be selected independently at random according to distributions G(λ₁) and G(λ₂) for some λ₁, λ₂. For such a set of pairs of words, we label them according to the function L.

Conditional expected values. We consider examples which are finite words u ∈ Σ*. A labeled example is a pair (u, x), where x = E(L | uΣ^ω) is the conditional expected value of L under the condition that random words start with u. For such labeled examples we consider E-samples, consisting of labeled words of various lengths, and (E, n)-samples, consisting of words of length n. We assume that finite words for random E-samples are drawn according to a distribution G(λ) for some λ, and finite words for random (E, n)-samples are drawn according to the uniform distribution U_n.

We only consider minimal consistent samples, i.e., samples that do not contain examples whose value can be computed from other examples in the sample. For instance, {(a, a, 0), (a, aa, 1)} is an inconsistent U-sample (both examples describe the word a^ω), and {(aa, 0), (ab, 0), (a, 1)} is an inconsistent E-sample over {a, b} (the label of a must be the average of the labels of aa and ab).
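A small sketch of how such random samples could be drawn; the label function is an assumption standing in for the hidden L:

```python
import random

def random_word(sigma, lam, rng=random):
    """Draw a finite word from G(lam): after each step, stop w.p. lam."""
    w = []
    while rng.random() >= lam:
        w.append(rng.choice(sigma))
    return "".join(w)

def random_u_sample(sigma, lam1, lam2, size, label):
    """U-sample of triples (u, v, label(u, v)); label(u, v) is assumed
    to return L(u v^omega). The period v is redrawn until nonempty."""
    sample = []
    for _ in range(size):
        u = random_word(sigma, lam1)
        v = random_word(sigma, lam2)
        while not v:
            v = random_word(sigma, lam2)
        sample.append((u, v, label(u, v)))
    return sample
```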
Remark 2 (Incompatible distributions). Note that the distribution on ultimately periodic words differs from the uniform distribution on infinite words. The set of ultimately periodic words is countable and hence it has probability 0 (according to the distribution U_∞). Moreover, almost all infinite words contain all finite words as infixes, whereas this is not the case for ultimately periodic words under any probability distribution.

Remark 3 (Feasibility of conditional expectation). Consider a
LimAvg-automaton A computing L : Σ^ω → ℝ. For a finite prefix u, we can compute E(L | uΣ^ω) in time polynomial in |A| [4]. If we consider A to be a black-box which can be controlled, then E(L | uΣ^ω) can be approximated in the following way. We pick random words v₁, ..., v_k of length k, compute the partial averages of A on uv₁, ..., uv_k, and then take the average of these values. The probability that this process returns a value ε-close to E(L | uΣ^ω) converges to 1 at an exponential rate in k.
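The procedure of Remark 3 admits a direct Monte Carlo reading. The sketch below is ours, and `black_box` (returning the partial average of the run on a finite word) is an assumed interface to the controlled system:

```python
import random

def estimate_conditional_expectation(black_box, sigma, u, k, rng=random):
    """Estimate E(L | u Sigma^omega): average the partial averages of the
    runs on u extended by k random words of length k."""
    total = 0.0
    for _ in range(k):
        v = "".join(rng.choice(sigma) for _ in range(k))
        total += black_box(u + v)
    return total / k
```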
3 Passive learning

Passive learning corresponds to a scenario with an uncontrolled working black-box system. The learner can only observe the system's output, and its goal is to create an approximate model of the system. This task comprises two problems. The first problem, characterization, is to assess whether the observations cover most, if not all, behaviors of the system. The second one, called sample fitting, is to create a reasonable automaton consistent with the observations. In this section we discuss both problems.

Characterization. A sample can cover only a small part of the system. It is sometimes argued [27, 32], however, that if a sample is large enough, then it is likely to cover most, if not all, important behaviors. We show that for some
LimAvg-automata, randomly drawn samples of size less than exponential are unlikely to demonstrate any probable values.

Let ||S|| denote the sum of the lengths of all the examples in a sample S. A sample distinguishes two automata if it is consistent with exactly one of them. We show the following.

Theorem 4.
For any n there are two automata A^1_n, A^2_n of size n + 4 such that for almost all words w we have |L_{A^1_n}(w) − L_{A^2_n}(w)| = 2, but a random U-sample, E-sample or (E, k)-sample (for any k) S distinguishes A^1_n and A^2_n with probability at most ||S||/2ⁿ.

Proof.
Consider the alphabet {a, b} and n ∈ ℕ. We construct a LimAvg-automaton A^1_n with n + 4 states. We use n + 2 states to find the first occurrence of the infix aⁿb in the standard manner. When such an infix is found, the automaton moves to a state q_a if the following letter is a, and to a state q_b otherwise. In q_a it loops with weight 1 and in q_b it loops with weight −1. All other weights are 0. So A^1_n returns 1 if the first occurrence of aⁿb is followed by a, −1 if it is followed by b, and 0 if there is no aⁿb. The automaton A^2_n has the same structure as A^1_n, but the weights 1 and −1 are swapped.

The automata A^1_n and A^2_n differ over almost all infinite words. Indeed, an infinite word contains the infix aⁿb with probability 1, and so on almost all words one of the automata returns 1 and the other −1.

Consider a random sample S (it can be an E-sample, (E, k)-sample or U-sample). This sample distinguishes the automata A^1_n and A^2_n only if it contains an example with the infix aⁿb (in the case of U-samples, this means that this infix occurs in uvv for some example (u, v)); all other examples are the same for both automata. The probability that S contains aⁿb as an infix of one of its examples is bounded by ||S||/2^{n+1}. Indeed, the number of positions in all words is ||S||, and the probability that aⁿb occurs at some specific position is at most 1/2^{n+1} for all types of samples and any k > 0 (for k < n + 1, the probability is 0). ◀
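For a concrete feel of the bound, the following small dynamic program (ours, purely illustrative) computes the exact probability that a random word over {a, b} of length l contains the infix aⁿb, using the same pattern-tracking automaton as in the proof:

```python
def prob_contains_anb(n, l):
    """Probability that a uniform word in {a,b}^l contains the infix a^n b.
    States 0..n count the current run of a's; state n+1 is accepting."""
    dist = [1.0] + [0.0] * (n + 1)
    for _ in range(l):
        new = [0.0] * (n + 2)
        for s, p in enumerate(dist[:-1]):
            new[min(s + 1, n)] += p / 2              # read 'a'
            new[n + 1 if s == n else 0] += p / 2     # read 'b'
        new[n + 1] += dist[-1]                       # stay accepted
        dist = new
    return dist[-1]

print(prob_contains_anb(20, 10_000))   # ~0.005: even long samples miss a^20 b
```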
Therefore, to be able to distinguish just two automata with probability 1 − ε, we need a sample such that ||S||/2^{n+1} > 1 − ε, which for any fixed ε < 1 requires ||S|| to be exponential in n.

U-samples. If automata A₁, A₂ of size n recognize different languages, then there is a word uw^ω such that L_{A₁}(uw^ω) ≠ L_{A₂}(uw^ω) and the lengths of u and w are bounded by n². Assume for simplicity that λ₁ = λ₂ = λ. If the sample size is at least |Σ|^{2n²} · ln(|Σ|^{2n²}/ε) · (1 − λ)^{−2n²} · λ^{−2}, then with probability 1 − ε a random sample contains all such words, and so distinguishes all the automata of size n.

E-samples. E-samples do not distinguish almost-equivalent automata, hence we cannot learn automata exactly. However, exponential samples are enough to learn automata up to almost equivalence. To see that, consider two automata A₁ and A₂ of size n that are not almost equivalent. Due to Theorem 1, there is a word u ∈ Σ* such that u reaches bottom SCCs in both A₁ and A₂, and these bottom SCCs have different expected values. Using a standard pumping argument, we can reduce the length of u to at most n². So if the sample size is at least |Σ|^{n²} · ln(|Σ|^{n²}/ε) · (1 − λ)^{−n²} · λ^{−1}, then with probability 1 − ε it contains all such words u, and therefore distinguishes all automata of size n.

For (E, n)-samples, the reasoning is similar, assuming n is quadratic in the size of the automaton: the sufficient sample size is |Σ|ⁿ · ln(|Σ|ⁿ/ε).

PAC learning. We discuss the consequences of our results for the probably approximately correct (PAC) model of learning [35]. In the PAC framework, the learning algorithm should work independently of the probability distribution on samples. However, variants of the PAC framework have been considered where the distribution on samples is uniform [24]. In particular, PAC learning of DFA under the uniform distribution over words is a long-standing open problem [33, 2]. We restrict the classical PAC model and assume that observations are drawn according to the distributions U_n, G(λ) (as discussed in Section 2), and the quality of the learned automaton is assessed using the uniform distribution over infinite words U_∞.

Problem 5 (PAC learning under fixed distributions).
Given ε, δ ∈ ℚ⁺, n ∈ ℕ and an oracle returning random labeled examples consistent with some automaton A_T of size n, construct an automaton A such that with probability 1 − ε we have E(|L_A − L_{A_T}|) < δ.

As a consequence of Theorem 4, there is no PAC-learning algorithm for LimAvg-automata with U-samples (resp., E-samples or (E, k)-samples) that uses samples of polynomial size; in particular, there is no such algorithm working in polynomial time.
Theorem 6.
The class of
LimAvg-automata is not PAC-learnable with U-samples, E-samples or (E, k)-samples.

Sample fitting. Once we have a sample, the problem of finding an automaton fitting the sample can be solved in polynomial time in a trivial way: we create an automaton that is a tree such that every word of a given sample leads to a different leaf in this tree, and then we add loops with appropriate values in the leaves (similarly to a prefix tree acceptor [17] for finite automata). This solution leads to an automaton that overfits the sample, as it works well only for the sample and is unlikely to work well on words not included in the sample. Besides, the automaton is linear in the size of the sample, not in the size of the black-box system. For a fixed automaton we can construct arbitrarily large U-samples (or E-samples) consistent with it, and hence the gap between the size of such an automaton and the black-box system is arbitrarily large. To exclude such solutions, we restrict the size of the automaton to be constructed. We study the following problem.

[Figure 1: The canonical automaton A_φ from the proof of Theorem 8 (transition diagram omitted).]

Problem 7 (Sample fitting).
Given a sample S and n ∈ ℕ, construct a LimAvg-automaton with at most n states which is consistent with S.

The decision version of this problem only asks whether such an automaton exists. We show that this problem is NP-complete, regardless of the sample representation. For hardness, we reduce from a restricted variant of SAT, in which each clause of the CNF formula contains only positive literals or only negative literals; this variant remains NP-complete [17].

Theorem 8.
The sample fitting problem is NP-complete for U-samples.

Proof.
Membership in NP follows from the following observation: if n is greater than the total length of the sample, then we return yes, as the tree-like solution works. Otherwise, we non-deterministically pick an automaton of size n and check whether it fits the sample.

The NP-hardness proof is inspired by the construction from [17, Theorem 6.2.1]. For a given instance φ = C₁ ∧ ... ∧ C_n of the restricted SAT problem over variables x₁, ..., x_n (not all variables need to occur in φ), we construct a U-sample S_φ such that there is an automaton with n states fitting S_φ if and only if φ is satisfiable.

We fix the alphabet {a_i, c_i, d_i | i = 1, ..., n} ∪ {b, t}. The sample S_φ consists of:
S1: (c_i, a_j, x) for each i, j ∈ {1, ..., n}, where x is 1 if i = j, and 0 otherwise.
S2: (c_i, d_j, x) for each i, j ∈ {1, ..., n}, where x is 1 if x_i is in C_j, and 0 otherwise.
S3: (c_i b, d_i, 1) for each i ∈ {1, ..., n}.
S4: (c_i b, t, x) for each i ∈ {1, ..., n}, where x is 1 if the clause C_i contains only positive literals and 0 if it contains only negative literals.

Assume that φ is satisfiable and let σ : {x₁, ..., x_n} → {0, 1} be a satisfying valuation. Then, we construct an automaton A_φ consistent with the sample S_φ, starting from the structure presented in Figure 1. Then, we add the following transitions:
- for each i, a loop in q_i on the letter t with the value σ(x_i),
- for each i, j, a loop in q_i on the letter d_j with the value 1 if x_i is in C_j and 0 otherwise,
- for each clause C_i, if C_i is satisfied because of a variable x_j, then we add a transition from q_i to q_j over b (if there are multiple possible variables, we choose any).
The remaining transitions can be set in an arbitrary way. The obtained automaton A_φ is consistent with the sample S_φ.

Now assume that there is an automaton A with n states which is consistent with the sample. We show that the valuation σ such that σ(x_i) = L_A(c_i t^ω) satisfies φ. Let q_i be the state where the automaton A is after reading the word c_i. By S1, all the states q₁, ..., q_n are pairwise different. Since there are only n states, q₁, ..., q_n are all the states of A. Now consider any clause C_i. Let q_j be the state of A after reading c_i b. Notice that by S3, the value of d_i^ω in q_j is 1, and by S2, this means that x_j is in C_i. If C_i contains only positive literals, then the value of t^ω in q_j is 1 by S4, which means that σ(x_j) = 1 and that C_i is satisfied. The other case is symmetric. ◀

[Figure 2: A picture of the canonical automaton from the proof of Theorem 9. All the weights are 0 except for the transitions from the state q_T.]
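For illustration, here is a sketch (our own rendering, with clauses given as pairs of a variable set and a positivity flag, and words as tuples of letter names) of how the sample S_φ of the proof above could be assembled:

```python
def build_sample(n, clauses):
    """U-sample S_phi for a restricted CNF with n variables and n clauses;
    clauses[j] = (set_of_variable_indices, all_literals_positive)."""
    S = []
    for i in range(1, n + 1):                                     # S1
        for j in range(1, n + 1):
            S.append(((f"c{i}",), (f"a{j}",), 1 if i == j else 0))
    for i in range(1, n + 1):                                     # S2
        for j, (vs, _) in enumerate(clauses, start=1):
            S.append(((f"c{i}",), (f"d{j}",), 1 if i in vs else 0))
    for i, (vs, positive) in enumerate(clauses, start=1):
        S.append(((f"c{i}", "b"), (f"d{i}",), 1))                 # S3
        S.append(((f"c{i}", "b"), ("t",), 1 if positive else 0))  # S4
    return S

# (x1 or x2) and (not x1 or not x3) and (x3), over x1, x2, x3
print(len(build_sample(3, [({1, 2}, True), ({1, 3}, False), ({3}, True)])))
```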
Theorem 9. The sample fitting problem is NP-complete for E-samples.

Proof.
The proof is similar to the proof of Theorem 8. For the NP-hardness, the sample now is obtained from the sample in the proof of Theorem 8 by replacing every triple (u, v, x) by the pair (uv, x). However, now we ask for an automaton of size n + 2. If there is a valuation that satisfies a given set of clauses, then one can construct an automaton based on the one presented in Figure 2. On the other hand, if there is an automaton fitting the sample, then it has to have a state where the expected value of any word is 0, a state where the expected value of any word is 1, and n different states reachable by each of c₁, ..., c_n. The rest of the proof is virtually the same as in Theorem 8, except that now we define σ such that σ(x_i) is the expected value of words with the prefix c_i t. ◀

The above proofs also work with some natural relaxations of the sample fitting problem. For example, if we only require the automaton to fit the sample up to some ε < 1/2, then the proofs still hold, since we use only weights 0 and 1. Another relaxation, for the E-samples case, is to allow the automaton to give wrong values for some examples as long as the summarized probability of the examples with wrong values is less than some ε. However, since all the words are of length at most three, the probability of uΣ^ω for each example u is greater than 1/(3n + 2)³ (recall that |Σ| = 3n + 2), which means that for any ε < 1/(3n + 2)³, for every example some extension of it must fit and hence the whole sample must fit.

4 Active learning

In the active case, the learning algorithm can ask queries to an oracle, called the teacher, which has a (hidden) function L : Σ^ω → ℝ and answers two types of queries:
- expectation queries: given a finite word u, the teacher returns E(L | uΣ^ω);
- ε-consistency queries: given an automaton A, if L_A ε-approximates L (i.e., E(|L − L_A|) ≤ ε), the teacher returns YES; otherwise the teacher returns a word u such that |E(L | uΣ^ω) − E(L_A | uΣ^ω)| > ε.

Remark. Consider functions L₁, L₂ defined by LimAvg-automata. If E(|L₁ − L₂|) > ε, then due to Theorem 1 there is a word u such that E(|L₁ − L₂| | uΣ^ω) > ε and L₁ (resp., L₂) returns E(L₁ | uΣ^ω) (resp., E(L₂ | uΣ^ω)) on almost all words from uΣ^ω. Therefore, |E(L₁ | uΣ^ω) − E(L₂ | uΣ^ω)| = E(|L₁ − L₂| | uΣ^ω) > ε.
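The query interface can be pictured as follows. This is our own illustrative sketch, not the paper's construction; in particular, the randomized search for a counterexample is only one possible approximate realization of a consistency query:

```python
import random

class Teacher:
    """Answers expectation and epsilon-consistency queries for a hidden
    function given as cond_exp(u) = E(L | u Sigma^omega)."""

    def __init__(self, cond_exp, epsilon, sigma, trials=10_000, depth=50):
        self.cond_exp, self.epsilon = cond_exp, epsilon
        self.sigma, self.trials, self.depth = sigma, trials, depth

    def expectation(self, u):
        return self.cond_exp(u)

    def consistency(self, candidate_exp):
        """candidate_exp(u) is the candidate automaton's conditional
        expectation; returns a counterexample word, or None for YES."""
        for _ in range(self.trials):
            n = random.randrange(self.depth)
            u = "".join(random.choice(self.sigma) for _ in range(n))
            if abs(self.cond_exp(u) - candidate_exp(u)) > self.epsilon:
                return u
        return None
```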
In the active learning case we consider two problems: approximate learning and rigid approximate learning. We first define approximate learning.

Problem 10 (Approximate learning). Given ε ∈ ℚ⁺ ∪ {0} and a teacher with a (hidden) function L, construct a LimAvg-automaton A such that L_A ε-approximates L and A has the minimal number of states among such automata.

We define a decision problem, called approximate minimization, which can be solved in polynomial time given a polynomial-time approximate learning algorithm.
Problem 11 (Approximate minimization).
Given a
LimAvg-automaton A, n ∈ ℕ and ε ∈ ℚ⁺, the approximate minimization problem asks whether there exists a LimAvg-automaton A' with at most n states such that E(|L_A − L_{A'}|) ≤ ε.

An efficient learning algorithm can be used to efficiently compute an approximate minimization of a given LimAvg-automaton A: we can run it and compute the answers to the queries of the learning algorithm in time polynomial in |A| [4]. We show that the approximate minimization problem is NP-complete, which means that approximate learning cannot be done in polynomial time unless P = NP.
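Deciding whether a candidate ε-approximates A reduces to computing E(|L_A − L_{A'}|). Appendix A describes a bottom-SCC pairing argument for this; the following is our own value-iteration rendering of it, again assuming both automata have their bottom SCCs contracted to absorbing states:

```python
def expected_difference(d1, init1, d2, init2, alphabet, rounds=10_000):
    """E(|L_A - L_A'|) for two deterministic automata in contracted form:
    value iteration on the product automaton, where an absorbing pair
    contributes the absolute difference of its two values."""
    def absorbing(delta, q):
        return all(delta[(q, a)][0] == q for a in alphabet)
    def value(delta, q):
        return delta[(q, alphabet[0])][1]
    pairs = {(p, q) for (p, _) in d1 for (q, _) in d2}
    fixed = {(p, q) for (p, q) in pairs
             if absorbing(d1, p) and absorbing(d2, q)}
    D = {pq: (abs(value(d1, pq[0]) - value(d2, pq[1]))
              if pq in fixed else 0.0)
         for pq in pairs}
    for _ in range(rounds):
        for (p, q) in pairs - fixed:
            D[(p, q)] = sum(D[(d1[(p, a)][0], d2[(q, a)][0])]
                            for a in alphabet) / len(alphabet)
    return D[(init1, init2)]
```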
Theorem 12. The approximate minimization problem is NP-complete.

Proof sketch.
The problem is contained in NP, as we can non-deterministically pick an automaton with at most n states and check whether it ε-approximates A.

For a vector v ∈ ℝ^m we define ‖v‖ = Σ_{i=1}^m |v[i]|. For NP-hardness, consider the following problem. Binary k-Median Problem (BKMP): given numbers n, m, k, t ∈ ℕ and a set of Boolean vectors C = {v₁, ..., v_n} ⊆ {0, 1}^m, decide whether there are vectors u₁, ..., u_k ∈ {0, 1}^m and a partition D₁, ..., D_k of C such that Σ_{j=1}^k Σ_{v ∈ D_j} ‖v − u_j‖ ≤ t. BKMP has been shown NP-complete in [30]. To ease the reduction, we consider a variant of BKMP, called Modified BKMP, where we assume that the all-zeros vector 0 and the all-ones vector 1 belong to C; Modified BKMP is also NP-complete.

For C = {v₁, ..., v_n} ⊆ {0, 1}^m we define L_C over the alphabet Σ = {a₁, ..., a_m} as follows: for all a_i, a_j ∈ Σ and w ∈ Σ^ω,
- if i ≤ n, we set L_C(a_i a_j w) = v_i[j],
- if i ∈ {n + 1, ..., m}, we set L_C(a_i a_j w) = v_n[j].
Intuitively, we select a vector with the first letter and the vector's component with the second letter. This language can be defined with a tree-like LimAvg-automaton A_C.

We show that if an instance (C, k, t) of Modified BKMP has a solution u₁, ..., u_k and D₁, ..., D_k, then there exists an automaton A' with k + 1 states that t/(nm)-approximates L_C. Let v₁ = u₁ = 0 and v₂ = u₂ = 1. The automaton A' consists of the initial state q₀ and the successors of the initial state, q₁, ..., q_k, which correspond to the vectors u₁, ..., u_k, i.e., q₁ is a bottom SCC of value 0, q₂ is a bottom SCC of value 1, and for i > 2, the state q_i encodes u_i via δ(q_i, a_j) = q₁ if u_i[j] = 0 and δ(q_i, a_j) = q₂ otherwise. The successors of q₀ are defined based on the partition D₁, ..., D_k, i.e., if v_i belongs to D_j, then δ(q₀, a_i) = q_j. Observe that A' ε-approximates A_C.

Conversely, consider an automaton A' with k + 1 states that t/(nm)-approximates L_C. We define vectors p₁, ..., p_m ∈ ℝ^m such that p_i[j] = E(A' | a_i a_j Σ^ω). The structure of Modified BKMP implies that the initial state of A' has no self-loops and hence has at most k different successor states. Therefore, there are at most k different vectors among p₁, ..., p_m. Finally, we observe that, since we consider ‖·‖, w.l.o.g. we can assume that p₁, ..., p_m ∈ {0, 1}^m. Therefore, these vectors give us a solution to the instance (C, k, t) of Modified BKMP. ◀

One of the drawbacks of the standard approximation is that the counterexamples may be dubious, if not useless. We illustrate this with an example.
Example 13 (Dubious counterexamples). Consider a minimal DFA B with 2ⁿ states whose language consists of words of length m, for some m > n, and a word v ∈ Σⁿ. We define a function L_{B,v} : {a, b}^ω → ℝ such that for all u, w:
- L_{B,v}(auw) = 0 if |u| = m and B accepts u,
- L_{B,v}(auw) = 0.3 if |u| = m and B rejects u,
- L_{B,v}(bw) is 0.4 if the first occurrence of v in w is followed by a, and 1 otherwise.

Fix ε = 0.1. Observe that L_{B,v} can be ε-approximated by an automaton A which is faithful to L_{B,v} on bΣ^ω and returns 0.15 for all other words. A has n + O(1) states.

Assume that the teacher gives only counterexamples starting with a, and hence ε-consistency queries do not give any information about the values of words starting with b. The teacher can do this as long as the algorithm does not know the whole B, which takes Ω(|B|) queries to learn. Yet even if the algorithm learns the whole B and returns the best single constant on words starting with b, the expected difference is 0.5 · 0.3 = 0.15 > ε. It follows that to learn the approximation, the algorithm needs to learn something about v.

Suppose that the algorithm did not learn the whole B. Then, to learn something non-trivial about words starting with b, it has to ask an expectation query containing v. Since the learning algorithm is deterministic, it asks the same expectation queries for different words v. Therefore, for every learning algorithm there are words v that can be learned only after asking queries of total length 2^{Ω(|v|)}.

It follows that any learning algorithm has to ask queries of total length Ω(|B|) or 2^{Ω(|v|)}, which totals to 2^{Ω(n)}.

In Example 13 we assumed an antagonistic teacher, which misleads the algorithm on purpose. But even with a stochastic teacher, it is not known whether "fixing" a given random counterexample is a step towards a better approximation. To resolve this issue, we consider a stronger notion of approximation, called rigid approximation, where we require all conditional expected values to be ε-close, i.e., for all words u ∈ Σ* we have E(|L − L'| | uΣ^ω) ≤ ε. In this framework counterexamples are certain, i.e., if for some u ∈ Σ* the expectation over uΣ^ω is more than ε off the intended value, it has to be modified. Formally, we define the problem as follows.

Problem 14 (Rigid approximate learning).
Given ε ∈ ℚ⁺ and a teacher with a (hidden) function L, construct an automaton A such that L_A is a rigid ε-approximation of L and A has the minimal number of states among such automata.

Even though the counterexamples are certain in this framework, we observe that this does not eliminate ambiguity. For instance, there can be multiple automata with the minimal number of states.

[Figure 3: The automaton A (values 0, 0.5, 1 on a, b, c) with two minimal non-equivalent rigid approximations A₁ and A₂, and the automaton A_exp approximated by exponentially many non-equivalent automata.]

Example 15 (Non-unique minimization). Consider the automaton A depicted in Figure 3 and ε = 1/4. Any automaton A' which is a rigid ε-approximation of L_A has at least two bottom SCCs and hence requires at least 3 states. Therefore the automata A₁, A₂ depicted in Figure 3, which ε-approximate L_A, have the minimal number of states. This shows that there can be multiple correct answers in the rigid approximate learning problem. Based on this example, we construct the automaton A_exp, parametrized by n ∈ ℕ and depicted in Figure 3, which has O(n) states and for which there are exponentially many (in n) minimal non-equivalent automata that are rigid 1/n-approximations of L_{A_exp}.

As in the approximate learning case, efficient rigid approximate learning enables us to solve efficiently the following rigid approximate minimization problem.

Problem 16 (Rigid approximate minimization).
Given a
LimAvg-automaton A, n ∈ ℕ and ε ∈ ℚ⁺, construct a LimAvg-automaton A' with at most n states such that for all words u ∈ Σ* we have E(|L_A − L_{A'}| | uΣ^ω) ≤ ε.

First, we consider a naive approach to the rigid approximate minimization problem based on state merging. We start with an input automaton A and merge its states, maintaining the property that the automaton with merged states A' rigidly ε-approximates L_A. We terminate if the automaton is minimal, i.e., merging any two states of A' violates the property. However, it may happen that a state q can be merged either with q₁ or with q₂, but not with both. We show in the following example that the choice of merged states can have a profound impact on the size of a minimal automaton.

Example 17 (Minimal automata of different size). Assume n ∈ ℕ and the function L that returns 0 on words from aΣⁿaΣ^ω, 1 on words from bΣⁿbΣ^ω, and 0.5 on the remaining words. Let A be a minimal automaton defining such a language, as depicted in Figure 4.

Let ε = 1/4. To minimize A, we can merge it to the automaton A_S with 3 states or to A_L with n + 3 states. Observe that A_S and A_L are rigid ε-approximations of L_A. For A_S, it is because for every word au we have E(A | auΣ^ω) ∈ {0, 0.5} and hence it is ε-close to 0.25, and symmetrically for words bu and 0.75. For A_L, it is because for every u of length at least n + 2 the expected value of A_L is 0.75 or 0.25, depending on the (n + 2)-th letter of u, whereas for A it is in {0.5, 1} or in {0, 0.5}, respectively. For shorter u, we simply observe that the difference of expected values w.r.t. a word does not exceed the maximal difference in its suffixes.

The automaton A_L is minimal, as there are no states that can be merged. Therefore, the difference, and even the ratio, between the sizes of both automata is unbounded.

We show that the rigid approximate minimization problem is NP-complete, which implies that there is no polynomial-time rigid approximate learning algorithm (unless P = NP).
[Figure 4: Minimal automata of different size.]
Theorem 18.
The rigid approximate minimization problem is NP-complete.

Proof sketch.
The rigid approximate minimization problem is in NP. We show that it is NP-hard. We define a problem which is an intermediate step in our reduction. Given n, k ∈ ℕ and vectors C = {v₁, ..., v_n} ⊆ {0, 1/2, 1}ⁿ, the 1/4-vector cover problem asks whether there exist u₁, ..., u_k ∈ ℝⁿ such that for every vector v ∈ C there is j such that ‖v − u_j‖_∞ ≤ 1/4.

The 1/4-vector cover problem is NP-complete; the NP-hardness is via a reduction from the dominating set problem. We show that the 1/4-vector cover problem reduces to the rigid approximate minimization problem.

Let C = {v₁, ..., v_n} ⊆ {0, 1/2, 1}ⁿ. We define L_C over the alphabet Σ = {a₁, ..., a_n} such that for all a_i, a_j ∈ Σ and w ∈ Σ^ω we have L_C(a_i a_j w) = v_i[j]. Such L_C can be defined with a tree-like LimAvg-automaton A_C.

Let ε = 1/4. We show that an instance (n, k, C) of the 1/4-vector cover problem has a solution if and only if A_C has a rigid ε-approximation with k + 3 states.

Assume that there is an automaton A' with k + 3 states such that for all words u ∈ Σ* we have E(|L_{A_C} − L_{A'}| | uΣ^ω) ≤ ε. Observe that the initial state of A' has at most k different successors. To see that, consider the functions L_u(w) = L_{A_C}(uw) defined for u ∈ Σ*. If for some u₁, u₂, u ∈ Σ* we have E(|L_{u₁u} − L_{u₂u}|) > 2ε, then the words u₁ and u₂ lead (from the initial state) to different states in A'. It follows that A' has at least 3 states that are not successors of the initial state: the initial state and (at least two) states that correspond to bottom SCCs. We define the vectors u₁, ..., u_k from the successors of the initial state of A'. Formally, for i ∈ {1, ..., n} we define a vector y_i as y_i[j] = E(L_{A'} | a_i a_j Σ^ω). There are at most k different vectors among y₁, ..., y_n, and we take these distinct vectors as u₁, ..., u_k. The condition E(|L_{A_C} − L_{A'}| | uΣ^ω) ≤ ε implies that u₁, ..., u_k form a solution to (n, k, C).

Conversely, assume that the instance (n, k, C) has a solution. Observe that w.l.o.g. we can assume that the solution vectors u₁, ..., u_k belong to {0.25, 0.75}ⁿ. Based on this solution we define A' with k + 3 states q₀, q₁, ..., q_k, s₁, s₂ such that q₀ is initial, q₁, ..., q_k are the successors of q₀, and s₁, s₂ are single-state bottom SCCs of values 0.25 and 0.75. The states q₁, ..., q_k correspond to the vectors u₁, ..., u_k, i.e., the successors of q_i encode u_i via δ(q_i, a_j) = s₁ if u_i[j] = 0.25 and δ(q_i, a_j) = s₂ otherwise. The successors of q₀ are defined based on the matching from the vector cover, i.e., if δ(q₀, a_i) = q_j, then u_j is 1/4-close to v_i. Observe that A' is a rigid 1/4-approximation of A_C. ◀

Almost-exact learning. The almost-exact learning problem is defined as follows.
Problem 19 (Almost-exact learning).
Given a teacher with a (hidden) function L, construct a LimAvg-automaton A such that L_A is a 0-approximation of L.

Notice that for functions L and L_A, the following conditions are equivalent:
- L_A is a 0-approximation of L,
- L_A is a rigid 0-approximation of L,
- P({w : L_A(w) ≠ L(w)}) = 0,
- P({w : L_A(uw) ≠ L(uw)}) = 0 for each u.

We show that there is a polynomial-time algorithm for almost-exact learning.

Theorem 20.
The almost-exact learning problem for
LimAvg-automata can be solved in polynomial time in the size of the minimal automaton that is almost equivalent to the target function L and in the maximal length of the counterexamples returned by the teacher.

Proof.
We define a relation ≡_L on finite words u, v ∈ Σ* as follows: u ≡_L v if and only if P({w : L(uw) ≠ L(vw)}) = 0. A quick check shows that ≡_L is an equivalence relation. We show that ≡_L is a right congruence, i.e., if u ≡_L v, then for all a we have ua ≡_L va. Indeed, consider X₁ = {w : L(uaw) ≠ L(vaw)} and X₂ = {w : L(uw) ≠ L(vw)}. Note that u ≡_L v implies P(X₂) = 0. For all w, if w ∈ X₁, then aw ∈ X₂. It follows that (under the uniform distribution) P(X₁) ≤ |Σ| · P(X₂) = 0. Thus, P(X₁) = 0 and ua ≡_L va.

We now show a counterpart of the Myhill-Nerode theorem: there is a LimAvg-automaton with n states defining almost-exactly L if and only if the index of ≡_L is at most n.

If A defines almost-exactly L, then the index of ≡_L is bounded by the number of states of A. Indeed, if for words u, v the automaton A, starting from the initial state, ends up in the same state, then for all words w we have L_A(uw) = L_A(vw) and hence u ≡_L v. Conversely, assume that ≡_L has a finite index. Then, we construct a LimAvg-automaton A_{≡_L} corresponding to ≡_L. The states of A_{≡_L} are the equivalence classes [u] of ≡_L, and A_{≡_L} has a transition from [u] to [v] over a if and only if [ua] = [v]. Observe that, due to Theorem 1, if [u] and [v] are in a common bottom SCC, then [u] = [v]. Then, for every bottom SCC [u] we assign to all its outgoing transitions (which are self-loops) the value E(L | uΣ^ω). For the remaining transitions we set the weights to 0 (due to Theorem 1 these weights are irrelevant, as they do not change any of the expected values). Observe that A_{≡_L} computes L.

A classical result of [1] states that DFA can be learned in polynomial time using membership and equivalence queries. We adapt this result here. The learning algorithm for LimAvg-automata maintains a pair (Q, T), where Q, the set of access words, contains representatives of different classes of the right congruence relation, and T, the set of test words, contains words that approximate the right congruence relation. T defines the relation ≡_T such that u₁ ≡_T u₂ if and only if for all v ∈ T we have P({w | L(u₁vw) ≠ L(u₂vw)}) = 0.

The algorithm maintains two properties: separability, i.e., all distinct words u₁, u₂ ∈ Q belong to different equivalence classes of ≡_T; and closedness, i.e., for all u₁ ∈ Q and a ∈ Σ there is u₂ ∈ Q with u₁a ≡_T u₂. Each separable and closed pair (Q, T) defines a LimAvg-automaton A_{Q,T}, which can be tested for 0-consistency against the teacher's function. If the teacher provides a counterexample u, then it is used to extend Q and T. To do so, we split u into v₁a·v₂, where v₁a is the minimal prefix of u on which the expected values diverge, i.e., E(L | v₁av₂Σ^ω) ≠ E(A_{Q,T} | v₁av₂Σ^ω). Let v̂₁ be a word from Q that is ≡_T-equivalent to v₁. We take Q' = Q ∪ {v̂₁a}, T' = T ∪ {v₂}, close (Q', T') using expectation queries, and test the equivalence again. We repeat this until we obtain a LimAvg-automaton defining a function almost equivalent to L.

The proof of correctness of this algorithm is a straightforward modification of the proof of correctness of the algorithm from [1], and thus it is omitted. ◀
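To illustrate the loop, here is a condensed sketch in the style of Angluin's algorithm. It is our own rendering: equality of teacher expectations stands in for the measure-zero test defining ≡_T, the hypothesis is kept as a plain transition dictionary, and the teacher object with `expectation` and `consistency` methods is an assumed interface (see the earlier Teacher sketch).

```python
def learn(sigma, teacher):
    Q, T = [""], [""]                    # access words and test words

    def row(u):                          # signature of u on the tests in T
        return tuple(teacher.expectation(u + t) for t in T)

    while True:
        # close (Q, T): every one-letter extension of an access word must
        # share its signature with some access word
        rows = {row(u) for u in Q}
        fresh = [u + a for u in Q for a in sigma if row(u + a) not in rows]
        if fresh:
            Q.append(fresh[0])
            continue
        hypothesis = {                   # automaton A_{Q,T}; states = rows
            "init": row(""),
            "delta": {(row(u), a): row(u + a) for u in Q for a in sigma},
            "out": {row(u): teacher.expectation(u) for u in Q},
        }
        cex = teacher.consistency(hypothesis)
        if cex is None:
            return hypothesis
        # counterexample processing: find the minimal prefix v1*a whose
        # replacement by an equivalent access word changes the answer
        def surrogate(i):                # access word equivalent to cex[:i]
            r = row(cex[:i])
            return next(u for u in Q if row(u) == r)
        for i in range(1, len(cex) + 1):
            if (teacher.expectation(surrogate(i) + cex[i:]) !=
                    teacher.expectation(surrogate(i - 1) + cex[i - 1:])):
                Q.append(surrogate(i - 1) + cex[i - 1])   # v1^ followed by a
                T.append(cex[i:])                          # v2
                break
```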
5 Learning over different distributions

So far we have only discussed the uniform distribution on words. Here we briefly discuss whether Theorem 20 can be generalized to arbitrary distributions, represented by Markov chains. We assume that the Markov chain is given to the learning algorithm in the input.

Markov chains.
A (finite-state discrete-time)
Markov chain is a tuple ⟨Σ, S, s₀, E⟩, where Σ is the alphabet of letters, S is a finite set of states, s₀ is an initial state, and E : S × Σ × S → [0, 1] is a probabilistic transition function: every s ∈ S satisfies Σ_{a ∈ Σ, s' ∈ S} E(s, a, s') = 1.

The probability of a finite word u w.r.t. a Markov chain M, denoted by P_M(u), is the sum of the probabilities of the paths from s₀ labeled by u, where the probability of a path is the product of the probabilities of its edges. For basic open sets u·Σ^ω = { uw | w ∈ Σ^ω }, we have P_M(u·Σ^ω) = P_M(u), and the probability measure over infinite words defined by M is the unique extension of the above measure [19]. We denote the unique probability measure defined by M by P_M.

Observe that the uniform distribution can be expressed with a (single-state) Markov chain, and hence all the lower bounds from Section 3 and Section 4 still hold.
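For instance, P_M(u) can be evaluated by the standard forward dynamic program (a sketch under our own dictionary encoding of E):

```python
def word_probability(E, s0, u):
    """P_M(u): sum over all paths from s0 labeled by u of the product of
    edge probabilities; E[(s, a, t)] is the probability of that edge."""
    dist = {s0: 1.0}
    for a in u:
        new = {}
        for s, p in dist.items():
            for (s1, a1, t), q in E.items():
                if s1 == s and a1 == a:
                    new[t] = new.get(t, 0.0) + p * q
        dist = new
    return sum(dist.values())

# a single-state chain expressing the uniform distribution over {a, b}
E = {("s", "a", "s"): 0.5, ("s", "b", "s"): 0.5}
print(word_probability(E, "s", "ab"))   # 0.25 = |Sigma|^{-2}
```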
Non-vanishing Markov chains. A Markov chain M is non-vanishing if for all words u ∈ Σ* we have P_M(uΣ^ω) > 0. The almost-exact active learning over distributions given by non-vanishing Markov chains can be solved in polynomial time using Theorem 20: for every measurable set X and non-vanishing Markov chain M, we have P_M(X) > 0 if and only if P(X) > 0. Thus, almost-exact learning over M and over the uniform distribution coincide.

General Markov chains.
The proof for the uniform distribution does not extend to vanishing Markov chains, because the relation ≡_L is then not a right congruence. This cannot be simply fixed, as we show that learning cannot be done in polynomial time unless P = NP. We define the almost-exact minimization problem as an instance of the approximate minimization problem with ε = 0. Having a polynomial-time algorithm for almost-exact learning, we could solve the almost-exact minimization problem in polynomial time.

Theorem 21.
The almost-exact minimization problem for
LimAvg-automata under distributions given by Markov chains is NP-complete.

Proof.
The problem is in NP, as we can non-deterministically pick an automaton A' and check in polynomial time whether the LimAvg-automata A and A' are almost equivalent w.r.t. a given Markov chain.

We reduce the sample fitting problem, which is NP-complete for E-samples (Theorem 9), to the almost-exact minimization problem. Consider a finite E-sample S based on words u₁, ..., u_k. Let M_S be a Markov chain which assigns probability 1/k to u_iΣ^ω, for each i, and 0 to words not starting with any of the u_i. On each u_iΣ^ω, M_S defines the uniform distribution. Let A_S be a tree-like LimAvg-automaton consistent with S. Both M_S and A_S are of polynomial size in S. Observe that every automaton A which is consistent with S (according to the uniform distribution over infinite words) is almost equivalent to A_S (over P_{M_S}), and vice versa. Therefore, there is an automaton with n states almost equivalent to A_S (under the distribution P_{M_S}) if and only if the sample fitting problem with S and n has a solution. ◀

References
[1] Dana Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75(2):87–106, 1987.
[2] Dana Angluin and Dongqu Chen. Learning a random DFA from uniform strings and state information. In ALT 2015, pages 119–133, Cham, 2015. Springer International Publishing.
[3] Dana Angluin and Dana Fisman. Learning regular omega languages. Theor. Comput. Sci., 650:57–72, 2016.
[4] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, 2008.
[5] Borja Balle and Mehryar Mohri. Learning weighted automata. In CAI 2015, pages 1–21, 2015.
[6] Borja Balle and Mehryar Mohri. On the Rademacher complexity of weighted automata. In ALT 2015, pages 179–193, 2015.
[7] Borja Balle and Mehryar Mohri. Generalization bounds for learning weighted automata. Theor. Comput. Sci., 716:89–106, 2018.
[8] Borja Balle, Prakash Panangaden, and Doina Precup. A canonical form for weighted automata and applications to approximate minimization. In LICS 2015, pages 701–712, 2015.
[9] Amos Beimel, Francesco Bergadano, Nader Bshouty, Eyal Kushilevitz, and Stefano Varricchio. Learning functions represented as multiplicity automata. Journal of the ACM, 47, 1999.
[10] Michael Benedikt, Gabriele Puppis, and Cristian Riveros. Regular repair of specifications. In LICS 2011, pages 335–344, 2011.
[11] Udi Boker, Krishnendu Chatterjee, Thomas A. Henzinger, and Orna Kupferman. Temporal specifications with accumulative values. ACM Trans. Comput. Log., 15(4):27:1–27:25, 2014.
[12] Benedikt Bollig, Peter Habermehl, Carsten Kern, and Martin Leucker. Angluin-style learning of NFA. In IJCAI 2009, pages 1004–1009, 2009.
[13] Patricia Bouyer, Nicolas Markey, and Raj Mohan Matteplackel. Averaging in LTL. In CONCUR 2014, pages 266–280, 2014.
[14] Pavol Cerný, Thomas A. Henzinger, and Arjun Radhakrishna. Quantitative abstraction refinement. In POPL 2013, pages 115–128, 2013.
[15] Krishnendu Chatterjee, Laurent Doyen, and Thomas A. Henzinger. Quantitative languages. ACM TOCL, 11(4):23, 2010.
[16] Krishnendu Chatterjee, Thomas A. Henzinger, and Jan Otop. Quantitative automata under probabilistic semantics. In LICS 2016, pages 76–85, 2016.
[17] Colin de la Higuera. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press, New York, NY, USA, 2010.
[18] Manfred Droste, Werner Kuich, and Heiko Vogler. Handbook of Weighted Automata. Springer, 1st edition, 2009.
[19] W. Feller. An Introduction to Probability Theory and Its Applications. Wiley, 1971.
[20] Dana Fisman. Inferring regular languages and ω-languages. J. Log. Algebr. Meth. Program., 98:27–49, 2018.
[21] E. Mark Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.
[22] Amaury Habrard and José Oncina. Learning multiplicity tree automata. In ICGI 2006, pages 268–280, 2006.
[23] Thomas A. Henzinger and Jan Otop. Model measuring for discrete and hybrid systems. Nonlinear Analysis: Hybrid Systems, 23:166–190, 2017.
[24] Jeffrey C. Jackson. An efficient membership-query algorithm for learning DNF with respect to the uniform distribution. Journal of Computer and System Sciences, 55(3):414–440, 1997.
[25] Michael Kearns and Leslie Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67–95, 1994.
[26] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[27] Kevin J. Lang. Random DFA's can be approximately learned from sparse uniform examples. In COLT 1992, pages 45–52, 1992.
[28] Ines Marusic and James Worrell. Complexity of equivalence and learning for multiplicity tree automata. Journal of Machine Learning Research, 16:2465–2500, 2015.
[29] Jakub Michaliszyn and Jan Otop. Non-deterministic weighted automata on random words. In CONCUR 2018, volume 118 of LIPIcs, pages 10:1–10:16, 2018.
[30] Pauli Miettinen, Taneli Mielikäinen, Aristides Gionis, Gautam Das, and Heikki Mannila. The discrete basis problem. In European Conference on Principles of Data Mining and Knowledge Discovery, pages 335–346. Springer, 2006.
[31] Joshua Moerman, Matteo Sammartino, Alexandra Silva, Bartek Klin, and Michal Szynwelski. Learning nominal automata. In POPL 2017, pages 613–625, 2017.
[32] José Oncina and Pedro Garcia. Inferring regular languages in polynomial updated time. In Pattern Recognition and Image Analysis: Selected Papers from the IVth Spanish Symposium, pages 49–61. World Scientific, 1992.
[33] Leonard Pitt. Inductive inference, DFAs, and computational complexity. In International Workshop on Analogical and Inductive Inference, pages 18–44. Springer, 1989.
[34] Leonard Pitt and Manfred K. Warmuth. The minimum consistent DFA problem cannot be approximated within any polynomial. Journal of the ACM (JACM), 40(1):95–142, 1993.
[35] Leslie G. Valiant. A theory of the learnable. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 436–445. ACM, 1984.
Appendix

A NP-completeness of approximate minimization
Theorem 12.
The approximate minimization problem is NP-complete.

Proof.
The problem is contained in NP, as we can non-deterministically pick an automaton A' with n states and check whether it ε-approximates A. This can be done in polynomial time in the following way. For all bottom SCCs C of A and D of A', we compute the probability p_{C,D} of the set of infinite words w such that A reaches C over w and A' reaches D over w. Let x_C be the expected value of C in A, which is attained by almost all words that reach C, and let y_D be the corresponding value for D in A'. The values p_{C,D}, x_C, y_D can be computed in polynomial time [4]. Then, the expected difference E(|L_A − L_{A'}|) is the sum over all bottom SCCs C of A and D of A' of p_{C,D} · |x_C − y_D|.

For NP-hardness, consider the following problem. Given a vector v ∈ ℝ^m, we define ‖v‖ = Σ_{i=1}^m |v[i]|.

Binary k-Median Problem (BKMP): given numbers n, m, k, t ∈ ℕ and a set of Boolean vectors C = {v₁, ..., v_n} ⊆ {0, 1}^m, decide whether there are vectors u₁, ..., u_k ∈ {0, 1}^m and a partition D₁, ..., D_k of C such that Σ_{j=1}^k Σ_{v ∈ D_j} ‖v − u_j‖ ≤ t.

BKMP has been shown NP-complete in [30]. To ease the reduction, we consider a variant of BKMP.

Modified Binary k-Median Problem (Modified BKMP): given numbers n, m, h, k, t ∈ ℕ, with h > max(2t, m, n), and a set of Boolean vectors C = {v₁, ..., v_n} ⊆ {0, 1}^{2m+2h} which satisfies the following conditions: (C1) C contains the all-zeros vector 0 and the all-ones vector 1, (C2) each vector v ∈ C \ {0, 1} has as many 0's as 1's, and (C3) for every vector v ∈ C \ {0, 1} we have v[2m + 1] = ... = v[2m + h] = 1 and v[2m + h + 1] = ... = v[2m + 2h] = 0; decide whether there are vectors u₁, ..., u_k ∈ {0, 1}^{2m+2h}, including 0 and 1, and a partition D₁, ..., D_k of C such that Σ_{j=1}^k Σ_{v ∈ D_j} ‖v − u_j‖ ≤ t.

We reduce BKMP to Modified BKMP. Consider an instance (C, k, t) of BKMP. For every vector v ∈ {0, 1}^m we define v* ∈ {0, 1}^{2m+2h} such that v*[1, m] = v, v*[m + 1, 2m] = 1 − v, v*[2m + 1, 2m + h] = 1 and v*[2m + h + 1, 2m + 2h] = 0. Then, we define C* = {v* | v ∈ C} ∪ {0, 1} and an instance (C*, k + 2, t) of Modified BKMP. Observe that if (C, k, t) has a solution, then (C*, k + 2, t) has a solution as well, which satisfies the additional assumptions above. Conversely, if (C*, k + 2, t) has a solution, then this solution contains 0 and 1, and for every v* ∈ C* we have ‖v* − 0‖ = ‖v* − 1‖ = m + h > t. It also follows that in the partition D₁, ..., D_{k+2}, the two sets that correspond to 0 and 1 contain only these vectors. Therefore, a solution of (C*, k + 2, t) can be transformed into a solution of (C, k, t).

Let C = {v₁, ..., v_n} ⊆ {0, 1}^{2m+2h}. Let M = 2m + 2h and N = (M − n)/2 (so that M = n + 2N), and assume that n is even. We know that h > t and hence N > t.

We define L_C over the alphabet Σ = {a₁, ..., a_M} as follows: for all a_i, a_j ∈ Σ, w ∈ Σ^ω,
- if i ≤ n, we set L_C(a_i a_j w) = v_i[j],
- if i ∈ {n + 1, ..., n + N}, we set L_C(a_i a_j w) = 0,
- if i ∈ {n + N + 1, ..., M}, we set L_C(a_i a_j w) = 1.
This language can be defined with a tree-like LimAvg-automaton A_C. In this automaton, the states reached over a₁, ..., a_n, i.e., the states δ(q₀, a₁), ..., δ(q₀, a_n), correspond to the vectors v₁, ..., v_n via E(A_C | a_i a_j Σ^ω) = v_i[j]. The vectors 0 and 1 are each represented N times.

First, consider a solution u₁, ..., u_k and D₁, ..., D_k for (C, k, t), with u₁ = 0 and u₂ = 1.
We construct an automaton A_D with an initial state q₀ and k states q₁, ..., q_k. First, for all i ∈ {1, ..., M}, we set δ(q₀, a_i) = q_j if the vector represented by a_i (that is, v_i for i ≤ n, the vector 0 for i ∈ {n + 1, ..., n + N}, and the vector 1 for i > n + N) belongs to D_j. Next, q₁ and q₂ correspond to the vectors 0 and 1: for every a_i ∈ Σ we set δ(q₁, a_i) = q₁ with weight 0 and δ(q₂, a_i) = q₂ with weight 1. Finally, for all i > 2 and a_j ∈ Σ, we define δ(q_i, a_j) = q₁ if u_i[j] = 0 and δ(q_i, a_j) = q₂ otherwise. Observe that Σ_{j=1}^k Σ_{v ∈ D_j} ‖v − u_j‖ ≤ t implies that A_D is a t/(nM)-approximation of L_C.

Conversely, assume that A' is the optimal approximation of L_C among automata with k + 1 states, i.e., A' ε-approximates L_C, where ε ≤ t/(nM), and no automaton with at most k + 1 states approximates L_C with a smaller ε.

Consider vectors p₁, ..., p_M ∈ ℝ^M defined as follows: for all i, j ∈ {1, ..., M} we put p_i[j] = E(A' | a_i a_j Σ^ω). Since A' t/(nM)-approximates L_C, we know that

Σ_{i=1}^{n} ‖p_i − v_i‖ + Σ_{i=n+1}^{n+N} ‖p_i − 0‖ + Σ_{i=n+N+1}^{M} ‖p_i − 1‖ ≤ tM/n. (*)

We need to show that there are at most k different vectors among the p_i and that among these vectors there are 0 and 1. First, w.l.o.g. all states of A' are reachable from the initial state with some word of length at most 2. Indeed, we can change states which are in distance 2 from the initial state into bottom SCCs with values being the expected values from the old states. Second, the initial state cannot be a bottom SCC, and hence all bottom SCCs are in distance 1 or 2 from the initial state. We can change the values of all bottom SCCs to 0 or 1. To see that, observe that we can take the values of all bottom SCCs and consider these values as variables x₁, ..., x_l. Then, these values appear in convex combinations in (*). However, since all constants in (*) are 0 and 1, the left-hand side of (*) is minimized under a substitution mapping x₁, ..., x_l to {0, 1}.

Knowing that all bottom SCCs have values 0 and 1, we observe that we need precisely two bottom SCCs s₁, s₂ of values 0 and 1. It only improves the approximation if s₁ is the successor of the initial state over a_{n+1}, ..., a_{n+N} and s₂ is the successor of the initial state over a_{n+N+1}, ..., a_{n+2N}. Therefore, p_{n+1} = ... = p_{n+N} = 0 and p_{n+N+1} = ... = p_M = 1. (**)

Suppose that the initial state q₀ of A' has a self-loop over a_i. We know that i ≤ n, as otherwise we would have E(A') ∈ {0, 1} while E(L_C) = 1/2; a contradiction. Consider the vector y defined as y[j] = E(A' | a_i a_j Σ^ω). Observe that due to (**) we have y[n + 1] = ... = y[n + N] = 0 and y[n + N + 1] = ... = y[n + 2N] = 1. Condition (C3) of Modified BKMP implies that ‖y − v_i‖ ≥ h, while condition (C2) implies that ‖0 − v_i‖ = ‖1 − v_i‖ = M/2. Since h > max(2t, m, n), changing the self-loop (q₀, a_i, q₀) into a transition (q₀, a_i, q') to the state q' that corresponds to 0 improves the approximation of L_C; a contradiction. It follows that q₀ has at most k successors.

Then, p₁, ..., p_n give us a solution to C. First, observe that we can replace p₁, ..., p_n by vectors from {0, 1}^M without increasing the left-hand side of (*); hence, we assume that they belong to {0, 1}^M. Second, we select the distinct vectors among p₁, ..., p_n as u₁, ..., u_k and define D_i = {v_j | p_j = u_i}. Then, (*) implies that this is a solution to (C, k, t). ◀

B NP-completeness of rigid approximate minimization
▶ Theorem 18. The rigid approximate minimization problem is NP-complete.

Proof sketch.
The problem is in NP: we can non-deterministically pick an automaton $A_2$ with $n$ states and check whether it is a rigid $\epsilon$-approximation of $A_1$. Observe that given two automata $A_1, A_2$, we can check whether $A_2$ is a rigid $\epsilon$-approximation of $A_1$ in polynomial time. Indeed, let $P$ be the set of pairs of states defined as follows: $(q, s) \in P$ if and only if $q$ is a state of $A_1$, $s$ is a state of $A_2$, and both are reached in the respective automata from the (respective) initial states over a common word $u$. Observe that $A_2$ is a rigid $\epsilon$-approximation of $A_1$ if and only if for every pair $(q, s) \in P$ the automaton $A_2$ starting from $s$ $\epsilon$-approximates $A_1$ starting from the state $q$. We have shown in the proof of Theorem 12 (in Appendix A) that we can decide $\epsilon$-approximation in polynomial time. Therefore, we can decide rigid $\epsilon$-approximation in polynomial time as well.

We show that the problem is NP-hard. To ease the presentation, we define the following $\frac{1}{2}$-vector cover problem, which is an intermediate step in our reduction.

The $\frac{1}{2}$-vector cover problem: given $n, k \in \mathbb{N}$ and vectors $\mathcal{C} = \{\vec{v}_1, \ldots, \vec{v}_n\} \subseteq \{0, \frac{1}{2}, 1\}^n$, decide whether there exist $\vec{u}_1, \ldots, \vec{u}_k \in \mathbb{R}^n$ such that for every vector $\vec{v} \in \mathcal{C}$ there is $j$ such that $\|\vec{v} - \vec{u}_j\|_\infty \leq \frac{1}{4}$.

The $\frac{1}{2}$-vector cover problem is related to BKMP presented in the proof of Theorem 12. We show two reductions, which together establish NP-hardness.

The $\frac{1}{2}$-vector cover problem reduces to the rigid approximate minimization problem. Let $\mathcal{C} = \{\vec{v}_1, \ldots, \vec{v}_n\} \subseteq \{0, \frac{1}{2}, 1\}^n$. We define $L_\mathcal{C}$ over the alphabet $\Sigma = \{a_1, \ldots, a_n\}$ such that for all $a_i, a_j \in \Sigma$ and $w \in \Sigma^\omega$ we have $L_\mathcal{C}(a_i a_j w) = \vec{v}_i[j]$. Such an $L_\mathcal{C}$ can be defined by a tree-like LimAvg-automaton $A_\mathcal{C}$, defined as follows. It has three single-state bottom SCCs $p_0, p_{0.5}, p_1$ with the expected values $0, \frac{1}{2}, 1$. From the initial state $q_0$, the automaton $A_\mathcal{C}$ moves over the letters $a_i$ to $n$ different states. Then, over any two-letter word $a_i a_j$, the automaton $A_\mathcal{C}$ ends up in the single-state bottom SCC of the value $\vec{v}_i[j]$. Therefore, this automaton has $n + 4$ states: the initial state $q_0$, the $n$ different successors $s_1, \ldots, s_n$ of $q_0$, and the states $p_0, p_{0.5}, p_1$.

Let $\epsilon = \frac{1}{4}$. We show that an instance $(n, k, \mathcal{C})$ of the $\frac{1}{2}$-vector cover problem has a solution if and only if $A_\mathcal{C}$ has a rigid $\epsilon$-approximation with $k + 3$ states.

First, assume that there is an automaton $A$ that has at most $k + 3$ states and for all words $u \in \Sigma^*$ we have $\mathbb{E}(|L_{A_\mathcal{C}} - L_A| \mid u\Sigma^\omega) \leq \epsilon$. Observe that the initial state of $A$ has at most $k$ different successors. To see this, consider the functions $L_u(w) = L_{A_\mathcal{C}}(uw)$ defined for $u \in \Sigma^*$. If for some $u_1, u_2, u \in \Sigma^*$ we have $\mathbb{E}(|L_{u_1 u} - L_{u_2 u}|) > 2\epsilon$, then the words $u_1$ and $u_2$ lead (from the initial state) to different states in $A$. Using this observation, we can show that $A$ has at least 3 states that cannot be successors of the initial state: the initial state itself and (at least two) states that correspond to bottom SCCs.

Finally, we define the vectors $\vec{u}_1, \ldots, \vec{u}_k$ from the successors of the initial state of $A$. Formally, for $i \in \{1, \ldots, n\}$ we define a vector $\vec{y}_i$ by $\vec{y}_i[j] = \mathbb{E}(L_A \mid a_i a_j \Sigma^\omega)$. Note that there are at most $k$ different vectors among $\vec{y}_1, \ldots, \vec{y}_n$, and we take these distinct vectors as $\vec{u}_1, \ldots, \vec{u}_k$. The condition $\mathbb{E}(|L_{A_\mathcal{C}} - L_A| \mid u\Sigma^\omega) \leq \epsilon$ implies that $\vec{u}_1, \ldots, \vec{u}_k$ satisfy the $\frac{1}{2}$-vector cover problem.

Conversely, assume that the instance $(n, k, \mathcal{C})$ has a solution. Observe that w.l.o.g. we can assume that the solution vectors $\vec{u}_1, \ldots, \vec{u}_k$ belong to $\{\frac{1}{4}, \frac{3}{4}\}^n$.
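The w.l.o.g. step follows from a short case analysis: for $v \in \{0, \frac{1}{2}, 1\}$ and $|u - v| \leq \frac{1}{4}$, replacing $u$ by the nearer of $\frac{1}{4}$ and $\frac{3}{4}$ keeps the distance at most $\frac{1}{4}$. The following Python sketch is ours, not from the paper, with illustrative names; it makes the rounding and the cover condition explicit.

```python
# Sketch of the 1/2-vector cover condition and the w.l.o.g. rounding.
def round_to_quarters(u):
    # Rounding each coefficient to the nearer of 1/4 and 3/4 never increases
    # the L_inf distance to a vector over {0, 1/2, 1}.
    return [0.25 if x < 0.5 else 0.75 for x in u]

def covers(u, v):
    # u serves v when their L_inf distance is at most 1/4
    return max(abs(a - b) for a, b in zip(u, v)) <= 0.25

def is_cover(us, vectors):
    us = [round_to_quarters(u) for u in us]
    return all(any(covers(u, v) for u in us) for v in vectors)
```

Such a check can serve as a sanity test for candidate solutions before the automaton described next is built from them.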
Based on this solution, we define an automaton $A$ with $k + 3$ states such that the successors of the initial state correspond to the vectors $\vec{u}_1, \ldots, \vec{u}_k$ and there are two bottom SCCs of the values $\frac{1}{4}$ and $\frac{3}{4}$. We define $A$ as follows. Let $f \colon \{1, \ldots, n\} \to \{1, \ldots, k\}$ be a mapping of the vectors $\vec{v}_i \in \mathcal{C}$ to vectors $\vec{u}_j$ such that $\|\vec{v}_i - \vec{u}_j\|_\infty \leq \frac{1}{4}$ (and then $f(i) = j$). We define $A$ with states $q_0, q_1, \ldots, q_k, s_1, s_2$ such that $q_0$ is the initial state, $q_1, \ldots, q_k$ are the successors of the initial state, and $s_1, s_2$ are single-state bottom SCCs of the values $\frac{1}{4}$ and $\frac{3}{4}$, respectively. We define the transition function as follows. For all $a_i \in \Sigma$ we set $\delta(q_0, a_i) = q_{f(i)}$. Then, for every $i \in \{1, \ldots, k\}$ and $j \in \{1, \ldots, n\}$ we set $\delta(q_i, a_j) = s_1$ if $\vec{u}_i[j] = \frac{1}{4}$ and $\delta(q_i, a_j) = s_2$ otherwise (i.e., if $\vec{u}_i[j] = \frac{3}{4}$). The fact that $\vec{u}_1, \ldots, \vec{u}_k$ is a solution to the $\frac{1}{2}$-vector cover problem implies that $A$ is a rigid $\frac{1}{4}$-approximation of $A_\mathcal{C}$.

The dominating set problem reduces to the $\frac{1}{2}$-vector cover problem. Consider a graph $G = (V, E)$ with $V = \{b_1, \ldots, b_n\}$ and $k \in \mathbb{N}$. We assume that $G$ has no cycles of length less than 5; this restriction does not influence the NP-hardness of the problem, since each edge can be substituted with a path of odd length greater than 5. Denote by $N_G^k(b)$ the set of all nodes of $G$ connected to $b$ with a path of length at most $k$; every $b$ is connected with itself by a path of length 0 and hence $b \in N_G^k(b)$. We define vectors $\vec{v}_1, \ldots, \vec{v}_n \in \{0, \frac{1}{2}, 1\}^n$ as follows: for $i, j \in \{1, \ldots, n\}$ we set $\vec{v}_j[i] = 1$ if $i = j$, $\vec{v}_j[i] = \frac{1}{2}$ if $i \neq j$ but $b_i \in N_G^2(b_j)$, and $\vec{v}_j[i] = 0$ otherwise.
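For concreteness, the instance construction can be written down directly. The sketch below is ours, not from the paper; it assumes a 0-indexed vertex set and an edge list, and computes the vectors $\vec{v}_1, \ldots, \vec{v}_n$ from the distance-2 balls of $G$.

```python
# Sketch: build the 1/2-vector cover instance from a graph G with n nodes.
def graph_to_vectors(n, edges):
    adj = [set() for _ in range(n)]
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    # ball2[j]: nodes within distance at most 2 of node j (the set N_G^2(b_j))
    ball2 = [{j} | adj[j] | {l for i in adj[j] for l in adj[i]}
             for j in range(n)]
    return [[1.0 if i == j else (0.5 if i in ball2[j] else 0.0)
             for i in range(n)]
            for j in range(n)]
```

Combined with the cover check sketched earlier, this allows a quick test of the forward direction: the vector $\vec{u}_i$ obtained from a dominating-set node $d_i$ has the coefficient $\frac{3}{4}$ exactly on the closed neighbourhood $N_G^1(d_i)$ and $\frac{1}{4}$ elsewhere.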
We claim that there exist $\vec{u}_1, \ldots, \vec{u}_k \in \mathbb{R}^n$ as in the problem statement if and only if $G$ has a dominating set of size $k$.

Assume that $G$ has a dominating set $d_1, \ldots, d_k$. Consider $\vec{u}_1, \ldots, \vec{u}_k$ such that for all $i \in \{1, \ldots, k\}$ we have $\vec{u}_i[j] = \frac{3}{4}$ if $d_i = b_j$ or $(d_i, b_j) \in E$, and $\vec{u}_i[j] = \frac{1}{4}$ otherwise. Observe that for all $i \in \{1, \ldots, k\}$ and $j \in \{1, \ldots, n\}$, if $i$ satisfies $d_i = b_j$ or $(b_j, d_i) \in E$, then $\|\vec{v}_j - \vec{u}_i\|_\infty \leq \frac{1}{4}$. Therefore, the vectors $\vec{u}_1, \ldots, \vec{u}_k$ solve the $\frac{1}{2}$-vector cover problem.

Conversely, assume that there exist $\vec{u}_1, \ldots, \vec{u}_k \in \mathbb{R}^n$ that solve the $\frac{1}{2}$-vector cover problem. Let $\vec{u}_j$ be a vector such that for $\vec{v}_{m[1]}, \ldots, \vec{v}_{m[l]}$ we have $\|\vec{u}_j - \vec{v}_{m[i]}\|_\infty \leq \frac{1}{4}$. Note that we can assume that all coefficients of $\vec{u}_j$ are $\frac{1}{4}$ or $\frac{3}{4}$. We claim that the nodes $b_{m[1]}, \ldots, b_{m[l]}$ of $G$ all belong to $N_G^1(b_{m[i]})$ for some $i$. To see this, observe that the distance between any two nodes among $b_{m[1]}, \ldots, b_{m[l]}$ is at most 2. Indeed, for every $k'$ the component $m[k']$ of $\vec{v}_{m[k']}$ is 1 and hence this component of $\vec{u}_j$ is $\frac{3}{4}$. That in turn implies that the component $m[k']$ of each of $\vec{v}_{m[1]}, \ldots, \vec{v}_{m[l]}$ is 1 or $\frac{1}{2}$, and hence $b_{m[k']}$ belongs to each of $N_G^2(b_{m[1]}), \ldots, N_G^2(b_{m[l]})$. Since there are no short cycles in $G$ and all these distances are bounded by 2, there has to be an $i$ such that $b_{m[1]}, \ldots, b_{m[l]}$ all belong to $N_G^1(b_{m[i]})$. Then, we define $d_j$ as $b_{m[i]}$.

Note that $d_j$ dominates all the nodes $b_{m[1]}, \ldots, b_{m[l]}$, which correspond to the vectors $\vec{v}_{m[1]}, \ldots, \vec{v}_{m[l]}$. Therefore, the nodes $d_1, \ldots, d_k$ picked as above form a dominating set in $G$. ◀

C Estimating minimal sample size in passive learning
The probability that a single word of length $l$ is not generated by a random sample $S$ with $s$ examples, generated w.r.t. a distribution $\mathcal{G}(\lambda)$, can be bounded by

$$\left(1 - \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}\right)^s$$

We want to compute a sample size $s$ such that the probability that there is a word of length $l$ not in this sample is at most $\epsilon$. This can be, very roughly, represented by the following inequality:

$$|\Sigma|^l \left(1 - \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}\right)^s < \epsilon$$

which we can conveniently rewrite as

$$\left(\left(1 - \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}\right)^{\frac{|\Sigma|^l}{(1-\lambda)^l \lambda}}\right)^{s \cdot \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}} < \frac{\epsilon}{|\Sigma|^l}$$

By the fact that $(1-x)^{1/x} < e^{-1}$, the above inequality is a consequence of the following one:

$$e^{-s \cdot \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}} < \frac{\epsilon}{|\Sigma|^l}$$

which is equivalent to

$$e^{s \cdot \frac{(1-\lambda)^l \lambda}{|\Sigma|^l}} > \frac{|\Sigma|^l}{\epsilon}$$

Now we apply the natural logarithm:

$$s \cdot \frac{(1-\lambda)^l \lambda}{|\Sigma|^l} > \ln \frac{|\Sigma|^l}{\epsilon}$$

and so

$$s > \frac{|\Sigma|^l}{(1-\lambda)^l \lambda} \cdot \ln \frac{|\Sigma|^l}{\epsilon}$$

For (