Algorithmic identification of probabilities is hard
Laurent Bienvenu, Santiago Figueira, Benoît Monin, and Alexander Shen
LIRMM, CNRS & Université de Montpellier, France; Universidad de Buenos Aires and CONICET, Argentina; LACL, Université Paris 12, France
Abstract.
Suppose that we are given an infinite binary sequence which is random for a Bernoulli measure of parameter p. By the law of large numbers, the frequency of zeros in the sequence tends to p, and thus we can get better and better approximations of p as we read the sequence. We study in this paper a similar question, but from the viewpoint of inductive inference. We suppose now that p is a computable real, and one asks for more: as we are reading more and more bits of our random sequence, we have to eventually guess the exact parameter p (in the form of its Turing code). Can one do such a thing uniformly for all sequences that are random for computable Bernoulli measures, or even for a 'large enough' fraction of them? In this paper, we give a negative answer to this question. In fact, we prove a very general negative result which extends far beyond the class of Bernoulli measures. We do however provide a weak positive result, by showing that, looking at a sequence X generated according to some computable probability measure, we can eventually guess a sequence of measures with respect to which X is random in Martin-Löf's sense.

The study of learnability of computable sequences is concerned with the following problem. Suppose we have a black box that generates some infinite computable sequence of bits X = X(0)X(1)X(2)... We do not know the program running in the box, and want to guess it by looking at finite prefixes X↾n = X(0)...X(n−1) for growing values of n. There could be different programs that produce the same sequence, and it is enough to guess one of them (since there is no way to distinguish between them by just looking at the output bits). The more bits we see, the more information we have about the sequence. For example, it is hard to say something about a sequence seeing only that its first bit is a 1, but looking at the prefix

110010010000111111011010101000

one may observe that this is a prefix of the binary expansion of π, and guess that the machine inside the box does exactly that (though the machine may as well produce the binary expansion of, say, some rational number 47627751/… with the same prefix). In any case, we hope to eventually figure out how the sequence X is generated. More precisely, we hope to have a total computable function A from strings to integers such that for every computable X, the sequence

A(X↾1), A(X↾2), A(X↾3), ...

converges to a program (= index of a computable function) that computes X. This is referred to as identification in the limit, and can be understood in (at least) two ways. Indeed, assuming that we have a fixed effective enumeration (ϕ_e)_{e∈N} of partial computable functions from N to {0, 1}, we can define two kinds of success for an algorithm A on a computable sequence X:

– Strong success: the sequence e_n = A(X↾n) converges to a single value e such that ϕ_e = X (i.e., ϕ_e(k) = X(k) for all k).
– Weak success: the sequence e_n = A(X↾n) does not necessarily converge, but ϕ_{e_n} = X for all sufficiently large n.

Here we assume that A(X↾n) is defined for all n, or at least for all sufficiently large n. The strong type of success is often referred to as explanatory (EX), see, e.g., Definition VII.5.25 in [Odi99, p. 116]. The second type is referred to (see Definition VII.5.44, p. 131 in the same book) as behaviorally correct (BC). Note that it is obvious from the definition that strong success implies weak success. It would be nice to have an algorithm that succeeds on all computable sequences.
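Before discussing what is achievable, here is a minimal runnable sketch of the basic technique, learning by enumeration, applied to a toy class: the e-th sequence lists the bits of e, least significant first, followed by zeros. The class and the learner are our illustration, not a construction from the paper.

```python
from itertools import count

def phi(e, k):
    # Toy effectively enumerable class of total sequences: the e-th
    # sequence lists the bits of e, least significant first, then zeros.
    # Every finite string is a prefix of some member of the class.
    return (e >> k) & 1

def A(prefix):
    """Learning by enumeration: output the least index consistent with
    everything observed so far."""
    for e in count():
        if all(phi(e, k) == b for k, b in enumerate(prefix)):
            return e

X = [phi(13, k) for k in range(12)]       # the sequence with index 13
print([A(X[:n]) for n in range(1, 13)])   # [1, 1, 5, 13, 13, 13, ...]
```

On members of the class the guesses stabilize at the least consistent index, i.e., the learner succeeds in the strong (EX) sense defined above. The question is how far this extends.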
However, succeeding on all computable sequences is impossible, even for weak success: for every (total) algorithm A, there is a computable X such that A does not weakly succeed on X. The main obstacle is that certain machines are not total (produce only finitely many bits), and distinguishing total machines from non-total ones cannot be done computably. However, some classes of computable sequences can be learned, i.e., there exists a total algorithm that succeeds on all elements of the class. Consider for example the class of primitive recursive functions. This class can be effectively enumerated, i.e., there is a total computable function f such that (ϕ_{f(e)})_{e∈N} is exactly the family of primitive recursive functions. Now consider the algorithm A such that A(σ) returns the smallest e such that ϕ_{f(e)}(i) = σ(i) for all i < |σ| (such an e always exists, since every string is a prefix of a primitive recursive sequence). It is easy to see that if X is primitive recursive, A succeeds on X, even in the strong sense (EX).

The theory of learnability of computable sequences (or functions) is precisely about determining which classes of functions can be learned. This depends on the learning model and the type of success, of which there are many variants. We refer to the survey by Zeugmann and Zilles [ZZ08] and to [Odi99, Chapter VII] for a panorama of the field.

Recently, Vitányi and Chater [VC17] proposed to study a related problem. Suppose that instead of a sequence that has been produced by a deterministic machine, we are given a sequence that has been generated by a randomized algorithmic process, i.e., by a Turing machine that has access to a fair coin and produces some output sequence on the one-directional write-only output tape. The output sequence is therefore a random variable defined on the probability space of fair coin tossings. We assume that this machine is almost total, meaning that the generated sequence is infinite with probability 1. (This requirement may look unnecessary. Still, the notion of algorithmic randomness needed for our formalization is well-defined only for computable measures, and machines that are not almost total may not define a computable measure.)

Looking at the prefix of the sequence, we would like to guess which machine is producing it. For example, for the sequence

000111111110000110000000001111111111111
we may guess that it has been generated via the following process: start with 0 and then choose each output bit to be equal to the previous one with probability, say, 4/5 (and different from it with probability 1/5).

So what should count as a good guess for some observed sequence? Again, there is no hope to distinguish between two processes that have the same output distribution. So our goal should be to reconstruct the output distribution and not the specific machine. But even this is too much to ask for. Assume that we have agreed that some machine M with output distribution µ is a plausible explanation for some sequence X. Consider another machine M′ that starts by tossing a coin and then (depending on the outcome) either generates an infinite sequence of zeros or simulates M. If X is a plausible output of M, then X is also a plausible output of M′, because it may happen (with probability 1/2) that M′ simulates M.

A reasonable formalization of a 'good guess' is provided by the theory of algorithmic randomness. As Chater and Vitányi recall, there is a widely accepted formalization of "plausible outputs" for an almost total probabilistic machine with output distribution µ: the notion of Martin-Löf random sequences with respect to µ. These are the sequences that pass all effective statistical tests for the measure µ, also known as µ-Martin-Löf tests. (We assume that the reader is familiar with algorithmic randomness and Kolmogorov complexity. The most useful references for our purposes are [Gác05] and [LV08].) Having this notion in mind, the natural way to extend learning theory to the probabilistic case is as follows:

A class of computable measures M is learnable if there exists a total algorithm A such that for every sequence X that is Martin-Löf random for some measure in M, the sequence

A(X↾1), A(X↾2), A(X↾3), ...

identifies in the limit a measure µ ∈ M such that X is Martin-Löf random with respect to µ.

As in the classical case, there are several ways one can interpret the notion of 'identifying in the limit'. We will come back to this after having introduced some basic notation and terminology related to computable measures (for now one may think of a computable measure as the output distribution of an almost total probabilistic machine).
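As an aside before the preliminaries: the machine guessed in the example above is easy to simulate. The sketch below is ours; replacing the machine's fair-coin input tape by a pseudorandom generator is a simplification, and 4/5 is the parameter from the example.

```python
import random

def persistence_machine(n, stay=4/5, seed=None):
    """Sample n output bits: first bit 0, each later bit equal to the
    previous one with probability `stay`, flipped otherwise."""
    rng = random.Random(seed)
    bits = [0]
    while len(bits) < n:
        prev = bits[-1]
        bits.append(prev if rng.random() < stay else 1 - prev)
    return bits

print("".join(map(str, persistence_machine(40, seed=2))))
# typical output shows long runs of equal bits, as in the example above
```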
We denote by 2^ω the set of infinite binary sequences and by 2^{<ω} the set of finite binary sequences (or strings). The length of a string σ is denoted by |σ|. The empty string (string of length 0) is denoted by Λ. For two strings σ, τ we write σ ⪯ τ if σ is a prefix of τ. The n-th element of a sequence X(0)X(1)... is the value X(n−1) (assuming that the length of X is at least n); the string X↾n = X(0)X(1)...X(n−1) is the n-bit prefix of X. We write σ ⪯ X if the string σ is a prefix of the infinite sequence X (i.e., X↾|σ| = σ). The space 2^ω is endowed with the distance d defined by

d(X, Y) = 2^{−min{n : X(n) ≠ Y(n)}}.

This distance is compatible with the product topology generated by the cylinders [σ] = {X ∈ 2^ω : σ ⪯ X}. Cylinders are both open and closed (clopen); thus, any finite union of cylinders is also clopen. It is easy to see, by compactness, that the converse holds: every clopen subset of 2^ω is a finite union of cylinders. We say that a clopen set C has granularity at most n if C is a finite union of some cylinders [σ] with |σ| = n. We denote by Γ_n the family of clopen sets of granularity at most n.

We now give a brief review of the "computable analysis" aspects of the space of probability measures. For a more thorough exposition of the subject, the main reference is [Gác05]. The space of Borel probability measures over 2^ω is denoted by P. In the rest of the paper, when we talk about a 'measure', we mean an element of the space P. This space is equipped with the weak topology, that is, the weakest topology such that for every σ, the map µ ↦ µ([σ]) is continuous as a function from P to R. Several classical distances are compatible with this topology; for example, one may use the distance ρ constructed as follows. For µ, ν ∈ P, let ρ_n(µ, ν) (for an integer n) be the quantity

ρ_n(µ, ν) = max_{C ∈ Γ_n} |µ(C) − ν(C)|

and then set

ρ(µ, ν) = Σ_n 2^{−n} ρ_n(µ, ν).

The open (resp. closed) ball B of center µ and radius r is the set of measures ν such that ρ(µ, ν) < r (resp. ρ(µ, ν) ≤ r). In the space of measures, the closure of the open ball B of center µ and radius r is the closed ball of center µ and radius r.

The space P is separable, i.e., has a countable dense set of points. An easily describable one is the set I consisting of the measures {δ_σ}_{σ∈2^{<ω}}, where δ_σ is the Dirac measure concentrated on the point σ0^ω, together with all rational convex combinations of such measures. Note that every member of I has a finite description: it suffices to give the list of σ's together with the rational coefficients of the linear combination. Thus one can safely talk about computable functions from/to I. The set I, together with the distance ρ, makes P a computable metric space [Gác05]. Each point µ ∈ P can be written as the limit of a sequence (q_0, q_1, ...) of points in I where ρ(q_i, q_j) ≤ 2^{−i} for i < j. Such a sequence is called a fast Cauchy name for µ. We say that a measure µ is computable if there is a total computable function ϕ_e : N → I such that (ϕ_e(n))_{n∈N} is a fast Cauchy name for µ. Such an e is called an index for µ.

At this point the way we view measures, as points of the space P, does not match the presentation of the introduction, where we asked the learning algorithm to guess, on prefixes of the input X, a sequence of probabilistic machines M_i such that for almost all i, the machine M_i is almost total and X is a plausible output of M_i. The reason is that in fact there are three ways one can think of measures, which are equivalent for our purposes:

(a) A measure is a point of P.

(b) By Carathéodory's theorem, a measure µ can be identified with the function σ ↦ µ([σ]): for every function f : 2^{<ω} → [0, 1] such that f(Λ) = 1 and f(σ0) + f(σ1) = f(σ), there is a unique measure µ such that µ([σ]) = f(σ) for all σ. For example, the uniform measure λ is the unique measure such that λ([σ]) = 2^{−|σ|} for all σ, and the Bernoulli measure β_p of parameter p ∈ [0, 1] is the unique measure satisfying β_p([σ1]) = p · β_p([σ]) for all σ.

(c) Consider a Turing functional M, which one might think of as a Turing machine with a read-only input tape, a work tape and a write-only output tape. We say that M is defined on X if M prints an infinite sequence Y on the output tape given X on the input tape. When M is defined on λ-almost every X, where λ is the uniform Lebesgue measure on the Cantor space that corresponds to the fair coin tossings, we say that M is almost total. Then the function

µ_M(σ) = λ{X : M(X) ⪰ σ}

defines a measure in the sense of item (b). This measure corresponds to the distribution of a random variable that is the output of M on a sequence of uniform independent random bits.

These approaches are equivalent both in the classical and the effective realm, as is well known. The corresponding classes of measures coincide; moreover, one can computably convert an algorithm representing a computable measure according to one of the definitions into the other representations. However, depending on the context, one characterization may be much easier to handle than the others. And indeed, the techniques of the next section, where we prove our main negative result, are of an analytic nature, so characterization (a) will be more convenient, while the positive result of the last section has a more 'algorithmic flavor', for which characterization (c) will be better suited.

The randomness deficiency function d is the largest, up to an additive constant, function f : 2^ω × P → N ∪ {∞} such that

– f is lower semi-computable (i.e., f^{−1}((k, ∞]) is an effectively open subset of the product space 2^ω × P, uniformly in k; an effectively open set is a union of a computably enumerable set of rational balls, or of products of balls, since we consider a product space);
– for every µ ∈ P and every integer k, the inequality µ{X : f(X, µ) > k} < 2^{−k} holds.

We use the usual notation d(X | µ) instead of d(X, µ). We say that X is (uniformly) random relative to the measure µ if d(X | µ) < ∞. For computable measures this notion coincides with the classical notion of Martin-Löf randomness. (This version of the randomness deficiency function is sometimes called "uniform probability-bounded randomness deficiency"; however, we do not use the other versions and call it just "randomness deficiency".)

We end this introduction with a discussion of a concept we will need to state the main theorem of Section 2: orthogonality. Two measures µ, ν ∈ P are said to be orthogonal if there is a Borel set X ⊆ 2^ω such that µ(X) = 1 and ν(X) = 0 (taking the complement of X, we see that orthogonality is a symmetric relation). This is equivalent to the following condition: for each ε > 0 there is a set X_ε such that µ(X_ε) ≥ 1 − ε and ν(X_ε) < ε (indeed, one can then take X = ⋂_i ⋃_j X_{2^{−i−j}}). The class of Bernoulli measures provides an easy example of orthogonality: if p ≠ q, the Bernoulli measures β_p and β_q (see the definition above) are orthogonal (by the law of large numbers, taking for X the set of sequences with a limit frequency of ones equal to p, we have β_p(X) = 1 and β_q(X) = 0). The important fact we need is that when two computable measures µ and ν are orthogonal, they share no random element, i.e., d(X | µ) and d(X | ν) cannot both be finite for any X. For a proof of this result, see for example [BM09].
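Before moving on, a concrete illustration of the metric ρ introduced above (ours, not taken from the paper): for C ranging over Γ_n, the maximum of |µ(C) − ν(C)| is attained by collecting exactly the length-n cylinders on which µ exceeds ν, so ρ_n is straightforward to compute for simple measures such as Bernoulli ones.

```python
from itertools import product

def bernoulli(p):
    """beta_p in view (b): sigma -> beta_p([sigma]) for a 0/1 tuple sigma."""
    return lambda sigma: p ** sigma.count(1) * (1 - p) ** sigma.count(0)

def rho_n(mu, nu, n):
    # max over clopen sets of granularity <= n of |mu(C) - nu(C)|:
    # attained by collecting the length-n cylinders where mu exceeds nu
    return sum(max(mu(s) - nu(s), 0.0) for s in product((0, 1), repeat=n))

def rho(mu, nu, levels=15):
    """Finite truncation of rho(mu, nu) = sum over n of 2^{-n} rho_n(mu, nu)."""
    return sum(2.0 ** (-n) * rho_n(mu, nu, n) for n in range(1, levels))

print(rho(bernoulli(0.5), bernoulli(0.6)))   # a small positive distance
```

Since each ρ_n is at most 1, the truncation error after level n is at most 2^{−n+1}.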
Most classical learning models for computable sequences can be adapted to our probabilistic setting. For example, the EX and BC models mentioned above have the following natural counterparts (we give them the same names, as this should create no confusion).

Definition 1. Let X ∈ 2^ω and let A : 2^{<ω} → N be a total algorithm. We say that:

– A EX-succeeds on X if A(X↾n) converges to a value e that is an index for a computable measure µ with respect to which X is Martin-Löf random.
– A BC-succeeds on X if there exists a computable measure µ such that for almost all n, A(X↾n) is an index for µ and X is Martin-Löf random with respect to µ.

There are also some natural learning models we can define that are more specific to the probabilistic setting. As we discussed, for a given X that is Martin-Löf random with respect to some computable measure, there are several (actually, infinitely many) computable measures with respect to which X is Martin-Löf random. Thus we could allow the learner to propose different measures at each step and not converge to a specific measure, as long as almost all of them are good explanations for the observed X. To measure how good an explanation is, we use the randomness deficiency; thus it makes sense to distinguish between learning with bounded randomness deficiency and with unbounded randomness deficiency.
Definition 2. Let X ∈ 2^ω and let A : 2^{<ω} → N be a total algorithm. We say that:

– A BD-succeeds on X if there exists a constant d such that for almost all n, A(X↾n) is an index for a computable measure with respect to which X is Martin-Löf random, with randomness deficiency at most d.
– A UD-succeeds on X if for almost all n, A(X↾n) is an index for a computable measure with respect to which X is Martin-Löf random.

('BD' and 'UD' stand for 'bounded deficiency' and 'unbounded deficiency'.) Our four learning models are by no means an exhaustive list of possibilities. Just as classical learning theory offers a wide variety of models, one could define a wealth of alternative models (partial learning, team learning, etc.) in our setting. This would take us far beyond the scope of the present paper, and we leave it for further investigation.

Let us note in passing that the four learning models we have presented form a hierarchy, namely:

EX-success ⇒ BC-success ⇒ BD-success ⇒ UD-success

The fact that EX-success implies BC-success and that BD-success implies UD-success is immediate from the definitions. To see that BC-success implies BD-success, recall that in our definition the randomness deficiency depends only on the measure, not on the algorithm that computes it. So if the learning algorithm BC-succeeds on some sequence X, i.e., outputs the same measure (its code) for all sufficiently large prefixes of X, then the deficiency of X with respect to this measure is a constant, and therefore the algorithm BD-succeeds on X.

Now that we have given a precise definition of various learning models for computable probability measures, there are some obvious questions we need to address, the first of which is: for each of the above learning models, is there a single algorithm A that succeeds on all sequences X that are random with respect to some computable measure? (This measure can be different for different X.) And if not, are there natural classes of measures for which there is an algorithm which succeeds on all X that are random with respect to some measure in this class?

The starting point of this paper was a claim made in a preprint of Vitányi and Chater [VC13], where it was stated that there exists an algorithm A that EX-succeeds on every X that is Martin-Löf random with respect to a Bernoulli measure β_p for some computable p (different for different X). Our results (Theorems 4 and 5) imply that there is in fact no such algorithm. Vitányi and Chater later corrected this claim and proved the following weaker statement.

Theorem 3 (Vitányi–Chater [VC17]). Let (p_e) be a partial enumeration of computable reals in [0, 1]. If E ⊆ N is c.e. or co-c.e., and for all e ∈ E, p_e is defined, then there exists an algorithm A that EX-succeeds on every X that is random with respect to β_{p_e} for some e ∈ E.

This result implies, for example, that there is an algorithm that EX-succeeds on all X that are random with respect to some β_q with q a rational number. We prove that this result cannot be extended to all computable parameters p:

Theorem 4.
No algorithm A can BD-succeed on every sequence X that is random with respect to some Bernoulli measure β_p with computable p. A fortiori, no algorithm A can BD-succeed on every sequence X that is random with respect to some computable measure.
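Theorem 4 does not conflict with the law-of-large-numbers remark from the abstract: approximating p to any fixed precision from a β_p-random sequence is easy; what is impossible is converging to an exact program for p. A minimal sketch of the easy half (the sampler and the estimator are our illustration):

```python
import random

def sample_bernoulli(p, n, seed=0):
    """n bits of a beta_p-distributed sequence (a stand-in for a
    beta_p-random sequence)."""
    rng = random.Random(seed)
    return [1 if rng.random() < p else 0 for _ in range(n)]

def estimate(prefix):
    # the frequency of ones converges to p almost surely, but it never
    # yields a *program* computing p, which is what EX/BD-learning demands
    return sum(prefix) / len(prefix)

X = sample_bernoulli(1 / 3, 100_000)
for n in (100, 1_000, 100_000):
    print(n, estimate(X[:n]))
```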
We will in fact prove a more general theorem, replacing the class of Bernoulli measures by any class of measures having some "reasonable" structural properties, and allowing the learning algorithm to succeed on only a fraction of sequences.

Theorem 5.
Let M be a subset of P with the following properties:

– M is effectively closed, i.e., its complement is effectively open: one can enumerate a sequence of rational open balls in P whose union is the complement of M.
– M is computably enumerable, i.e., one can enumerate all rational open balls in P that intersect M.
– For every computable measure ν and every non-empty open subset of M (i.e., a non-empty intersection of an open set in P with M), there is a computable µ in this open subset that is orthogonal to ν.

Let also δ be a positive number. Then there is no algorithm A such that for every computable µ ∈ M, the µ-measure of the set of sequences X on which A BD-succeeds is at least δ.

The notion of a computably (= recursively) enumerable closed set is standard in computable analysis, see [Wei00, Definition 5.1.1].

Note that the hypotheses on the class M are not very restrictive: many standard classes of probability measures have these properties. In particular, the class {β_p : p ∈ [0, 1]} of Bernoulli measures is such a class, so we get Theorem 4 as a corollary: there is no algorithm that can learn all Bernoulli measures (not to speak of all Markov chains). To see that the third condition is true for the class of Bernoulli measures, note that only countably many Bernoulli measures may be non-orthogonal to a given measure µ: the sets L_p of sequences with limit frequency p are disjoint, so only countably many of them may have positive µ-measure. It remains to note that every non-empty open subset of the class of Bernoulli measures has the cardinality of the continuum.

Let us give another example (beyond Bernoulli measures and Markov chains) that satisfies the requirements of Theorem 5. In this example, the probability of the next bit being 1 may depend on many of the previous bits. For every parameter p ∈ [0, 1], consider the measure µ_p associated to the stochastic process that generates a binary sequence bit by bit as follows: the first bit is 1, and the conditional probability of 1 after a prefix σ of length k is p/(k+1). One can check that the class {µ_p : p ∈ [0, 1]} satisfies the hypotheses of the theorem (observe that p can easily be reconstructed from any sequence that is random with respect to µ_p; a sampling sketch is given below).

Note also that these hypotheses are not added just for convenience: although they might not be optimal, they cannot be outright removed. If we did not require the class M to be effectively closed (hence effectively compact), the class of Bernoulli measures β_p with rational parameter p would qualify, but Vitányi and Chater's theorem tells us that there is an algorithm that correctly identifies each of the measures in that class with probability 1. The third condition is important, too. Consider the measures β_0 and β_1, concentrated on the sequences 0000... and 1111... respectively. Then the class M = {pβ_0 + (1−p)β_1 : p ∈ [0, 1]} is indeed effectively closed and computably enumerable, but it is obvious that there is an algorithm that succeeds with probability 1 for all measures of that class (in the strongest sense: the first bit determines the entire sequence). For the second condition we do not have a counterexample showing that it is really needed, but it is true for all the natural classes (and it is guaranteed to be true if M has a computable dense sequence).
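Here is our sampling sketch of the process generating µ_p, together with the parameter recovery promised above. Since the expected number of ones among the first n bits is 1 + p(H_n − 1), where H_n is the n-th harmonic number, rescaling the count of ones by H_n recovers p, though slowly: the fluctuations decay only like 1/√log n.

```python
import random

def sample_mu(p, n, seed=0):
    """First bit 1; after a prefix of length k, the next bit is 1 with
    probability p/(k+1)."""
    rng = random.Random(seed)
    bits = [1]
    for k in range(1, n):
        bits.append(1 if rng.random() < p / (k + 1) else 0)
    return bits

def estimate_p(bits):
    # E[#ones among the first n bits] = 1 + p*(H_n - 1)
    n = len(bits)
    h = sum(1.0 / k for k in range(1, n + 1))
    return (sum(bits) - 1) / (h - 1)

X = sample_mu(0.7, 500_000, seed=3)
print(estimate_p(X))   # an estimate of p; noisy, since convergence is logarithmic
```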
The rest of this section is devoted to the proof of Theorem 5.

Fix a subset M of P satisfying the hypotheses of the theorem, and some δ > 0. Assume for the sake of contradiction that there is a total algorithm A such that for every computable µ ∈ M, the µ-measure of the set of sequences X on which A BD-succeeds is at least δ. In the rest of the proof, by "success" we always mean BD-success.

We may assume without loss of generality that our algorithm A, on an input σ, outputs an integer e which is a code for a partial computable function ϕ_e from N to I (our set of rational points in P, described above) that is defined on all of N or on some initial segment of N, and ρ(ϕ_e(n), ϕ_e(n+1)) < 2^{−n−1} when both ϕ_e(n) and ϕ_e(n+1) are defined. When this sequence is total, it converges to a measure µ with computable speed: ρ(ϕ_e(n), µ) ≤ 2^{−n}. This is not guaranteed by the definition of BD-success, but we may "trim" the algorithm by ensuring that the sequence ϕ_e(0), ϕ_e(1), ..., whether finite or infinite, indeed contains elements of I and satisfies the distance conditions where defined (by waiting until the conditions are checked; note that we have a strict inequality, which will manifest itself at some moment, if true).

Suppose now that for some index e, we do not know whether ϕ_e is total, but we see that ϕ_e(n) is defined for some n, and d(X | ν) > d holds for some X and for all measures ν at distance ≤ 2^{−n} of ϕ_e(n). Then we already know, should ϕ_e be total and converge to some µ, that d(X | µ) > d. Thus we use the following notation: if A(σ) returns e, then d(X | A(σ)) is the quantity

sup{d | ∃n: ϕ_e(n)↓ and d(X | ν) > d for all ν at distance ≤ 2^{−n} from ϕ_e(n)}.

The supremum of an empty set (which appears, for example, if A(σ) is a code of an empty sequence) is considered to be 0. Our function d(X | A(σ)) has two key properties which are essential for the rest of our proof:

(a) If A(σ) = e, and ϕ_e is total and converges to µ, then d(X | A(σ)) = d(X | µ).
(b) d(X | A(σ)) is lower semi-computable, uniformly in (X, σ).

Let us first prove that property (a) holds. Assuming that A(σ) = e, and ϕ_e is total and converging to µ, we know that µ is at distance ≤ 2^{−n} of ϕ_e(n) for all n. Thus, in the definition of d(X | A(σ)), every member of the set on the right-hand side is at most d(X | µ), and therefore d(X | A(σ)) ≤ d(X | µ). On the other hand, if d(X | µ) > d, by lower semicontinuity of the function d there is n such that ρ(ν, µ) ≤ 2^{−n} implies d(X | ν) > d, and thus d(X | A(σ)) > d. Property (a) is proven.

For property (b), we use the fact that the space P is effectively compact (one can effectively enumerate all covers of P consisting of a finite union of open rational balls), together with the fact that the infimum of a lower semicomputable function on an effectively compact set is lower semicomputable, uniformly in a code for the effectively compact set (see [Gác05] for both facts). Thus, the predicate "d(X | ν) > d for all ν at distance ≤ 2^{−n} from ϕ_e(n)" is computably enumerable uniformly in e, n, X, d (said otherwise, the set of tuples (e, n, X, d) satisfying this property is an effectively open subset of N × N × 2^ω × N). Property (b) follows.

Thanks to property (a), when A(σ) = e and ϕ_e is total and converges to a measure µ, we can safely identify A(σ) with µ and write A(σ) = µ. Additionally, we say that "A(σ) is a measure" when A(σ) = µ for some measure µ.

Now, for every pair of integers (N, d), we define the set

Wrong(N, d) = {X | (∃n ≥ N) d(X | A(X↾n)) > d}.

This is the set of sequences X on which the algorithm is "visibly wrong" at some prefix of length n ≥ N, for the deficiency level d. Note that Wrong(N, d), understood in this way, is effectively open uniformly in (N, d) and is non-increasing in each of its parameters. The intersection of the sets Wrong(N, d) over all N and d is some set Wrong; as the name says, the algorithm A cannot BD-succeed on any sequence in this set. (Note that other reasons for failure are possible, e.g., A may not provide a measure on prefixes of some sequence.) It is technically convenient to combine the two parameters N and d into one (even though they are of a different nature) and consider a decreasing sequence of sets Wrong(N) = Wrong(N, N) whose intersection is Wrong.
We also consider the set Succ(N, d) of all sequences X such that A BD-succeeds on X at level N with deficiency d, i.e.,

Succ(N, d) = {X : (∀n ≥ N) [A(X↾n) is a measure, and d(X | A(X↾n)) ≤ d]}.

The set Succ(N, d) is a closed set. Indeed, it is an intersection of sets indexed by n, so we need to show that each of them is closed. For each n there are finitely many possible prefixes of length n, so the first condition ("A(X↾n) is a measure") defines a clopen set. The second condition defines an effectively closed subset of each cylinder where A(X↾n) is a measure. (Note that we do not claim that Succ(N, d) is effectively closed, since the condition "to be a measure" is only a Π⁰₂-condition.) By definition, the set Succ(N, d) does not intersect the set Wrong(N, d).

The set Succ(N, d) increases as N or d increases; the union of these sets is the set of all X on which A BD-succeeds; we denote it by Succ. Again we may combine the parameters and consider an increasing sequence of sets Succ(N) = Succ(N, N) whose union is Succ.
All these considerations deal with the space of sequences. Now we switch to the space of measures and the class M. We look at the values µ(Wrong(N)) for different µ ∈ M. Consider some threshold x ∈ [0, 1]. One of the following two cases happens:

– there exist some number N and some non-empty open set U ⊆ M such that µ(Wrong(N)) ≤ x for all µ ∈ U;
– for every N, the set of points µ ∈ M where µ(Wrong(N)) > x is dense in M.

There is some threshold where we switch from one case to the other, so let us take close values p < q (i.e., we take the difference q − p to be much smaller than the δ from the statement of the theorem; it would be enough to require that q − p < δ/10) such that the first case happens for q and the second one happens for p. Choose some N and an open ball B that has a non-empty intersection with M such that µ(Wrong(N)) ≤ q for all µ ∈ B ∩ M (this is possible since the first case happens for q).

Lemma 6. There exists a computable measure µ∗ ∈ B ∩ M such that µ∗(Wrong) ≥ p.

Proof. Since the second case happens for p, we can find some µ ∈ B ∩ M such that µ(Wrong(0)) > p. Since Wrong(0) is open, the same is true for some clopen subset C of it, i.e., µ(C) > p. Note that µ(C) is a continuous function of µ for a fixed clopen C, so we can find a smaller ball B_1 ⊆ B intersecting M such that µ(Wrong(0)) ≥ µ(C) > p for all µ ∈ B_1 ∩ M. Then, repeating the same argument, we find an even smaller ball B_2 ⊆ B_1 intersecting M such that µ(Wrong(1)) > p for all µ ∈ B_2 ∩ M, then some B_3 ⊆ B_2 such that µ(Wrong(2)) > p for all µ ∈ B_3 ∩ M, etc.

Using the completeness of the space of measures, consider the intersection point µ∗ of all the B_i (we may assume that their radii converge to 0 and that the closure of B_{i+1} is contained in B_i, and this guarantees the existence and uniqueness of the intersection point). We have µ∗(Wrong(i)) > p for all i (but µ∗(Wrong(N)) ≤ q; the same is true for all subsequent sets Wrong(i) with i ≥ N). The continuity property for the measure µ∗ then guarantees that µ∗(Wrong) ≥ p.

Refining this argument, we can get a computable measure µ∗ with this property. Indeed, we may choose B_{i+1} in such a way that even the closed ball of the same radius is contained in B_i; this property is enumerable. "To have a non-empty intersection with M" is also an enumerable property (by assumption), and "µ(Wrong(i)) > p for all µ ∈ B_{i+1}" is also an enumerable property (we may assume without loss of generality that p is rational, and Wrong(i) is effectively open uniformly in i). So we can perform a search until B_{i+1} is found; the sequence of the B_i is then computable, and hence µ∗ is computable. ⊓⊔

Now the argument goes as follows. Since µ∗ is computable, the set Succ has µ∗-measure at least δ by assumption. Success means that (at least) some measures are provided by the learning algorithm A on all prefixes of sufficiently large length M. There are finitely many possible prefixes of that length, and they correspond to finitely many computable measures µ_1, ..., µ_s. Then we choose a measure µ′ orthogonal to all these measures and very close to µ∗. We get a contradiction by showing that µ′(Wrong(N)) is almost p + δ (or more) and therefore exceeds q, which is not possible due to the choice of B. To get the δ-increase we use the fact that sequences that are µ′-random cannot be µ_i-random and should therefore have infinite deficiency. Let us now explain this argument in detail.

Recall that we have chosen N in such a way that µ(Wrong(N)) ≤ q for all µ ∈ B ∩ M, in particular for all µ ∈ M sufficiently close to µ∗. On the other hand, µ∗(Wrong(M)) ≥ µ∗(Wrong) ≥ p for all M.

Since µ∗(Succ) ≥ δ, the continuity property of measures guarantees that µ∗(Succ(M)) ≳ δ for sufficiently large M, where ≳ means inequality up to an additive error term that is very small compared to δ (in fact, δ/10 would be small enough; we do not add more than 10 inequalities of this type). Fix some M that is large enough and greater than the N from the previous paragraph.

The set Wrong(M) is open and has µ∗-measure at least p. Therefore, there exists a clopen set C ⊆ Wrong(M) such that µ∗(C) ≳ p. Since the set C is clopen, there exists some K ≥ M such that the K-bit prefix determines whether a sequence belongs to C (the granularity of C is at most K).

Now the Cantor space is split into 2^K intervals that correspond to the different prefixes of length K. Some of these intervals form the set C (and belong to Wrong(M) entirely). Among the rest, we distinguish good and bad intervals: good intervals correspond to prefixes for which the learning algorithm A produces a measure (whatever this measure is). Let µ_1, ..., µ_s be all the measures that are produced by A on the good intervals (we have at most 2^K of them).

Note that Succ(M) is covered by the good intervals. Indeed, it is disjoint from Wrong(M) and therefore from C, and it is also disjoint from the bad intervals by definition (since K ≥ M, the algorithm A should produce a measure when applied to a K-bit prefix).

Now consider a measure µ′ that is very close to µ∗ and orthogonal to all the µ_i. (Our assumption allows us to get a measure very close to µ∗ and orthogonal to a given computable measure; here we have several measures µ_1, ..., µ_s instead of one, but this does not matter, since we may consider their average: any measure orthogonal to the average is orthogonal to all the µ_i.)

Since the µ∗-measure of Succ(M) is almost δ (or more), and Succ(M) is covered by the good intervals, the µ∗-measure of the union of the good intervals is also almost δ (or more). The same is true for every measure µ′ sufficiently close to µ∗, since the union of the good intervals is a clopen set.

No µ′-random sequence can be µ_i-random, since these measures are orthogonal. This implies infinite deficiency, so all µ′-random sequences in good intervals belong to Wrong(M). Hence the µ′-measure of the part of Wrong(M) outside C is almost δ (or more), while the part of Wrong(M) inside C has µ′-measure almost p or more (this was true for µ∗, and µ′ is close to µ∗). Together we get a lower bound close to p + δ for µ′(Wrong(M)). And this gives us a contradiction, since µ′(Wrong(M)) ≤ µ′(Wrong(N)), and the latter should be at most q for all µ′ ∈ M close to µ∗. (Recall that we have chosen q − p much smaller than δ.)

This contradiction finishes the proof of Theorem 5. ⊓⊔
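For the reader's convenience, here is our summary of the final chain of estimates (each ≳ hides an additive error of at most δ/10, and C and the good intervals are disjoint):

```latex
\begin{align*}
\mu'(\mathrm{Wrong}(M))
  &\ge \mu'(C) + \mu'(\{\mu'\text{-random points in good intervals}\})
   \gtrsim p + \delta,\\
\mu'(\mathrm{Wrong}(M))
  &\le \mu'(\mathrm{Wrong}(N)) \le q \qquad (M \ge N,\ \mu' \in B \cap M),
\end{align*}
```

which is impossible once q − p < δ/10 and the error terms are accounted for.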
We have established in Theorem 4 that, unsurprisingly, there is no total algorithm A that BD-succeeds on all sequences X that are random with respect to some computable probability measure. After proving this theorem, the authors conjectured that one could even prove the same result for UD-success. But it turns out that the situation is drastically different for UD-learning: we will show in this section that there is a uniform learning algorithm in this model.

Theorem 7. There exists a total algorithm A that UD-succeeds on every X that is Martin-Löf random with respect to some computable probability measure.

Recall that this means that for large enough n, A(X↾n) is a (code of a) measure with respect to which X is random. However, (a) A(X↾n) may be different for different values of n, and (b) the randomness deficiency of X with respect to A(X↾n) is unbounded.

The proof of this result is inspired by a result of Harrington (reported in [CS83, Theorem 3.10] or [Odi99, Theorem VII.5.55, p. 139]) which states that there exists an algorithm to learn, in the classical sense, all computable sequences up to finitely many errors. More precisely, there is a total algorithm A such that for every computable sequence X, for almost all n, A(X↾n) is a program for an almost everywhere defined function that differs from X in only finitely many places. Indeed, let A(σ) be the program that, given input m, spends time m searching for the minimal program computing some extension of σ and then runs this program, if found, on m (and returns, say, 0 if no such program is found). Let e be the minimal program that computes X. All smaller programs fail to compute X on some input k (by either being undefined or giving a wrong answer). If n is greater than all these k, then none of the programs smaller than e would qualify as a candidate for any m, and for large enough m the program e will be approved. (Note that A(σ) may be a non-total program even if σ is long: we know only that it is defined on large enough values of m.)
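Below is our schematic, runnable rendering of this construction. The enumeration `run` is a toy stand-in for a real enumeration of partial computable functions (program e prints bit k of e, least significant first, after e computation steps, and program 3 diverges on inputs ≥ 2); only the shape of the search is meant to be faithful.

```python
def run(e, k, budget):
    """Toy program enumeration: the value of program e on input k if it
    halts within `budget` steps, else None."""
    if e == 3 and k >= 2:
        return None                      # a non-total program
    return (e >> k) & 1 if budget >= e else None

def A(prefix):
    """Return the guessed program as a closure: on input m, spend a
    budget m searching for the least program not (yet) contradicting
    `prefix`, and run it on m (0 if nothing is found or it diverges)."""
    def guess(m):
        for e in range(m):
            if all(run(e, k, m) in (None, prefix[k]) for k in range(len(prefix))):
                out = run(e, m, m)
                return out if out is not None else 0
        return 0
    return guess

X = [(13 >> k) & 1 for k in range(50)]           # a computable sequence
g = A(X[:8])
print([g(m) for m in range(8, 20)] == X[8:20])   # True: finitely many errors
```

In this toy run the guessed program already agrees with X everywhere; in general, Harrington's argument only promises agreement on all large enough inputs.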
Proof. We will use an argument somewhat similar to Harrington's to prove Theorem 7. In this section, it is more convenient to consider measures as functions, by identifying µ with the function σ ↦ µ(σ) (here and in the rest of the section, µ(σ) is an abbreviation of µ([σ])). It is also more convenient to use an alternative characterization of Martin-Löf randomness, via the Schnorr–Levin theorem, which states that if µ is a computable measure, a sequence X is Martin-Löf random with respect to µ if and only if the prefix complexity of its prefixes is big:

(∃c)(∀n) K(X↾n) > −log µ(X↾n) − c

(see for example [LV08]). We say that a measure µ is exactly computable when the function σ ↦ µ(σ) is rational-valued and computable as a function from 2^{<ω} to Q. Of course, not all computable measures are exactly computable, but the following fact holds:

Lemma 8. If X ∈ 2^ω is random with respect to a computable probability measure µ, it is random with respect to some exactly computable probability measure ν. Moreover, one can suppose that ν(σ) > 0 for all strings σ.

See [JL95] for a proof (essentially, we approximate the given computable measure with enough precision using positive rational numbers).

This lemma is convenient because it is possible to effectively list the family F of partial computable functions µ from 2^{<ω} to Q such that

– µ(Λ) = 1;
– for every n, either µ(σ) is defined for all strings σ of length n, or it is undefined for all strings σ of length n;
– if µ(σ0) and µ(σ1) are defined, then µ(σ) is defined and is equal to µ(σ0) + µ(σ1);
– µ(σ) > 0 for every σ on which µ is defined.

Let (µ_e)_{e∈N} be an effective enumeration of all the functions in F. It is among these functions that, given a sequence X, we are going to look for the 'best candidate' µ_e such that µ_e is a measure (i.e., is total) and X is random relative to µ_e. Suppose we are given a prefix σ of X. What is a good candidate µ_e for this σ? Here we use the same approach as algorithmic statistics: a good explanation µ_e for a string σ should (a) be defined on σ, (b) be simple, which is measured by the prefix complexity K(e), and (c) give σ a small 'local' randomness deficiency, which we measure by the quantity

ld(e, σ) = max_{τ⪯σ} [−log µ_e(τ) − K(τ)],

with the convention that ld(e, σ) = ∞ when µ_e(τ) is undefined for some prefix τ of σ. The Schnorr–Levin theorem mentioned above can now be reformulated as follows: the value

d(X | µ_e) = sup_{τ⪯X} [−log µ_e(τ) − K(τ)] = lim_n ld(e, X↾n) = sup_n ld(e, X↾n)

is finite if and only if µ_e is a measure and X is Martin-Löf random with respect to µ_e. In fact, d(X | µ_e) is a version of the randomness deficiency; for a fixed measure µ_e this quantity is equal to the deficiency d of the previous section up to logarithmic precision (see, e.g., [BGH+11] for details).

Returning to algorithmic statistics, we combine the two quantities into a score function

score(e, σ) = K(e) + ⌈ld(e, σ)⌉

(as in golf, 'score' is meant in a negative sense: a high score(e, σ) means that µ_e is not a good candidate for being a measure with respect to which σ looks random). Finally, we define the function Best such that Best(σ) is the value of e that minimizes score(e, σ) (if there are several, we let Best(σ) be the smallest one). That is,

Best(σ) = min{e | (∀e′) score(e, σ) ≤ score(e′, σ)}.

Best is computable with the halting set ∅′ as an oracle. Indeed, to compute Best(σ), one can first find some e with score(e, σ) < ∞ (this can be done computably; the value s = score(e, σ) is then computed using ∅′). Then, using ∅′, one can find N such that K(e′) > s (and thus score(e′, σ) > s) for all e′ > N. Finally, take Best(σ) to be the number e in [0, N] that minimizes score(e, σ) (again taking the smallest one if there are several), which can be done effectively relative to ∅′, because score is itself computable relative to ∅′.

The core of the proof of Theorem 7 is the following lemma, which is of independent interest. It implies that learning measures in the EX sense, which we showed in the previous section to be impossible, becomes possible if one is given access to the oracle ∅′.
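Here is a toy, runnable rendering of the score and of Best over a three-element family of Bernoulli measures. K is not computable, so we substitute stand-ins: fixed weights K(e) for the indices and, for K(τ), the ideal code length under the 2^{−K(e)}-weighted mixture of the candidates. These proxies and the family itself are our choices, made only so that the selection mechanism can be watched in action.

```python
import math
import random

def log2_bernoulli(p):
    """log2 of beta_p([tau]) as a function of the 0/1 string tau."""
    return lambda tau: (tau.count("1") * math.log2(p)
                        + tau.count("0") * math.log2(1 - p))

MEASURES = [log2_bernoulli(1/2), log2_bernoulli(2/3), log2_bernoulli(0.9)]
K_E = [2, 4, 4]        # stand-in prefix complexities K(e) of the indices

def K_str(tau):
    """Stand-in for K(tau): ideal code length under the 2^{-K(e)}-weighted
    mixture of the candidates (the real K is not computable)."""
    logs = [-k + mu(tau) for k, mu in zip(K_E, MEASURES)]
    m = max(logs)
    return -(m + math.log2(sum(2.0 ** (l - m) for l in logs)))

def ld(e, sigma):
    """Local deficiency: max over prefixes tau of -log2 mu_e(tau) - K(tau)."""
    mu = MEASURES[e]
    return max(-mu(sigma[:n]) - K_str(sigma[:n]) for n in range(len(sigma) + 1))

def score(e, sigma):
    return K_E[e] + math.ceil(ld(e, sigma))

def best(sigma):
    return min(range(len(MEASURES)), key=lambda e: (score(e, sigma), e))

rng = random.Random(1)
sample = "".join("1" if rng.random() < 0.9 else "0" for _ in range(2000))
print(best(sample))    # 2: the 0.9-biased coin minimizes the score
```

With these stand-ins, a β_{0.9}-typical string gives the biased candidate a bounded local deficiency, while charging the other candidates roughly n·(1 − H(0.9)) extra bits, so the minimization picks index 2.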
Lemma 9. Let X be a sequence that is random with respect to some computable probability measure. The sequence of integers Best(X↾n) converges to a single value e∗ such that µ_{e∗} is a measure, and X is random with respect to µ_{e∗}.

Proof. Fix such a sequence X. For each e, the sequence score(e, X↾n) is nondecreasing and takes its values in N ∪ {∞}, and thus converges to some S(e) ∈ N ∪ {∞}. As we have said, the Schnorr–Levin theorem guarantees that S(e) < ∞ if and only if µ_e is a measure and X is Martin-Löf random with respect to µ_e. Thus we know that S(e) < ∞ for some e, by our assumption that X is Martin-Löf random with respect to some computable probability measure.

Let e∗ be the index such that S(e∗) is minimal among all S(e) (the smallest one if there are several). For any i such that K(i) > S(e∗), we have for any n:

score(i, X↾n) > S(e∗) ≥ score(e∗, X↾n).

Thus Best(X↾n) ≠ i for all n. In other words, only the j such that K(j) ≤ S(e∗) matter when selecting the best candidate for the sequence X. Those j form a finite set. For all such j, we know that score(j, X↾n) is non-decreasing and eventually reaches its final value. After that, for all sufficiently large n, we have Best(X↾n) = e∗. ⊓⊔
At this point, we have seen that the function Best does achieve the learning of measures we want, but unfortunately this function is only ∅′-computable. By Shoenfield's limit lemma, this means that there exists a computable procedure which, given σ, generates a sequence e_0, e_1, ... of integers that converges to e_∞ = Best(σ). There is in general no way to compute µ_{e_∞} from this sequence. However, what we can do is combine all the µ_{e_i} together (being cautious about the fact that some µ_{e_i} may be partial) into a single computable measure ν such that ν > c · µ_{e_∞} for some c > 0, and this, by the Schnorr–Levin theorem, guarantees that everything that is random with respect to µ_{e_∞} is also ν-random.
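The combination step can be sketched as follows. The candidate functions, their convergence pattern, and the filtering only at the queried prefix (the proof below runs the disagreement search in parallel over all strings) are simplifications of ours.

```python
# Candidates are conditional-probability functions prefix -> P(next bit = 1),
# with None modeling a partial (non-total) candidate.
def make_candidates():
    good = lambda prefix: 0.7                                   # the limit value
    bad = lambda prefix: 0.2                                    # disagrees with good
    partial = lambda prefix: 0.7 if len(prefix) <= 3 else None  # agrees where defined
    return [bad, partial, bad] + [good] * 1000  # converges: good from some point on

def filtered_conditional(candidates, prefix, steps):
    """Conditional probability of 1 after `prefix`, read off a candidate
    still alive after `steps` rounds of pairwise-disagreement filtering."""
    alive = list(range(min(steps, len(candidates))))
    for i in list(alive):
        for j in list(alive):
            if i < j and i in alive and j in alive:
                vi, vj = candidates[i](prefix), candidates[j](prefix)
                if vi is not None and vj is not None and vi != vj:
                    alive.remove(i)      # a disagreeing pair destroys
                    alive.remove(j)      # both of its members
    for i in alive:
        v = candidates[i](prefix)
        if v is not None:
            return v                     # some good candidate always survives
    raise RuntimeError("increase `steps` and wait longer")

print(filtered_conditional(make_candidates(), (0, 1, 0, 1), steps=50))  # 0.7
```

Since each bad candidate can destroy at most one good one and cofinitely many candidates are good, a good candidate survives every round, and from some level on the extracted conditional probabilities agree with those of µ_{e_∞}.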
More precisely, we have the following lemma.

Lemma 10. Let f : 2^{<ω} → N be a total ∅′-computable function such that µ_{f(σ)} is a measure for all σ. Then there is a computable function g such that µ_{g(σ)} is a measure for all σ, and µ_{g(σ)} ≥ c_σ · µ_{f(σ)} for some positive c_σ.

Proof. Consider the following effective procedure. On input σ, we use the Shoenfield limit lemma to effectively get a sequence (e_i) converging to e_∞ = f(σ). Initially all e_i are considered "candidates". We then apply a filtering process that deletes some of these candidates. Recall that the corresponding µ_{e_i} are elements of F. We compute in parallel all µ_{e_i}(τ) for all pairs (i, τ) for which e_i is still a candidate. If we find two candidates e_i, e_j and a string τ such that µ_{e_i}(τ) and µ_{e_j}(τ) are both defined and different from each other, then we remove e_i and e_j from the list of candidates. This way we ensure, since the sequence converges, that from some point on, for every candidate e_i, the corresponding function µ_{e_i} is equal to µ_{e_∞} on its domain (but µ_{e_i} may be partial). Indeed, each bad candidate (i.e., an e_i such that e_i ≠ e_∞) may destroy at most one good candidate, and by assumption almost all candidates are good.

Now we let µ_{g(σ)} be the computable measure ν constructed in the following way. First, let ν(Λ) = 1. Then we compute the conditional probabilities ν(x0)/ν(x) and ν(x1)/ν(x) level by level. When computing them at level N, we use the conditional probabilities of some candidate that remains alive after N steps of the filtering process. (Any of them could be used; for example, we may take the one with the smallest computation time. Note that at least one good candidate remains, so we will not wait forever.)

As we have seen, starting from some level only good candidates remain, so the conditional probabilities above this level are the same for µ_{f(σ)} and ν. Since by assumption all values of all the measures are positive, this guarantees the required inequality. ⊓⊔

We can now put all the pieces together to prove Theorem 7. Applying the previous lemma to f = Best, we get a computable function g such that for every σ, the measure µ_{g(σ)} dominates, up to a multiplicative constant, the measure µ_{Best(σ)}. For every X that is random with respect to some computable measure, we know, by Lemma 9, that µ_{Best(X↾n)} is eventually constant and equal to a measure with respect to which X is random. This measure is dominated (up to a multiplicative constant) by µ_{g(X↾n)}, and thus X is also random with respect to µ_{g(X↾n)} (this change of measure increases the deficiency by at most O(1)). This finishes the proof. ⊓⊔

Acknowledgements. Laurent Bienvenu and Santiago Figueira acknowledge the support of the Laboratoire International Associé "INFINIS". Laurent Bienvenu and Alexander Shen also acknowledge the support of the ANR-15-CE40-0016-01 RaCAF grant.
References
[BGH+11] Laurent Bienvenu, Peter Gács, Mathieu Hoyrup, Cristóbal Rojas, and Alexander Shen. Algorithmic tests and randomness with respect to a class of measures. Proceedings of the Steklov Institute of Mathematics, 274:34–89, 2011.
[BM09] Laurent Bienvenu and Wolfgang Merkle. Constructive equivalence relations for computable probability measures. Annals of Pure and Applied Logic, 160:238–254, 2009.
[CS83] John Case and Carl Smith. Comparison of identification criteria for machine inductive inference. Theoretical Computer Science, 25:193–220, 1983.
[Gác05] Peter Gács. Uniform test of algorithmic randomness over a general space. Theoretical Computer Science, 341(1–3):91–137, 2005.
[JL95] David Juedes and Jack Lutz. Weak completeness in E and E₂. Theoretical Computer Science, 143(1):149–158, 1995.
[LV08] Ming Li and Paul Vitányi. An Introduction to Kolmogorov Complexity and Its Applications. Texts in Computer Science. Springer-Verlag, New York, 3rd edition, 2008.
[Odi99] Piergiorgio Odifreddi. Classical Recursion Theory: Volume II. Elsevier, 1999.
[VC13] Paul Vitányi and Nick Chater. Algorithmic identification of probabilities. https://arxiv.org/abs/1311.7385v1, 2013.
[VC17] Paul M. B. Vitányi and Nick Chater. Identification of probabilities. Journal of Mathematical Psychology, 76(Part A):13–24, 2017.
[Wei00] Klaus Weihrauch. Computable Analysis. Springer, Berlin, 2000.
[ZZ08] Thomas Zeugmann and Sandra Zilles. Learning recursive functions: a survey. Theoretical Computer Science, 397:4–56, 2008.