ABSOLUTELY NO FREE LUNCHES!
GORDON BELOT
Abstract.
This paper is concerned with learners who aim to learn patterns in infinite binary sequences: shown longer and longer initial segments of a binary sequence, they either attempt to predict whether the next bit will be a 0 or will be a 1, or they issue forecast probabilities for these events. Several variants of this problem are considered. In each case, a no-free-lunch result of the following form is established: the problem of learning is a formidably difficult one, in that no matter what method is pursued, failure is incomparably more common than success; and difficult choices must be faced in choosing a method of learning, since no approach dominates all others in its range of success. In the simplest case, the comparison of the set of situations in which a method fails and the set of situations in which it succeeds is a matter of cardinality; in other cases, it is a topological matter (meagre vs. co-meagre), or a hybrid computational-topological matter (effectively meagre vs. effectively co-meagre, in the sense of Mehlhorn (1973)).

1. Introduction
(For helpful comments and discussion, thanks to an anonymous referee, Josh Hunt, Thomas Icard, Mikayla Kelley, Tom Sterkenburg, Bas van Fraassen, and Francesca Zaffora Blando.)

The various no-free-lunch theorems of statistical, computational, and formal learning theory offer ways to make precise the basic insight that there can be no optimal general-purpose approach to learning. These theorems come in two main forms. Some show that there are contexts in which certain approaches to learning succeed in each salient situation, but that each such approach has the same expected performance across those possible situations. Results of this kind are measure-relative: in order for expectations to be defined, a measure must be imposed on the space of situations that a learner might face—and the results in question only hold relative to some of the measures that one might impose. (For results of this kind, see Wolpert and Macready (1997), Wolpert (2002), and Ho and Pepyne (2002); for further discussion, see von Luxburg and Schölkopf (2011).) Results of a second kind are absolute in the sense that they do not rely upon the choice of a measure on the space of envisaged situations. Here are descriptions of two paradigmatic results of this kind.

Maybe there exists some kind of universal learner, that is, a learner who has no prior knowledge about a certain task and is ready to be challenged by any task? . . . The no-free-lunch theorem states that no such universal learner exists. To be more precise, the theorem states that for binary classification prediction tasks, for every learner there exists a distribution on which it fails. . . . In other words, the theorem states that no learner can succeed on all learnable tasks—every learner has tasks on which it fails while other learners succeed. (Shalev-Shwartz and Ben-David 2014, 36)

Let T be any learning machine. . . . [W]e will defeat the machine T. That is, [w]e will have constructed a regularity, depending on T, which is beyond the power of T to extrapolate. However . . . it is always possible to build another machine which can extrapolate every regularity that T can extrapolate and also extrapolate the one that T can't extrapolate. Thus, there cannot exist a cleverest learning machine: for, for every learning machine T, there exists a machine T′ which can learn everything that T can learn and more besides. (Putnam 1963b, 6f.; see also Gold 1967, Theorem I.5. As noted by Case and Smith (1983, 208): "This appears to be the earliest result indicating that there may be some difficulty with the mechanization of science.")

We can think of such results as encapsulating two facts about the predicament of learners situated in certain contexts. (a) They face a difficult problem: no approach they might adopt succeeds across all envisaged situations. (b) Difficult choices must be made: different approaches succeed in different situations, with no approach dominating all others in its range of success.

Here we assemble some more or less elementary results, some already well-known, that combine to give no-free-lunch results of the second, absolute variety, applicable to agents attempting to learn patterns in infinite data streams of bits (a problem that, according to Li and Vitányi (2019, 6), "constitutes, perhaps, the central task of inductive reasoning and artificial intelligence"). Nature presents our agents with one-way infinite binary sequences one bit at a time and after each bit is revealed each agent is asked to make a prediction about the next bit. We will consider five models of learning, differing from one another as to what sort of predictions our agents are required to make or as to the criterion of success. And for each model, we will consider variants in which neither the agent nor Nature is required to follow a computable strategy, in which the agent is required to follow a computable strategy but Nature is not, and in which both the agent and Nature are required to follow computable strategies. For each variant of each model, we establish both elements required for a no-free-lunch result. (i) Difficult choices must be faced in selecting a method of learning: we will show that no approach dominates all its rivals, either by showing that for each method there is another that succeeds in a disjoint set of situations (evil twin results) or by showing that for every method there is another that succeeds in a strictly larger family of situations (better-but-no-best results). (ii) We also show our learners face a formidably difficult problem: for each of the problems we consider, there is a sense in which for any method of addressing that problem, the situations in which it fails are incomparably more common than the situations in which it succeeds (so here we go beyond the paradigm results mentioned above, which show only that for each method, there exists a situation in which it fails). (For other results of the sort developed here, see Fortnow et al. (1998).)

Following some preliminaries in Section 2, we investigate in Section 3 the predicament of learners who must attempt to guess, before each bit is revealed, whether it will be a 0 or a 1. We will consider two criteria of success for such next-value learners (for early investigations of such agents, see Putnam (1963a) and Gold (1967)): when facing a given data stream they should eventually predict each new bit correctly (NV-learning); or they should predict each new bit correctly, except for a family of errors that has vanishing asymptotic density (weak NV-learning). (The notion of NV-learning is due to Bārzdiņš (1972); see also Blum and Blum (1975). The notion of weak NV-learning is a near relative of the notion of coarse computability introduced in Jockusch and Schupp (2012).)
In Section 3.1 we will see that for NV-learning of arbitrary sequences, failure is incomparably more common than success in the sense that any method for predicting bits succeeds for a countable family of binary sequences and fails for an uncountable family of binary sequences. In Section 3.2 we will see that for weak NV-learning of arbitrary sequences, any method succeeds for an uncountable set of sequences and fails for an uncountable set of sequences, but the successes are always incomparably less common than the failures in a topological sense, forming a meagre set. In Section 3.3, we restrict attention to computable methods for the next-value learning of computable sequences and find that for any method, the sets of successes and failures are equivalent both from the point of view of cardinality and the point of view of topology—but that the successes are nonetheless incomparably less common than the failures in the hybrid topological-computational sense (due to Mehlhorn (1973)) that they form an effectively meagre set. (In the case of NV-learning, this result is due to Fortnow et al. (1998).) Along the way we will see that the notion of weak NV-learning, while strictly more inclusive than the notion of NV-learning, is neither weaker nor stronger than two other variants of NV-learning, NV′-learning (due to Bārzdiņš and Freivalds (1972)) and NV″-learning (due to Podnieks (1974)).

In Section 4 we turn to agents who face a data stream sampled from a probability measure chosen by Nature and who are required to issue forecast probabilities for the next bit's being a 0 or a 1 just before it is revealed. (For this model of learning agents, see Solomonoff (1964).) We consider three criteria of success for agents engaged in such next-chance prediction: we can ask that for any event, the probabilities that our agents assign to that event converge almost certainly to the true probability as they see larger and larger data sets (strong NC-learning); we can ask that their forecast probabilities for the next bit become arbitrarily accurate, almost certainly, in the limit of large data sets (NC-learning); or we can ask that they, almost certainly, meet the last-mentioned standard modulo a set of missteps of vanishing asymptotic frequency (weak NC-learning). (These criteria of success derive from Blackwell and Dubins (1962), Kalai and Lehrer (1994), and Lehrer and Smorodinsky (1996). The criteria of success employed in the literature on Solomonoff induction differ in focussing on average or expected performance in the long run—on the relation between those notions and the notion of NC-learning, see Ryabko and Hutter (2007).) For the problem of next-chance learning in the face of a data stream generated by an arbitrary measure, we see in Sections 4.1 and 4.2 that for any of our criteria of success, each method succeeds for an uncountable set of measures that Nature might have chosen and fails for an uncountable set of such measures—but that the former set is always incomparably smaller than the latter, being meagre.
In Section 4.3, we restrict attention to computable strategies for next-chance learning in contexts in which the data stream is generated by a computable measure and find, for each of our three criteria of success, that the set of learnable measures is an effectively meagre subset of the family of computable measures. Section 5 provides a few concluding remarks.
2. Preliminaries
2.1. The Main Characters
We will be concerned below with a number of topological spaces.

(i) The space of bits, B := {0, 1}, equipped with the discrete topology (so every subset of B is open).

(ii) Finite products of B with itself: for each n ∈ N, the space of n-bit strings, B^n, equipped with the discrete topology (we count 0 as a natural number and use ∅ to denote either the empty string of zero bits that is the sole member of B^0 or the empty set, depending on context). We will think of elements of B^n as strings (concatenations of symbols) rather than as n-tuples. For w ∈ B^n and m ≤ n we write w(m) for the m-th bit of w and write w[m] for the m-bit initial segment of w. For m, n ∈ N with m ≤ n, we have the natural projection map π_{nm} : w ∈ B^n ↦ w[m] ∈ B^m.

(iii) The space of binary strings, B* := ⋃_{n≥0} B^n, also equipped with the discrete topology. If v and w are binary strings we write v.w for the string that results from concatenating v and w (in that order) and write v.w.0 for the result of concatenating v with w and with 0, etc. We write |w| for the number of bits in binary string w.

(iv) Cantor space, C, the set of all infinite binary sequences equipped with the product topology (we take sequences to be indexed by positive natural numbers). We can characterize this topology as follows: if w is an n-bit string, then we use B_w to denote the set of sequences whose first n bits are given by w; the set of all such B_w (as w ranges over B*) is a basis for the product topology and we call the B_w basic open sets. Illustration: the set of sequences that have 0 as their second bit is an open set because it is the union of the basic open sets B_00 and B_10. For σ ∈ C we write σ(m) for the m-th bit of σ and write σ[m] for the m-bit string formed by concatenating the first m bits of σ. For each n ∈ N we have the natural projection map π_n : σ ∈ C ↦ σ[n] ∈ B^n. A sequence of points σ_1, σ_2, σ_3, . . . in Cantor space converges to σ ∈ C if and only if for each k, there exists an N so that for n ≥ N, σ_n(k) = σ(k). We use B to denote the σ-algebra of Borel subsets of C. We use C to denote the subspace of C consisting of computable sequences (i.e., the σ such that the map k ∈ N ↦ σ[k] is computable).

(v) For each k ∈ N, the space P_k of Borel probability measures on B^k. Since |B^k| = 2^k, we can identify any µ ∈ P_k with a 2^k-tuple of real numbers in the closed unit interval that sum to one. We take P_k to be equipped with the topology that it inherits from being embedded in this way as a closed subset of R^{2^k} (which we take to be equipped with its standard topology). We call µ ∈ P_m and ν ∈ P_n with m ≤ n consistent if for each subset A of B^m we have µ(A) = ν(π_{nm}^{-1}(A)).

(vi) The space P of Borel probability measures on C equipped with the weak topology, which can be characterized as follows. For each binary string w and each pair of numbers p and q in the closed unit interval with p < q, let S_{w,p,q} := {µ ∈ P : p < µ(B_w) < q}.
The set of all such S_{w,p,q} forms a sub-basis for the weak topology on P: the open sets of the weak topology are arbitrary unions of finite intersections of these sub-basic sets. Under the weak topology, a sequence {µ_k} of measures in P converges to µ ∈ P if and only if lim_{k→∞} µ_k(B_w) = µ(B_w) for each w ∈ B*.

(For the weak topology on spaces of measures on metric spaces, see Parthasarathy (1967, chap. II) and Billingsley (1999). For more on P, see Kechris (1995, chap. 17) and Reimann (2008). Since a measure in P is determined by the weights it assigns to the basic open sets B_w, by fixing an enumeration of the binary strings we can identify each µ ∈ P with a sequence of numbers in the closed unit interval; in this way we identify P with a closed subset of the Hilbert cube ([0, 1]^ω equipped with the product topology), and the weak topology is the topology that P inherits from this embedding. Each B_w is a clopen subset of C and so is a continuity set for any measure in P. So the Portmanteau Theorem implies that the above condition is necessary for weak convergence. And it is also sufficient, since the B_w form a countable basis for C closed under finite intersections. See, e.g., Billingsley (1999, Theorems 2.1 and 2.2).)

Below, in order to simplify notation, for µ a measure in P and w a binary string, we will write µ(w) in place of µ(B_w). Using this notation, the Carathéodory Extension Theorem tells us that any map ν̄ : B* → [0, 1] such that ν̄(∅) = 1 and ν̄(w) = ν̄(w.0) + ν̄(w.1) for each w ∈ B* induces a unique ν ∈ P such that ν(w) = ν̄(w) for all w ∈ B*. As usual, we consider ν ∈ P to be computable if and only if there exists a computable F : B* × N → Q such that |ν(w) − F(w, n)| < 2^{−n} for all w ∈ B* and n ∈ N. We use P to denote the subspace of P consisting of computable measures.

Remark 2.1. B*, C, P, each of the B^k, and each of the P_k are separable and completely metrizable; all but B* are compact. In B, the other B^n, and in B*, each point is isolated (i.e., for any point, the singleton set containing that point is open). There are no isolated points in C, the P_k (k > 0), or P.
Remark 2.2. C and the P_k (k > 0) of course have cardinality c (the cardinality of the continuum). So does P: since P is non-empty, compact, and metrizable, there is a continuous map from C onto P; since P is a non-empty, separable, and completely metrizable space without isolated points, there is an embedding of C into P.

2.2. The Meagre & the Co-Meagre
We are going to be interested in making comparisons of size for certain subsets of C and P. The most straightforward standard of comparison is cardinality: it is natural to say that any uncountable set is incomparably larger than any countable set.

Below we will see examples where the set of learnable sequences or measures and the set of unlearnable sequences or measures have the same cardinality—but in which it is intuitively natural to say that the unlearnable sequences or measures are incomparably more common than the learnable sequences or measures.

The intuitive notions of size in play here correspond nicely with the topologists' notions of meagre and co-meagre subsets of a topological space. Recall that a nowhere dense subset of a topological space is one whose closure has empty interior—or, equivalently, a subset A of a topological space X is nowhere dense if and only if for any non-empty open set U ⊂ X, there exists a non-empty open set U* ⊂ U with A ∩ U* = ∅. And recall that a meagre subset of a topological space is one that can be written as a countable union of nowhere dense sets, while a co-meagre subset is one that is the complement of a meagre set.

For any topological space X, the class of meagre subsets of X is closed under the operations of taking subsets and taking countable unions. The Baire Category Theorem tells us that in a completely metrizable space, no non-empty open set is meagre. So, in particular, no non-empty completely metrizable space has any subsets that are both meagre and co-meagre. The results just mentioned motivate the standard practice in topology, analysis, and related mathematical fields of considering the elements of a meagre subset of a completely metrizable space to be extremely rare and the elements of the complement of such a set to be exceedingly common—so that objects that form a co-meagre set are often referred to as being typical.
(See, e.g., Nies (2009). For the facts cited in Remark 2.2, see Kechris (1995, Theorems 4.18 and 6.2). If A ⊂ X were both meagre and co-meagre, then so would be its complement; but then X could be written as a union of two meagre sets—which is impossible if no non-empty open subset of X is meagre.)

Illustration: one says that typical continuous functions on the unit interval are nowhere differentiable because the nowhere differentiable functions form a co-meagre subset of the space of continuous functions under the uniform topology.

Remark 2.3. Here is an additional compelling rationale for this practice. Fix a subset S of C. An infinite two-player game is to be played. In the first round, Player I selects a non-empty binary string v_1, then Player II selects a non-empty binary string w_1; and similarly in each subsequent round, Player I selects a non-empty binary string v_k, then Player II selects a non-empty binary string w_k. Player I wins the game if the infinite binary sequence v_1.w_1.v_2.w_2. . . . is in S, otherwise Player II wins. Intuitively, if Player I has a winning strategy for the Banach–Mazur game for S, then S must be overwhelmingly large as a subset of C, while if Player II has a winning strategy, then S must be nigh ignorably small as a subset of C. The intuitive notions of small and large subsets appealed to here correspond precisely to the notions of meagre and co-meagre subsets: Player I has a winning strategy if and only if S is co-meagre in some open subset of C; Player II has a winning strategy if and only if S is meagre as a subset of C.

(Here we have described a special version of the game adapted to C. A more general version makes sense in any topological space X and we always have the connection between meagreness and winning strategies for Player II; the connection between co-meagreness and winning strategies for Player I requires some additional hypotheses in the general setting. See Oxtoby (1957) and Kechris (1995, §§ 8H and 21.C).)

3. Extrapolation
Think of Nature as having chosen a binary sequence, which is now being revealed to a learning agent one bit at a time. After each new bit is presented, the agent attempts to predict what the next value will be on the basis of the data seen so far. The agent succeeds in this task if, from a certain point onwards, the predictions made match reality (almost perfectly).
Definition. An extrapolator is a function m : B* → B. We denote the set of extrapolators by E.

Definition. An extrapolating machine is a computable extrapolator—i.e., a computable function m : w ∈ B* ↦ m(w) ∈ B. We denote the set of extrapolating machines by E.

Definition. Let m be an extrapolator and σ a binary sequence. We say that m NV-learns σ (or that σ is NV-learnable by m) if there is an N such that for all n > N, m(σ[n]) = σ(n+1). (The notion of NV-learning for extrapolating machines is due to Bārzdiņš (1972); see also Blum and Blum (1975).)
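Since NV-learning is defined by a condition on the infinite tail of a data stream, no finite computation can certify it; but the basic objects are easy to prototype. The following sketch (ours, purely for illustration; the function names are not the paper's) represents extrapolators as functions from finite bit strings to bits and reports, over a finite prefix of a data stream, where an extrapolator's last incorrect next-bit prediction so far occurs.

# A minimal sketch (not from the paper): extrapolators as Python functions
# from finite bit strings (tuples of 0/1) to a single bit.

def all_ones(w):
    """Extrapolator that predicts 1 no matter what it has seen."""
    return 1

def copy_last(w):
    """Extrapolator that repeats the last bit seen (predicts 1 on the empty string)."""
    return w[-1] if w else 1

def last_error_index(m, sigma_prefix):
    """Index of the last n with m(sigma[:n]) != sigma[n], or -1 if there is none.

    NV-learning requires that such errors eventually stop altogether; on a
    finite prefix we can only report where the last error seen so far lies.
    """
    last = -1
    for n in range(len(sigma_prefix)):
        if m(sigma_prefix[:n]) != sigma_prefix[n]:
            last = n
    return last

# Example: a data stream that is eventually all 1's.
sigma = tuple([0, 1, 0, 0] + [1] * 96)
print(last_error_index(all_ones, sigma))   # 3: all_ones errs only on the initial 0's
print(last_error_index(copy_last, sigma))  # 4: its last error comes just after the final 0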
Jockusch and Schupp (2012, p. 472) remark that "In recent years, there has been a general realization that worst-case complexity measures, such as P, NP, exponential time, and just being computable, often do not give a good overall picture of the difficulty of a problem." As an example, they observe that although there exist finitely presented groups with unsolvable word problems, in every such group the words expressing the identity have vanishing asymptotic density, when words are enumerated in lexicographic order. So the linear-time algorithm that on the input of any word guesses that that word does not express the identity would make a negligible set of errors if fed all words in lexicographic order. If we demand perfection,
then the word problem is impossibly hard—but if we can live with making mistakes a negligible fraction of the time, it is as easy as could be. This motivates Jockusch and Schupp to introduce a generalization of the notion of computability: a sequence is coarsely computable if it differs from some computable sequence in a set of bits of vanishing asymptotic density. We will consider here a notion of learning that stands to NV-learning as coarse computability stands to computability (see Remark 3.4 for more on the relation to coarse computability).

Definition. We say that extrapolator m weakly NV-learns binary sequence σ (or that σ is weakly NV-learnable by m) if:

lim_{n→∞} |{k ≤ n : m(σ[k]) = σ(k+1)}| / n = 1.

For m an extrapolator and σ a binary sequence, we say that, according to m, n ∈ N corresponds to a good bit of σ if m(σ[n]) = σ(n+1) and corresponds to a nasty bit of σ if m(σ[n]) ≠ σ(n+1). To say that m NV-learns σ is to say that, according to m, σ eventually consists of nothing but good bits. To say that m weakly NV-learns σ is to say that, according to m, although σ may contain infinitely many nasty bits, these have vanishing limiting relative frequency. We will consider each of these two criteria of learning in turn.

3.1. NV-Learning
Officially, the job of an extrapolator is to predict the next bit on the basis of the current data set. But we can also think of an extrapolator m as a means of guessing the entire data sequence on the basis of any initial segment.

Definition. For m an extrapolator and w ∈ B^n, we use σ_w^m to denote the sequence defined as follows:
– For k = 1, . . . , n, σ_w^m(k) = w(k) (i.e., σ_w^m[n] = w).
– σ_w^m(n+1) = m(w);
– σ_w^m(n+ℓ) = m(w.σ_w^m(n+1). . . . .σ_w^m(n+ℓ−1)) (ℓ = 2, 3, . . .).
We say that m guesses σ_w^m on input w.

Note that if m is an extrapolating machine then for any w ∈ B*, σ_w^m is a computable sequence: on any input, an extrapolating machine guesses a computable sequence.

Trivially, there is an equivalence between the sequences NV-learned by an extrapolator and the sequences guessed by it.

Proposition 3.1. Extrapolator m NV-learns sequence σ if and only if σ = σ_w^m for some w ∈ B*.

Proof.
Suppose that m NV-learns σ. Then there is an n_0 such that for all n ≥ n_0, m(σ[n]) = σ(n+1). So m guesses σ on input w = σ[n_0]. Suppose, on the other hand, that there is an n_0 such that m guesses σ on input w = σ[n_0]. Then m NV-learns σ, since for all n > n_0, m(σ[n]) = σ(n+1). □

So asking that m eventually correctly predict next bits is equivalent to asking that m eventually be able to answer correctly all questions about the data stream. (On this point, see, e.g., Angluin and Smith (1983). This is related to the deeper fact that NV = PEX—see Theorem 2.19 of Case and Smith (1983), where the result is attributed to private communications from van Leeuwen and from Bārzdiņš.)

Proposition 3.2. For any extrapolator m, the sequences NV-learnable by m form a countably infinite set dense in C while the sequences not NV-learnable by m form a dense subset of C of cardinality c.

Proof.
On the one hand, B ˚ is a countable set and the preceding proposition tells us thatthe map w P B ˚ ÞÑ σ wm has as its range the set of sequences NV-learnable by m. So this setis countable. And since for any w P B ˚ , σ wm P B w the set of NV-learnable sequences is densein C (and is therefore infinite). On the other hand, each B w has cardinality c but containsonly countably many binary sequences NV-learnable by m. (cid:3) Corollary . The set t σ P C | D m P E such that m NV-learns σ u is countable.As usual, we call a sequence t σ i u i P N of elements of C uniformly computable in i if there isa computable f : N ˆ N Ñ B such that f p i, j q “ σ i p j q , for all i, j P N . Proposition . (a) Let m P E and S be a countable subset of C . Then there is an m ˚ P E that NV-learns every σ P S as well as everything NV-learned by m. (b) Let m P E and let S “ t σ i u i P N be a family of elements of C uniformly computable in i. Then there is an m ˚ P E that NV-learns every σ P S as well as everything NV-learned by m. Proof.
We present the argument for (b)—essentially the same argument works for (a).Define ˜ m P E as follows: on input of w P B n , ˜ m finds K “ t k P N | ď k ď n, σ k r n s “ w u ; if K ‰ ∅ , then ˜ m p w q “ σ ℓ p n ` q , where ℓ is the least element of K ; otherwise, ˜ m p w q “ m p w q . Define m ˚ as follows: m ˚ has a counter that keeps tally of how many incorrect predictionhave been made in the course of processing a given data stream; in processing input w P B ˚ ,m ˚ simulates m if an even number of incorrect predictions have been made and simulates ˜ m if an odd number have been made.Clearly, m ˚ is an extrapolating machine. Suppose that m ˚ is shown a data stream σ that itdoes not NV-learn. Then m ˚ must make infinitely many incorrect predictions in processing σ. So σ cannot be a sequence NV-learned by m : any such sequence is guessed by m whenit sees sufficiently long initial segments. Similarly, σ cannot be any of the σ k , since each ofthese is guessed by ˜ m when it see sufficiently long initial segments. (cid:3) Proposition . Let m be an extrapolator and let S Ă C be the set of sequences that it NV-learns. Then there is an extrapolator m : such that the set S : of sequences that it NV-learnsis disjoint from S —and where m : is in E if m is. Proof.
Define m† by setting m†(w) = 1 − m(w) for each w ∈ B*. □

So we have both elements required for the sort of no-free-lunch result we seek. The problem of NV-learning is a formidably difficult one: each (computable) extrapolator fails to NV-learn incomparably more sequences than it NV-learns: the set on which it succeeds is countable (and hence meagre), so the set on which it fails is uncountable (indeed, co-meagre). And there are hard choices to be made: for any (computable) extrapolator, there is another that NV-learns sequences that the first cannot NV-learn. There is no optimal method of extrapolation.
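Two constructions used in this subsection are easy to make concrete. The sketch below (ours, for illustration only; the names are not the paper's) computes a finite prefix of the sequence σ_w^m guessed by an extrapolator on input w, by feeding the extrapolator its own predictions, and forms the "evil twin" extrapolator, which disagrees with the original on every input and so NV-learns a disjoint set of sequences.

# Illustrative sketch (not from the paper): the "guessing" and "evil twin"
# constructions of Section 3.1, for extrapolators given as Python functions.

def guess_prefix(m, w, length):
    """First `length` bits of the sequence guessed by m on input w.

    The guessed sequence starts with w and is then continued by feeding
    m its own output, bit by bit.
    """
    s = list(w)
    while len(s) < length:
        s.append(m(tuple(s)))
    return tuple(s[:length])

def evil_twin(m):
    """Extrapolator that disagrees with m on every input string."""
    return lambda w: 1 - m(w)

def copy_last(w):
    return w[-1] if w else 1

print(guess_prefix(copy_last, (0, 1), 8))              # (0, 1, 1, 1, 1, 1, 1, 1)
print(guess_prefix(evil_twin(copy_last), (0, 1), 8))   # (0, 1, 0, 1, 0, 1, 0, 1)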
3.2. Weak NV-Learning
If an extrapolator NV-learns a sequence, then it also weakly NV-learns it. But the converse is not true.

Example 3.1. Consider the extrapolating machine m that outputs 1 on any input. This machine NV-learns all and only sequences that are eventually all 1's—a countably infinite set. But m weakly NV-learns continuum-many sequences. For, let σ̂ be an arbitrary binary sequence and let σ be the sequence defined as follows: for n = 1, 2, . . . , if k = 2^n, then σ(k) = σ̂(n); otherwise, σ(k) = 1. According to m, the nasty bits of σ have vanishing asymptotic density, so m weakly NV-learns σ. And there are continuum-many σ̂ we could use as input for this construction, each determining a distinct sequence weakly NV-learned by m. Note that there is no input on which m guesses a sequence that contains infinitely many 0's, although it weakly NV-learns uncountably many sequences with this feature. Note also that although m is computable, it weakly NV-learns uncountably many uncomputable sequences and weakly NV-learns sequences of arbitrary Turing degree.

Proposition 3.5. Each extrapolator weakly NV-learns a dense set of sequences of cardinality c and fails to weakly NV-learn a dense set of sequences of cardinality c.

Proof.
Let m be an extrapolator, w an n-bit binary string, and σ̂ an arbitrary sequence. We construct sequences σ* and σ† as follows:
– For k = 1, . . . , n, σ*(k) = σ†(k) = w(k).
– For k = n + 2^ℓ (ℓ = 1, 2, . . .), σ*(k) = σ†(k) = σ̂(ℓ).
– For all other k, σ*(k) = m(σ*(1).σ*(2). . . . .σ*(k−1)) and σ†(k) = 1 − σ*(k).
According to m, any nasty (good) bits in σ* (σ†) occur with indices of the form n + 2^ℓ. So m weakly NV-learns σ* and fails to weakly NV-learn σ†. By varying w, we obtain weakly NV-learnable and not weakly NV-learnable sequences in each basic open set of C. And by varying σ̂ we obtain continuum-many sequences of each type. □

So for any extrapolator, there are continuum-many sequences that it can weakly NV-learn and continuum-many sequences that it cannot weakly NV-learn. But, intuitively, there is a sense in which it is much more difficult to construct a sequence weakly NV-learnable by a given extrapolator than it is to construct a sequence that is not weakly NV-learnable by that extrapolator. Consider again the extrapolator m that outputs 1 on any input. In order to construct a sequence that this extrapolator weakly NV-learns, you begin with the all 1's sequence, then sprinkle in some 0's, subject to the constraint that the set of indices of the slots containing 0's has vanishing asymptotic density in N. In order to construct a sequence that this extrapolator can't weakly NV-learn, you begin with the all 1's sequence and sprinkle in as many 0's as you like, just being careful to make sure that the set of indices of the slots containing 0's doesn't have vanishing asymptotic density. The latter task is, intuitively, easier: e.g., because there are a lot more densities not equal to zero than equal to zero. This intuition is borne out by the following result.
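Before the result is stated, here is a small numerical illustration (ours, not part of the paper's argument) of the asymmetry just described, using the all-1's machine of Example 3.1: a sequence whose 0's sit at the powers of two has empirical error density tending to 0, while a sequence with 0's in every other slot does not.

# Illustrative sketch (not from the paper): empirical density of "nasty" bits
# (incorrect next-bit predictions) for the all-1's extrapolator of Example 3.1.

def nasty_density(m, sigma_prefix):
    """Fraction of indices n < len(prefix) with m(sigma[:n]) != sigma[n]."""
    n = len(sigma_prefix)
    errors = sum(1 for k in range(n) if m(sigma_prefix[:k]) != sigma_prefix[k])
    return errors / n

def all_ones(w):
    return 1

N = 4096
powers_of_two = {2 ** j for j in range(1, 13)}
sparse_zeros = tuple(0 if k + 1 in powers_of_two else 1 for k in range(N))
alternating  = tuple((k + 1) % 2 for k in range(N))

print(nasty_density(all_ones, sparse_zeros))  # about 0.003, and shrinking as N grows
print(nasty_density(all_ones, alternating))   # 0.5, bounded away from 0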
Proposition . Let m be any extrapolator. The sequences weakly NV-learnable by m forma meagre subset of C . Proof.
Let us say that binary string w is wicked according to m if at least half of the bits of w are nasty according to m. For each n P N , let A n be the set of sequences that do not haveat least n initial segments that are wicked according to m. We claim that each A n is nowhere dense. To establish this, it suffices to show that for anybinary string w, there is another, w ˚ , depending on n and w, such that w ˚ extends w and B w ˚ Ş A n “ ∅ . To this end, let w be a string and let w ˚ be the result of extending w by | w | bits that are nasty according to m, then tacking on n more nasty bits. Every sequence in w ˚ then has at least n initial segments that are wicked according to m. So A : “ Ť n “ A n is a meagre subset of C . And any sequence σ weakly NV-learnable by m must be in A —for otherwise, σ would have the feature that for each k, it contained at least k initial segments wicked according to m, which would mean that the asymptotic relativefrequency of nasty bits in σ could not vanish. So the set of sequences weakly NV-learnableby m, being a subset of a meagre set, is meagre. (cid:3) Corollary . The set t σ P C | D m P E such that m weakly NV-learns σ u is meagre in C . So the problem of weakly NV-learning sequences is formidably difficult. And difficultchoices must be made in the face of this intractability—there can be no optimal extrapolatorfor weak NV-learning.We have the following better-but-no-best result. Proposition . (a) Let m P E and let S be a countable subset of C . Then there is an m ˚ P C that NV-learns every σ P S and also weakly NV-learns everything that m does. (b) Let m P E and let S “ t σ i u i P N be a family of elements of C that is uniformly computable in i. Then there is an m ˚ P C that NV-learns every σ P S and also weakly NV-learns everythingthat m does. Proof.
We present the argument for (b)—essentially the same argument works for (a).Define m ˚ as follows: on input of w P B n , m ˚ finds K “ t k P N | ď k ď log n, σ k r n s “ w u ; if K ‰ ∅ , then m ˚ p w q “ σ ℓ p n ` q , where ℓ is the least element of K ; otherwise, m ˚ p w q “ m p w q . Clearly, m ˚ is an extrapolating machine and NV-learns each σ k (since each σ k has an initialsegment on which m ˚ guesses σ k ). And if σ P C is weakly NV-learned by m then it is alsoweakly NV-learned by m ˚ : in processing the first 2 n bits of σ, m ˚ can disagree with m atmost n times; so the asymptotic density of bits on which m ˚ and m disagree in processing σ is zero.We also have the usual sort of evil-twin result. Proposition . Let m be an extrapolator and let S Ă C be the set of sequences that itweakly NV-learns. Then there is an extrapolator m : such that the set S : of sequences thatit weakly NV-learns is disjoint from S —and where m : is in E of m is. Proof.
Define m : by setting m : p w q “ ´ m p w q for each w P B ˚ . According to either m or m : , in any sequence that the other weakly NV-learns, the good bits have asymptotic densityzero. (cid:3) Extrapolation of Computable Sequences
While it is plausible that every method of learning implementable by a natural or artificiallearning agent is computable, the data streams that our agents face may or may not becomputable. Still, there are many settings in which we can be confident that our agentsface computable data streams. So let us specialize to the setting in which computable ex-trapolators attempt to (weakly) NV-learn computable sequences and see how the landscapesurveyed above is transformed. Remark 3.4 below will show that this strengthens the observation of Jockusch and Schupp (2012, p. 438)that the set of coarsely computable sequences is meagre in C . Thanks here to Tom Sterkenburg and to an anonymous referee for helpful suggestions. Unless, that is, physical reality itself is fundamentally computational in nature—for a range of views of thistopic, see the papers collected in Zenil (2013). or m P E , we denote by N V p m q the set of computable sequences that are NV-learned by m. We use
N V to denote: t S Ă C | D m P E with S Ď N V p m qu . We likewise use
N V w p m q to denote the set of computable sequences weakly NV-learned byan extrapolating machine m and use N V w to denote: t S Ă C | D m P E with S Ď N V w p m qu . Proposition . N V is a proper subset of
N V w . Proof.
Clearly
N V Ď N V w . We give an example of a set in
N V w ´ N V . Consider again the extrapolating machine m of Example 3.1 above that outputs 1 on everyinput. Let U “ N V w p m q , the set of computable binary sequences in which 0’s have vanishingasymptotic density. We are going to show that U is not in N V . Suppose that there is an extrapolating machine m that NV-learns each sequence in U. Noticethat for any w P B ˚ , the sequence w. ω is in U —so for sufficiently large ℓ P N , we musthave m p w. ℓ q “ . Let σ P C be the sequence of the form 1 n . . n . . n . . . . where each n j is chosen to be the smallest n larger than 2 j such that m p n . . n . . . . . . n j ´ . . n q “ . Then σ P U : since m is computable, so is σ ; and by construction, 0’s occur with vanishingasymptotic density in σ. But m does not NV-learn σ, since σ contains infinitely many 0’seach of which m predicts will be a 1. (cid:3) Remark . We mention two of the most fundamental variations on
N V . A partial extrapo-lating machine is a partial recursive function m : B ˚ Ñ B . We say that a partial extrapolatingmachine m NV -extrapolates σ P C if: (i) m p σ r k sq is defined for all k P N ; and (ii) D N P N such that for all n ą N, m p σ r n sq “ σ p n ` q . We write
N V for the set of S Ă C such thatthere is a partial extrapolating machine that NV -extrapolates each σ P S. We say that a partial extrapolating machine m NV -extrapolates σ P C if: D N P N such thatfor all n ą N, m p σ r n sq is defined and equal to σ p n ` q . We write
N V for the set of S Ă C such that there is a partial extrapolating machine that NV -extrapolates each σ P S. Obviously,
N V Ď N V Ď N V . In fact,
N V Ă N V Ă N V . The proof of Proposition 3.9above carries over essentially unchanged (except that dove-tailing is required) to show that
N V w is not contained in N V (let alone in N V ). We will see below in Remark 3.3 that N V w does not contain N V (let alone N V ).Via Propositions 3.1 and 3.2, we know that every extrapolating machine NV-learns a count-ably infinite subset of C . But there there can be no best extrapolating machine: Propositions3.4 and 3.8 tells us each extrapolation machine has an evil twin that (weakly) NV-learns adisjoint set of computable sequences; and Propositions 3.3(b) and 3.7(b) tell us that eachextrapolating machine is dominated by another that (weakly) NV-learns everything it canwhile also NV-learning every member of a uniformly computable family of elements of C . In this setting, what comparative judgements can we make about the sets of computablesequences that a given extrapolating machine (weakly) NV-learns and doesn’t (weakly) NV-learn? This notion is due to B¯arzdi¸nˇs and Freivalds (1972). This notion is due to Podnieks (1974). That
N V Ă N V is due to B¯arzdi¸nˇs and Freivalds (1972); that N V Ă N V is due to Podnieks (1974). SeeCase and Smith (1983, Corollary 2.29, Corollary 2.31, Theorem 3.1, and Theorem 3.5). roposition . For any m P E , following are dense subsets of C :(a) the set of computable sequences NV-learnable by m ;(b) the set of computable sequences not NV-learnable by m ;(c) the set of computable sequences weakly NV-learnable by m ;(d) the set of computable sequences not weakly NV-learnable by m. Straightforward adaptations of the proofs of Propositions 3.2 and 3.5 yield that (a) and (d)are dense. And (b) and (c) are super-sets of (d) and of (a), respectively. (cid:3)
It follows that the set of computable sequences (weakly) NV-learned by an extrapolatingmachine m and the set of computable sequences not (weakly) NV-learned by an extrapolat-ing machine m are both countably infinite subsets of C —so we have parity at the level ofcardinality. A classical result implies that we also have parity at the level of topology. Proposition . Any two countable dense subsets of C are homeomorphic. Proof.
See, e.g., Dasgupta (2014, chapter 17). (cid:3)
So we can say: for any computable method of extrapolating computable sequences, failureand success are equally common—and difficult choices must be made in selecting a com-putable method of extrapolation, since no method dominates all its rivals in its range ofsuccess.But, intuitively, we ought to be able to say something stronger. After all, B¯arzdi¸nˇs and Freivalds(1972) showed that if S is a set of total recursive functions, then following are equivalent :(i) each member of S is NV-learnable; (ii) S is a subclass of a recursively enumerable set oftotal recursive functions; (iii) S is a subclass of an abstract complexity class. So only veryspecial subsets of C are NV-learnable—which means that generic subsets should not be in N V . Indeed, there is a natural hybrid computational-topological notion of that underwrites theconclusion that failure is incomparably more common than success for computable extrap-olation of computable sequences. Mehlhorn (1973) introduced the important notion of aneffectively meagre subset of the set of total recursive functions. We specialize this apparatusto the C . By way of motivation, note that in any topological space X with basis of open sets W , asubset A is nowhere dense if and only if for every non-empty U P W there is a non-empty U ˚ P W with U ˚ Ă U such that A Ş U ˚ “ ∅ . So a subset A Ă C is nowhere dense if and onlyif there is a function f : B ˚ Ñ B ˚ such that for each binary string w : (i) f p w q extends w ; and(ii) A Ş B f p w q “ ∅ . And A Ă C is meagre if and only if there is a function F : N ˆ B ˚ Ñ B ˚ such that: (i) for each n P N there is an A n Ă C such that f n “ F p n, ¨q is a witness to thefact that A n is nowhere dense in C ; and (ii) A “ Ť n P N A n . Definition . Let A be a subset of C and let f : B ˚ Ñ B ˚ be a computablefunction. Then A is effectively nowhere dense via f if for each w P B ˚ :i) f p w q extends w ;ii) A Ş B f p w q “ ∅ . See also Blum and Blum (1975, 127), who attribute the complexity-theoretic condition independently toAdleman. As Blum and Blum remark, this result shows “in essence, that the extrapolable sequences are theones that can be computed rapidly.” efinition . A subset A of C is effectively meagre if there is a computablefunction F : N ˆ B ˚ Ñ B ˚ such that:i) for each n P N , there is an A n Ă C such that A n is effectively nowhere dense via f n “ F p n, ¨q ;ii) A Ť n P N A n . The complement in C of an effectively meagre subset of C is called effectively co-meagre . Proposition . The family of effectively meagre subsets of C is closed underthe following operations:a) taking subsets;b) taking finite unions;c) taking effective unions. Proof.
The first claim is immediate from the definition and the second follows from the third.So suppose that that M is a subset of C such that there exist a computable H : N ˆ N ˆ B ˚ anda decomposition M “ Ť N i , such that for each k P N , H p k, ¨ , ¨q is a witness to the fact that N k is effectively meagre. There exists, then, for each i P N , a decomposition N i “ Ť N ij suchthat each N ij is effectively nowhere dense in virtue of H p i, j, ¨q . Fix a computable bijection π : N ˆ N Ñ N and let p and p be the computable components of the inverse of π (sothat π p p p k q , p p k qq “ k for all k P N ). Set M k : “ N p p k q ,p p k q and for each w P B ˚ , set F p k, t q : “ H p p p k q , p p k q , w q . Then F : N ˆ B ˚ is computable, M “ Ť M k , and each M k iseffectively nowhere dense in virtue of F p k, ¨q . So M is effectively meagre. (cid:3) Crucially, the set of effectively meagre subsets of C is not closed under arbitrary countableunions due to an effective analog of the Baire Category Theorem. Proposition . Let w be a binary string. Then B w Ş C is not effectivelymeagre. Proof.
Let M = ⋃ M_k be an effectively meagre set with witness F : N × B* → B*. We construct strings w_0, w_1, . . . inductively: w_0 := w; and w_{k+1} := F(k, w_k).0. By construction, each w_k is a proper initial segment of w_{k+1}. Let σ = lim_{n→∞} w_n. Then σ ∈ B_w ∩ C. But for each k, σ ∉ M_k (since σ begins with w_{k+1}), so σ ∉ M. □

In light of these results, it is natural to think of elements of effectively meagre subsets of C as being incomparably less common than elements of effectively co-meagre subsets of C, even when the meagre and co-meagre sets in question are both dense as subsets of C.

Remark. A further reason for this standard practice (due to Lisagor 1981): a subset S of C is effectively meagre if and only if, when the Banach–Mazur game (described in Remark 2.3 above) is played for S, Player II has a winning strategy that is computable. Another reason (due, again, to Mehlhorn 1973): each abstract complexity class is effectively meagre as a subset of the family of total recursive functions.
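The diagonal construction in this proof is itself effective, and a short sketch (ours; representing witnesses as Python functions is an assumption made only for illustration) displays its shape: given a witness F to effective meagreness, repeatedly extending the current string via F yields a computable point of B_w that escapes every piece M_k.

# Illustrative sketch (not from the paper): the diagonalization behind the
# effective Baire Category Theorem. A "witness" F(k, w) returns an extension
# of w such that the basic open set B_{F(k, w)} misses the k-th piece M_k.

def diagonal_prefix(F, w, rounds):
    """Iterate w_{k+1} = F(k, w_k) starting from w_0 = w.

    The infinite sequence extending every w_k lies in B_w and avoids each M_k.
    Here we return only the finite string reached after `rounds` steps.
    """
    s = tuple(w)
    for k in range(rounds):
        s = F(k, s)
    return s

# Toy witness for M_k = { the all-0's sequence } (the same set for every k):
# appending a 1 guarantees that B_{w.1} misses it.
def toy_witness(k, w):
    return w + (1,)

print(diagonal_prefix(toy_witness, (0, 1), 5))  # (0, 1, 1, 1, 1, 1, 1)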
Example . Fix an enumeration M , M , . . . of the Turing ma-chines, with associated acceptable programming system φ , φ , . . . (so that φ k is the partialrecursive function computed by M k ). Following Blum and Blum (1975, 132), we call a se-quence σ P C self-describing if σ has an initial segment of the form 1 k M k . As Blum and Blum note the set S of self-describing sequences is non-trivial: it followsfrom the Recursion Theorem that each computable binary sequence is a finite variant of a elf-describing sequence—so there are arbitrarily complex sequences in S . Fortnow et al. (1998) observe that S is not effectively meagre. For, consider any computablestrategy β : B n Ñ B n that Player II could use to play the Banach–Mazur game for S . For each k P N , let α k be the following strategy that Player I might adopt: on the first turn, play 1 k α k and Player IIplays strategy β determines a unique sequence σ k P C . The map F : p k, ℓ q P N ÞÑ σ k p ℓ q P B is computable. So by the Recursion Theorem, there is a k P N such that σ k is computedby M k . That is: there exists a strategy (namely, α k ) via which Player I can defeat β. So S is not in effectively meagre. Proposition . Let m be an extrapolating machine. N V w p m q (the set of computablesequences weakly NV-learnable by m ) is an effectively meagre subset of C . Proof.
A straightforward adaptation of the proof of Proposition 3.6, appealing to the fact that when m is computable, the map (n, w) ↦ w* (chosen there so that B_{w*} ∩ A_n = ∅) is computable. □

Corollary (Fortnow et al. (1998)). Let m be an extrapolating machine. NV(m) (the set of computable sequences NV-learnable by m) is an effectively meagre subset of C.

So there is a natural sense in which, for any computable extrapolator m, among computable sequences, those (weakly) NV-learnable by m are incomparably less common than those not (weakly) NV-learnable by m. The problem of (weakly) NV-learning computable sequences is formidably difficult.
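The witness at work in the last two results is easy to exhibit: because m is computable, any string can be computably extended by bits that m predicts incorrectly, so every basic open set contains a smaller one consisting of strings most of whose bits are nasty. A minimal sketch (ours, for illustration only):

# Illustrative sketch (not from the paper): the computable witness used to show
# that the sequences weakly NV-learnable by a computable extrapolator m form an
# effectively nowhere dense (hence effectively meagre) set.  Extending w by bits
# that m predicts incorrectly produces a string most of whose bits are "nasty".

def wicked_extension(m, w, extra):
    """Extend w by `extra` bits, each chosen to falsify m's prediction."""
    s = list(w)
    for _ in range(extra):
        s.append(1 - m(tuple(s)))
    return tuple(s)

def all_ones(w):
    return 1

w = (1, 1, 1, 0)
w_star = wicked_extension(all_ones, w, len(w) + 3)   # |w| nasty bits, then 3 more
print(w_star)   # (1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0): every appended bit is a 0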
Corollary . No extrapolating machine can (weakly) NV-learn each self-describing se-quence.
Remark . It is illuminating to situate these results with respect to a couple of results ofFortnow et al. (1998). Fortnow et al. are concerned with the problem of identifying com-putable binary sequences—so they specialize the standard notions of the theory of inductiveidentification of functions to the case of sequences. i) They show via an effective Banach–Mazur argument, that any S P PE X is effectivelymeagre.ii) They observe that the set of self-describing functions (see Example 3.2 above) is in
E X . So any identification class that contains
E X has members that are not effectively meagre.They remark: “Since virtually every inference class is either a subset of PE X or a supersetof
E X the results here settle virtually all open questions that could be raised.” Contact can be made with the present approach by recalling that
PE X “ N V and that
E X Ă N V . So the first result of Fortnow et al. noted above is our Corollary 3.3: theset of sequences NV-learnable by an extrapolating machine is effectively meagre. And sinceevery set in
N V w is effectively meagre, neither N V nor N V is a subset of N V w . In N V w wehave an example of a natural inference class that is neither a subset of PE X nor a supersetof
E X . In this discussion, I will assume that the reader is familiar with the standard notions and notation of workon inductive learning—see, e.g., the classic survey Case and Smith (1983). Consider a learner who is silent until a data set of the form 1 k M k . Fortnow et al. (1998, 145). See Case and Smith (1983): that
PEX “ N V is their Theorem 2.19 (attributed to private communicationsfrom van Leeuwen and B¯arzdi¸nˇs); that EX Ă N V is their Theorem 2.28. n the present setting, in which computable extrapolators (i.e., extrapolating machines) at-tempt to learn computable sequences, we find that generalizing our basic model by allowingmerely partially defined extrapolating machines allows us to crash through a size barrierin a way that loosening our criterion of success by allowing infinitely many errors in thesense of weak NV-learning does not—since every set in N V or in
N V w is effectively meagre,whereas this is not the case for every set in N V or N V . This is the reverse of what wefind if we challenge (possibly computable) extrapolators to NV-learn arbitrary sequences. Inthat setting, in the basic model every learner masters only countably many sequences. Andthis is unchanged if we countenance merely partially defined learners. But if we loosen ourcriterion of success to weak NV-learnability we crash through a cardinality barrier, as eachlearner weakly NV-learns uncountably many sequences.
Remark. Coarse computability implies computable weak NV-learnability. If σ ∈ C differs from a computable sequence σ* only in bits of vanishing asymptotic density, then the extrapolating machine that assumes it is being shown σ* on any input weakly NV-learns σ.

Weak NV-learnability does not imply coarse computability. Let σ_0 be uncomputable. Construct a sequence σ_1 as follows: begin with two copies of the first bit of σ_0, followed by four copies of the second bit of σ_0, . . . , followed by 2^k copies of the k-th bit of σ_0, . . . . Suppose that σ_1 is coarsely computable. Then there must be a computable sequence σ_2 that differs from σ_1 only in a set of bits of vanishing asymptotic density. Define a new sequence σ_3 as follows: make the first bit of σ_3 a 0 if at least one of the first two bits of σ_2 is a 0, otherwise make it a 1; make the second bit of σ_3 a 0 if at least two of the next four bits of σ_2 are 0, otherwise make it a 1; . . . ; make the k-th bit of σ_3 a 0 if at least 2^{k−1} of the next 2^k bits of σ_2 are 0, otherwise make it a 1; . . . . Since σ_2 is computable (by assumption), so is σ_3. But σ_3 is a finite variant of σ_0 and so must be uncomputable. So there can be no such σ_2: σ_1 is not coarsely computable. But σ_1 is weakly NV-learned by the extrapolating machine that predicts the first bit will be a 1 and then subsequently predicts that each bit will be the same as the last bit seen. (Thanks to an anonymous referee for pointing this out.)
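A small sketch (ours, not the paper's) of the final observation in the remark: on a sequence built from constant blocks of lengths 2, 4, 8, . . . , the machine that repeats the last bit seen can err at most once per block, so its empirical error density tends to 0.

# Illustrative sketch (not from the paper): the "repeat the last bit" machine on
# a sequence made of constant blocks of lengths 2, 4, 8, ....  It can be wrong at
# most once per block, so the density of its errors vanishes in the limit.

def copy_last(w):
    return w[-1] if w else 1

def blown_up(bits, length):
    """First `length` bits of: 2 copies of bits[0], 4 of bits[1], 8 of bits[2], ..."""
    out = []
    for k, b in enumerate(bits):
        out.extend([b] * (2 ** (k + 1)))
        if len(out) >= length:
            break
    return tuple(out[:length])

base = (1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1)        # stands in for an arbitrary sequence
sigma = blown_up(base, 4096)
errors = sum(1 for n in range(len(sigma)) if copy_last(sigma[:n]) != sigma[n])
print(errors / len(sigma))   # small: at most one error per block (12 blocks here)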
4. Forecasting

So far we have set our learners the problem of recognizing which binary sequence is being revealed in the data stream—where such recognition consists in becoming good at predicting future bits. In effect, we are picturing that in generating new bits, Nature simply consults a lookup table chosen in advance and that the learner's job is to attempt to guess which of the possible such tables is being used (or, in the case of weak learning, to attempt to come close to guessing the right table, in a certain sense).

We might instead picture a different sort of procedure. Suppose that what Nature has chosen in advance is not a sequence but, rather, a Borel probability measure λ on C and that the learner's data stream is randomly sampled from λ. So we now picture Nature as being equipped with a complete set of biased coins and an instruction manual that says which coin to toss to generate the next bit, given the bits that have been generated so far. To mention just some of the tamest possibilities: Nature may have chosen a Bernoulli measure, in which case the instruction will be to use the same coin to generate each new bit; or Nature may have chosen a measure corresponding to a Markov chain, in which case the coin chosen to generate a new bit will depend only on some fixed finite number of immediately preceding bits; or Nature could have chosen a delta-function measure concentrated on a single sequence, in which case only a maximally biased coin will ever be used.

Definition. A source is a Borel probability measure λ on C.

In what follows, we will think of Nature as having chosen a source from which our learner's data stream is sampled. Recall that for w ∈ B*, we write λ(w) in place of λ(B_w). Similarly, for s = 0, 1 and w ∈ B^n, we will write λ(s | w) for the conditional probability λ gives for the (n+1)-st bit to be s given that the first n bits were given by w.

How should a learner proceed in the setting where the data stream is given by a probabilistic source? In the setting of Section 3, where we were thinking of new bits as being generated by a deterministic process, we asked learners to choose an extrapolator that would allow them to definitively predict at each stage what the next bit would be, given the data seen so far. That approach would be suboptimal in the present setting: if Nature is using the fair coin measure (the Bernoulli measure of bias .5) to generate the data stream, then (with probability one) no extrapolator will do better (or worse) than random in its predictions of the next bit—but the fact that Nature is using this procedure seems like a paradigm example of the sort of thing that we ought to be able to detect by looking at data. This will be possible if we ask agents to choose a forecasting procedure that allows them to issue a forecast probability before each bit is revealed, rather than choosing an extrapolator that at each stage issues definitive predictions regarding the next bit.

A natural way to encode such a strategy for learning would be as a forecasting function: a map µ̃ : B* × B → (0, 1) with the feature that for all w ∈ B*, µ̃(w, 0) + µ̃(w, 1) = 1. Here µ̃(w, s) is to be interpreted as the forecast probability that µ̃ issues for the next bit to be s (s = 0, 1) on being shown the data set w. In fact, it is more convenient to employ a slightly different representation. Note that any µ̃ of the above form induces, for each n, a probability measure µ_n on B^n. Further, for any µ̃ and m ≤ n, µ_m and µ_n will be consistent.
So by the Kolmogorov Consistency Theorem, µ̃ induces a measure µ on C (with µ̃ computable if and only if µ is). (See item (v) of Section 2.1 for the relevant notion of consistency. For a treatment adapted to the special case of measures on C, see Báez-Duarte (1970); for a general treatment, see, e.g., Parthasarathy (1967, chap. V).)

Example. Define µ̃ : B* × B → (0, 1) as follows: if w ∈ B^n contains k 1's, then µ̃(w, 1) = (k + 1)/(n + 2) and µ̃(w, 0) = (n − k + 1)/(n + 2). This map satisfies the condition that for all w ∈ B*, µ̃(w, 0) + µ̃(w, 1) = 1.

Not every measure in P corresponds in this way to such a µ̃: a measure µ ∈ P is associated in this way with a µ̃ of the above form if and only if it is a measure of full support (i.e., it assigns positive weight to each open set—or, equivalently, to each basic open set B_w).

Definition. A forecaster is a Borel probability measure on C of full support. We denote the family of forecasters by F.

Definition. A forecasting machine is a computable Borel probability measure on C of full support. We denote the family of forecasting machines by F.

The aim of a forecaster is to give faithful estimates of the chance of the next bit's being a 0 or a 1, given the data seen so far. We are going to distinguish three criteria for successful next-chance learning. The most restrictive one, due to Blackwell and Dubins (1962), requires that the forecaster eventually offer answers arbitrarily similar to those of the source concerning any (measurable) question that might be asked about the data stream. The intermediate one, due to Kalai and Lehrer (1994), requires that the forecaster's probabilistic predictions concerning the next bit eventually approach the true values arbitrarily closely. The least restrictive one, due to Lehrer and Smorodinsky (1996), relaxes this last requirement by allowing errors, so long as they eventually become arbitrarily rare.
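The example just given is the Laplace–Bayes rule. A minimal sketch (ours, for illustration; the function names are assumptions, and the paper works with the induced measure rather than with code) of the forecasting function and of the weight it induces on a finite string:

# Illustrative sketch (not from the paper): the forecasting function of the
# example above (the Laplace--Bayes rule) and the weight it induces on a string.
from fractions import Fraction

def laplace_forecast(w, s):
    """Forecast probability that the next bit is s, after seeing the string w."""
    n, k = len(w), sum(w)                      # length and number of 1's seen
    p_one = Fraction(k + 1, n + 2)
    return p_one if s == 1 else 1 - p_one

def induced_weight(w):
    """mu(w): the probability the induced measure gives to the basic open set B_w."""
    weight = Fraction(1)
    for i, bit in enumerate(w):
        weight *= laplace_forecast(w[:i], bit)
    return weight

print(laplace_forecast((1, 1, 0, 1), 1))   # 2/3
print(induced_weight((1, 1, 0, 1)))        # 1/2 * 2/3 * 1/4 * 3/5 = 1/20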
Definition. We say that forecaster µ strongly NC-learns source λ (or that λ is strongly NC-learnable by µ) if with λ-probability 1 the data stream σ ∈ C satisfies:

lim_{n→∞} sup_{A ∈ B} |µ(A | σ[n]) − λ(A | σ[n])| = 0

(where B denotes the family of Borel subsets of C).

Definition. We say that forecaster µ NC-learns source λ (or that λ is NC-learnable by µ) if with λ-probability 1 the data stream σ ∈ C satisfies

lim_{n→∞} (µ(s | σ[n]) − λ(s | σ[n])) = 0 for s = 0, 1.

Definition. We say that forecaster µ weakly NC-learns source λ (or that λ is weakly NC-learnable by µ) if with λ-probability 1 the data stream σ ∈ C satisfies

lim_{n∈K, n→∞} (µ(s | σ[n]) − λ(s | σ[n])) = 0 for s = 0, 1,

for some K ⊂ N with asymptotic density one.

Example. If µ is a forecaster, then µ is also a source and it is immediate that µ strongly NC-learns, NC-learns, and weakly NC-learns µ.

Proposition. For any source λ and any forecaster µ, strong NC-learnability of λ by µ implies NC-learnability (but not conversely) and NC-learnability of λ by µ implies weak NC-learnability (but not conversely).

Proof.
Strong NC-learnability implies NC-learnability: in the definition of strong NC-learnability, for each n take A to be the event of the (n+1)-st bit being a 1. To see that the converse is not true, consider the family {λ_p | p ∈ [0, 1]} of Bernoulli measures and let µ be the Laplace–Bayes prior. It is a basic fact about µ that it is statistically consistent for the problem of identifying the bias of a coin from knowledge of outcomes of a sequence of tosses. (See, e.g., Freedman (1963).) It follows that the forecaster µ NC-learns each λ_p. But µ does not strongly NC-learn any λ_p: let E_p be the event that the limiting relative frequency of 1's in the data stream is p; then λ_p(E_p) = 1 while µ(E_p) = 0; so for any w for which λ_p(w) ≠ 0, |µ(E_p | w) − λ_p(E_p | w)| = 1.

Clearly, NC-learnability implies weak NC-learnability. To see that the converse is not true, take µ to be the fair coin measure and take λ to be the source that generates bits as follows: for k = 2^m, s_k is the m-th bit in the binary expansion of π; all other s_k are generated by flipping a fair coin. The forecaster µ weakly NC-learns this λ but does not NC-learn it, since there are large discrepancies between the forecast probabilities and the true probabilities at arbitrarily late times. □

(Investigation of inductive learning as next-chance learning traces back to Solomonoff (1964). Several criteria of success are prevalent in the literature on Solomonoff induction—see Solomonoff (1978), Hutter (2007), and Li and Vitányi (2019). Note that in the deterministic setting of Section 3 above, the distinction between eventually becoming good at answering all questions and eventually becoming good at predicting the next bit collapsed—recall Proposition 3.1 above. For relations between this criterion of success and those alluded to in fn. 34 above, see Ryabko and Hutter (2007).)

We are going to see that relative to each of these three criteria, the problem of next-chance learning is formidably difficult and involves hard choices.
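The consistency fact appealed to in this proof is easy to see numerically. The following simulation sketch (ours; the choice of bias, sample sizes, and random seed is arbitrary) shows the Laplace–Bayes forecast for the next bit approaching the true parameter of a Bernoulli source along a sampled data stream, which is the NC-learning criterion at work; as the proof notes, by contrast, the probability the forecaster assigns to the tail event E_p does not converge to the truth.

# Illustrative sketch (not from the paper): the Laplace--Bayes forecaster's
# next-bit forecasts along a data stream sampled from a Bernoulli(0.7) source.
# The true conditional probability of a 1 is always 0.7; the forecast tends to it.
import random

def laplace_next_one(ones, total):
    """Forecast probability of a 1 after seeing `ones` ones among `total` bits."""
    return (ones + 1) / (total + 2)

random.seed(1)
p = 0.7
ones = 0
for n in range(1, 20001):
    if n in (10, 100, 1000, 10000, 20000):
        print(n, round(laplace_next_one(ones, n - 1), 4))   # drifts toward 0.7
    ones += 1 if random.random() < p else 0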
4.1. Strong NC-learning
A famous result and its converse give a necessary and sufficient condition for a source to be strongly NC-learnable by a forecaster.
Proposition 4.2. If source $\lambda$ is absolutely continuous with respect to forecaster $\mu$ (i.e., $\lambda(A) > 0$ implies $\mu(A) > 0$ for all $A \in \mathcal{B}$), then $\mu$ strongly NC-learns $\lambda$.

Proposition 4.3. If forecaster $\mu$ strongly NC-learns source $\lambda$, then $\lambda$ is absolutely continuous with respect to $\mu$.

Proposition 4.4. Let $\mu$ be a forecaster. The sources strongly NC-learnable by $\mu$ form a dense subset of $\mathcal{P}$ of cardinality $\mathfrak{c}$.

Proof.
Let $w_1, \ldots, w_n$ be binary strings such that $C$ is a disjoint union of the $B_{w_k}$, and let $p_1, \ldots, p_n \in (0,1)$ with $\sum_{k=1}^n p_k = 1$. Since the $B_{w_k}$ partition $C$, each $w \in B^*$ is either one of the $w_k$, or a proper prefix of some of the $w_k$, or a proper extension of one of the $w_k$. We define a map $\bar{\lambda}: B^* \to [0,1]$ as follows:
(a) If $w = w_k$ for some $k$, then $\bar{\lambda}(w) = p_k$.
(b) If $w$ is a prefix of $w_{j_1}, \ldots, w_{j_\ell}$, then $\bar{\lambda}(w) = \sum_{k=1}^{\ell} p_{j_k}$.
(c) If $w$ is of the form $w_k.v$ for some binary string $v$, then $\bar{\lambda}(w) = p_k \cdot \mu(v \mid w_k)$.
It is immediate that $\bar{\lambda}(\emptyset) = 1$ and $\bar{\lambda}(w) = \bar{\lambda}(w.0) + \bar{\lambda}(w.1)$ for each $w \in B^*$. So by the Carathéodory Extension Theorem, $\bar{\lambda}$ extends uniquely to a measure $\lambda$ on $C$ such that $\lambda(B_w) = \bar{\lambda}(w)$ for each $w \in B^*$.

The source $\lambda$ is strongly NC-learnable by $\mu$. For suppose that $A$ is a Borel subset of $C$ with $\lambda(A) > 0$. Since
\[ \lambda(A) = \sum_{k=1}^n \lambda(A \mid w_k)\,\lambda(w_k), \]
there must be some $1 \le \ell \le n$ such that $\lambda(A \mid w_\ell) > 0$. And, since $\lambda(A \cap B_{w_\ell}) = p_\ell \cdot \mu(A \mid w_\ell)$, $\mu(A \mid w_\ell)$ must likewise be positive. Since
\[ \mu(A) = \sum_{k=1}^n \mu(A \mid w_k)\,\mu(w_k) \]
and since $\mu(w_\ell) > 0$ ($\mu$ being a forecaster), this tells us that $\mu(A) > 0$. So $\lambda$ is absolutely continuous with respect to $\mu$, and Proposition 4.2 tells us that $\mu$ strongly NC-learns $\lambda$. And since the $w_k$ and the $p_k$ can be chosen arbitrarily, we construct in this way continuum-many such sources in any finite intersection of sub-basic open sets of $\mathcal{P}$. $\Box$

The next result follows from the stronger Proposition 4.9 below, but we include it here in order to indicate an independent route to establishing it.

Proposition 4.5. For any $\mu \in \mathcal{F}$, the set $S(\mu) \subset \mathcal{P}$ of sources strongly NC-learned by $\mu$ is meagre in $\mathcal{P}$.

Proof.
A classical result tells us that for any measure in $\mathcal{P}$, there is some meagre subset of $C$ to which that measure assigns probability 1. (Szpilrajn (1934, Théorème 1) shows that any non-atomic Borel probability measure on a separable metric space assigns measure 0 to some co-meagre set. Marczewski and Sikorski (1949, fn. 3) observe that this result implies that in a separable metric space without isolated points, every Borel probability measure assigns probability 0 to some co-meagre set; in fact, the hypothesis of separability can be dropped—see, e.g., Zindulka 1999, Corollary 3.7. Note that Szpilrajn = Marczewski.) And Proposition 1 of Dekel and Feinberg (2006) tells us that for any meagre subset of $C$, the set of probability measures that assign that set positive probability is meagre in $\mathcal{P}$. So let $A$ be a meagre subset of $C$ such that $\mu(A) = 1$ and let $\mathcal{P}_A \subset \mathcal{P}$ be the, necessarily meagre, subset of measures that assign $A$ positive probability. By Proposition 4.3, if $\lambda \in S(\mu)$, then $\lambda \in \mathcal{P}_A$. So $S(\mu)$, being a subset of a meagre set, is meagre. $\Box$

Corollary. The set $\{\lambda \in \mathcal{P} \mid \exists \mu \in \mathcal{F}_c \text{ such that } \mu \text{ strongly NC-learns } \lambda\}$ is meagre in $\mathcal{P}$. (There are only countably many forecasting machines, and a countable union of meagre sets is meagre.)

Proposition 4.6. For any $\mu \in \mathcal{F}$, the set $J_\mu \subset \mathcal{F}$ of forecasters that strongly NC-learn at least one source in common with $\mu$ is meagre.

Proof.
Let $N_\mu$ be the set of $\nu \in \mathcal{P}$ such that there is no $\lambda \in \mathcal{P}$ that is absolutely continuous with respect to both $\mu$ and $\nu$. Note that $J_\mu$ is disjoint from $N_\mu \cap \mathcal{F}$. So it suffices to show that $N_\mu$ and $\mathcal{F}$ are both co-meagre subsets of $\mathcal{P}$.

Let $A$ and $\mathcal{P}_A$ be as in the proof of the preceding proposition. Consider an arbitrary $\nu \in \mathcal{P}$. If there is a $\lambda \in \mathcal{P}$ that is absolutely continuous with respect to both $\mu$ and $\nu$, then $\lambda$ must assign the complement of $A$ zero probability (since $\mu$ does), which means that $\nu$ must assign $A$ positive probability (since $\lambda$ does). So the complement of $N_\mu$ is meagre, being a subset of the meagre set $\mathcal{P}_A$.

And $\mathcal{F}$ is co-meagre: the forecasters form a dense $G_\delta$ subset of $\mathcal{P}$; and in any completely metrizable space (such as $\mathcal{P}$), any dense $G_\delta$ subset is co-meagre. (That the forecasters form a dense $G_\delta$ is a special case of a result of Koumoullis (1996); see also Dubins and Freedman (1964, §…).) $\Box$

Corollary. For any forecaster, there is another such that the sets of sources strongly NC-learned by the two forecasters are disjoint.

As usual, we call a sequence $\{\lambda_i\}_{i \in \mathbb{N}}$ of elements of $\mathcal{P}$ uniformly computable in $i$ if there is a computable $F: \mathbb{N} \times B^* \times \mathbb{N} \to \mathbb{Q}$ such that $|\lambda_i(w) - F(i, w, n)| \le 2^{-n}$ for all $i, n \in \mathbb{N}$ and $w \in B^*$.

Proposition 4.7. (a) Let $\mu \in \mathcal{F}$ and let $S$ be a countable subset of $\mathcal{P}$. Then there is a $\mu^* \in \mathcal{F}$ that strongly NC-learns every source in $S$ as well as every source strongly NC-learned by $\mu$.
(b) Let $\mu \in \mathcal{F}_c$ and let $S = \{\lambda_i\}_{i \in \mathbb{N}}$ be a sequence of measures in $\mathcal{P}$ that is uniformly computable in $i$. Then there is a $\mu^* \in \mathcal{F}_c$ that strongly NC-learns every source in $S$ as well as every source strongly NC-learned by $\mu$.

Proof. For part (a), enumerate the members of $S$ as $\lambda_0, \lambda_1, \ldots$ and set
\[ \mu^* = \tfrac{1}{2}\mu + \sum_{k=0}^{\infty} 2^{-(k+2)} \lambda_k. \]
Each $\lambda_k$ is absolutely continuous with respect to $\mu^*$, so by Proposition 4.2, $\mu^*$ strongly NC-learns every source in $S$. And if $\nu$ is a source strongly NC-learned by $\mu$, then by Proposition 4.3, $\nu$ must be absolutely continuous with respect to $\mu$ and hence also with respect to $\mu^*$—so by Proposition 4.2, $\mu^*$ strongly NC-learns $\nu$.

For part (b), we can proceed in the same way. The only thing to check is that if $\mu \in \mathcal{P}$ is computable and $\{\lambda_i\} \subset \mathcal{P}$ is uniformly computable in $i$, then the measure $\mu^*$ as defined above is also computable. To this end, suppose that $F_0: B^* \times \mathbb{N} \to \mathbb{Q}$ and $F_1: \mathbb{N} \times B^* \times \mathbb{N} \to \mathbb{Q}$ are computable, with $|\mu(w) - F_0(w, n)| \le 2^{-n}$ and $|\lambda_i(w) - F_1(i, w, n)| \le 2^{-n}$ for all $w \in B^*$ and $i, n \in \mathbb{N}$. We define $F^*: B^* \times \mathbb{N} \to \mathbb{Q}$ as follows:
\[ F^*(w, n) := \tfrac{1}{2} F_0(w, n+2) + \sum_{k=0}^{n} 2^{-(k+2)} F_1(k, w, n+2). \]
Then for any given $w \in B^*$ and $n \in \mathbb{N}$, we define $\alpha, \beta, \gamma \in \mathbb{R}$:
\[ \alpha := \tfrac{1}{2}\bigl(\mu(w) - F_0(w, n+2)\bigr), \qquad \beta := \sum_{k=0}^{n} 2^{-(k+2)} \bigl(\lambda_k(w) - F_1(k, w, n+2)\bigr), \qquad \gamma := \sum_{k=n+1}^{\infty} 2^{-(k+2)} \lambda_k(w). \]
Note that each of $|\alpha|$, $|\beta|$, and $|\gamma|$ is no greater than $2^{-(n+2)}$. In the case of $|\alpha|$, this follows from what we know about $F_0$. For $|\beta|$, we have
\[ |\beta| \le \sum_{k=0}^{n} 2^{-(k+2)} \, |\lambda_k(w) - F_1(k, w, n+2)| \le 2^{-(n+2)} \sum_{k=0}^{n} 2^{-(k+2)} < 2^{-(n+2)}. \]
And since for each $k$ we have $0 \le \lambda_k(w) \le 1$, we have that $|\gamma| \le \sum_{k=n+1}^{\infty} 2^{-(k+2)} = 2^{-(n+2)}$. Now, $F^*$ is computable and we have
\[ |\mu^*(w) - F^*(w, n)| = |\alpha + \beta + \gamma| \le |\alpha| + |\beta| + |\gamma| < 2^{-n}. \]
So $\mu^*$ is computable. $\Box$

Thus we have a no-free-lunch result for strong NC-learning: every forecaster strongly NC-learns an uncountable and dense but meagre set of sources; for every (computable) forecaster there is another (computable) forecaster that strongly NC-learns everything it does, plus a further countably infinite set of sources; and for every forecaster there is another that strongly NC-learns a disjoint set of sources (indeed, typical forecasters have this feature).
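Before turning to NC- and weak NC-learning, here is a small Python sketch of the error bookkeeping in the proof of Proposition 4.7(b) (an illustration only, using the constants as reconstructed above and invented function names): it computes the rational approximation $F^*(w, n)$ of the mixture $\mu^*$ from approximators for $\mu$ and the $\lambda_k$.

    from fractions import Fraction

    def mixture_approx(F0, F1, w: str, n: int) -> Fraction:
        """Approximate mu* = (1/2) mu + sum_k 2^-(k+2) lambda_k to within 2^-n,
        given F0(w, m) ~ mu(w) and F1(k, w, m) ~ lambda_k(w), each accurate to
        within 2^-m.  Truncating the sum at k = n contributes at most 2^-(n+2);
        the two approximation errors contribute at most 2^-(n+2) each."""
        total = Fraction(1, 2) * F0(w, n + 2)
        for k in range(n + 1):
            total += Fraction(1, 2 ** (k + 2)) * F1(k, w, n + 2)
        return total

    # toy inputs: mu = fair coin, lambda_k = point mass on the constant (k % 2)-sequence
    F0 = lambda w, m: Fraction(1, 2 ** len(w))
    F1 = lambda k, w, m: Fraction(1) if set(w) <= {str(k % 2)} else Fraction(0)
    print(mixture_approx(F0, F1, "11", 10))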
4.2. NC-Learning and Weak NC-Learning
Proposition 4.5 above tells us that each forecaster strongly NC-learns a dense and uncountable but meagre set of sources. This implies that the sets of sources NC-learned and weakly NC-learned by any forecaster are also dense and uncountable. Strong NC-learning is, intuitively, a much more restrictive notion than NC-learning: being able to accurately answer all questions about the data stream, including questions about the infinite future, is a much more demanding standard than being able to accurately estimate the chances for the next bit. (Indeed, there is a sense in which strong NC-learning implies rapid NC-learning, and a sense in which the converse implication holds—see Sandroni and Smorodinsky (1999, Propositions 2 and 3).) Similarly, NC-learning is, intuitively, a much more restrictive notion than weak NC-learning: we saw in Section 3 above that weakening NV-learning by allowing an infinite number of errors (so long as they were of asymptotic density zero) made a marked difference to the size of the set of sequences that a given extrapolator could learn—any extrapolator NV-learns a countable set of sequences but weakly NV-learns an uncountable set of sequences. So it is not obvious that the set of sources (weakly) NC-learned by a given forecaster should always be meagre. (Noguchi (2015, 433), after discussing the results cited in fn. 41 above, remarks that: "These results lead us to conjecture that, in general, a merged set (of probability measures) [i.e., a set of sources strongly NC-learned by a given forecaster] may be much smaller than a weakly merged set [i.e., a set of sources NC-learned by a given forecaster]." He then goes on to observe that each forecaster strongly NC-learns a meagre set of sources—so it is natural to read him as conjecturing that the set of sources NC-learned by a forecaster need not be meagre.) Not obvious—but, as we will see, nonetheless true.

We begin by introducing a basis $\mathcal{W}$ for the weak topology on $\mathcal{P}$. First, for each $k \in \mathbb{N}$, we fix a metric on $\mathcal{P}_k$ compatible with its topology: we take the distance between $\lambda, \mu \in \mathcal{P}_k$ to be given by
\[ d(\mu, \lambda) := \max_{w \in B^k} |\mu(w) - \lambda(w)|. \]
In terms of our identification of $\mathcal{P}_k$ with a closed subset of $\mathbb{R}^{2^k}$, this is the metric induced by the $\ell^\infty$ norm on $\mathbb{R}^{2^k}$. For $\mu \in \mathcal{P}_k$ and $\varepsilon > 0$, we write $B(\mu, \varepsilon)$ for the open metric ball of radius $\varepsilon$ centred at $\mu$: $B(\mu, \varepsilon) := \{\lambda \in \mathcal{P}_k \mid d(\mu, \lambda) < \varepsilon\}$. We call $B(\mu, \varepsilon) \subset \mathcal{P}_k$ rational if $\varepsilon \in \mathbb{Q}$ and $\mu(w) \in \mathbb{Q}$ for each $w \in B^k$.

For each $k \in \mathbb{N}$, let $\Pi_k: \mathcal{P} \to \mathcal{P}_k$ be the restriction map: for $\mu \in \mathcal{P}$, $\Pi_k(\mu)$ is the measure in $\mathcal{P}_k$ such that $\Pi_k(\mu)(w) = \mu(w)$ for each $w \in B^k$. We now take $\mathcal{W}$ to comprise the inverse images under the $\Pi_k$ of the open rational metric balls in the various $\mathcal{P}_k$:
\[ \mathcal{W} := \{\, W = \Pi_k^{-1}(B(\mu, \varepsilon)) \mid k \in \mathbb{N},\ \mu \in \mathcal{P}_k,\ \mu(w) \in \mathbb{Q}\ \forall w \in B^k,\ \varepsilon \in \mathbb{Q},\ \varepsilon > 0 \,\}. \]

Proposition 4.8. $\mathcal{W}$ is a basis for the weak topology on $\mathcal{P}$.

Proof. It suffices to show: (i) that each $W \in \mathcal{W}$ is open; and (ii) that for any open set $U \subset \mathcal{P}$ and for any $\nu \in U$, there is a $W \in \mathcal{W}$ with $\nu \in W \subset U$.

(i) Fix $W \in \mathcal{W}$ of the form $W = \Pi_k^{-1}(B(\mu, \varepsilon))$. Let $w_1, \ldots, w_{2^k}$ be an enumeration of the $k$-bit strings, and for each $1 \le j \le 2^k$, let $p_j := \max\{0, \mu(w_j) - \varepsilon\}$ and $q_j := \min\{1, \mu(w_j) + \varepsilon\}$. Then we have
\[ W = \Bigl\{\lambda \in \mathcal{P} \ \Big|\ \max_{1 \le j \le 2^k} |\mu(w_j) - \lambda(w_j)| < \varepsilon \Bigr\} = \bigcap_{j=1}^{2^k} S_{w_j, p_j, q_j}, \]
where each $S_{w_j, p_j, q_j}$ is a sub-basic open subset of $\mathcal{P}$ (as in item (vi) of Section 2.1 above). So $W$ is an open subset of $\mathcal{P}$.

(ii) It suffices to consider an open set $U \subset \mathcal{P}$ that is a finite intersection of sub-basic open sets. Let $S_{w_1, p_1, q_1}, \ldots, S_{w_n, p_n, q_n}$ be arbitrary sub-basic open subsets of $\mathcal{P}$ and suppose that $U := \bigcap_{k=1}^n S_{w_k, p_k, q_k} \neq \emptyset$. Let $N = \max\{|w_1|, \ldots, |w_n|\}$ and let $\nu \in U$. Note that each $\Pi_N(S_{w_j, p_j, q_j})$ is an open subset of $\mathcal{P}_N$: each condition of the form $p_j < \Pi_N(\mu)(w_j) < q_j$ just imposes an inequality on (sums of) coordinates relative to our identification of $\mathcal{P}_N$ with a subset of $\mathbb{R}^{2^N}$. So we can find a rational open metric ball $B$ contained in $\Pi_N(U)$ with $\Pi_N(\nu) \in B$. Letting $W := \Pi_N^{-1}(B) \in \mathcal{W}$, we have $\nu \in W \subset U$. $\Box$

Proposition 4.9. Let $\mu$ be a forecaster. The sources weakly NC-learnable by $\mu$ form a meagre subset of $\mathcal{P}$.

Proof.
For any source $\lambda$ and $k \in \mathbb{N}$, let us say that $(\mu, \lambda)$ considers $k$ bad if for each $w \in B^k$ we have $|\mu(s \mid w) - \lambda(s \mid w)| \ge 1/4$ for $s = 0, 1$. And let us say that $(\mu, \lambda)$ considers $k$ super-bad if $(\mu, \lambda)$ considers more than half of the $m \le k$ to be bad. And, by extension, for any subset $S \subset \mathcal{P}$, let us say that $(\mu, S)$ considers $k \in \mathbb{N}$ (super-)bad if $(\mu, \lambda)$ does for each $\lambda \in S$.

For each $n \in \mathbb{N}$, let $F_n$ be the set of $\lambda \in \mathcal{P}$ such that $(\mu, \lambda)$ considers at least $n$ natural numbers to be super-bad, and let $A_n$ be the complement of $F_n$ in $\mathcal{P}$. Note that $\mu$ cannot NC-learn $\lambda$ if there are infinitely many $k \in \mathbb{N}$ that $(\mu, \lambda)$ considers bad, and that $\mu$ cannot weakly NC-learn $\lambda$ if there are infinitely many $k \in \mathbb{N}$ that $(\mu, \lambda)$ considers super-bad. So if $\mu$ weakly NC-learns $\lambda$, then $(\mu, \lambda)$ can consider only finitely many natural numbers to be super-bad, which means that there will be an $N$ such that $\lambda \notin F_N$, which implies that $\lambda \in A := \bigcup_{n \in \mathbb{N}} A_n$. So in order to establish our proposition, it suffices to show that each $A_n$ is nowhere dense in $\mathcal{P}$.

The first step is to suppose that we are given a set $W \in \mathcal{W}$ of the form $\Pi_k^{-1}(B(\lambda, \varepsilon))$ and to show how to find $W_1 \in \mathcal{W}$ of the form $W_1 = \Pi_{k+1}^{-1}(B(\lambda_1, \varepsilon_1))$, such that $W_1 \subset W$ and $(\mu, W_1)$ considers $k$ to be bad. (Shrinking $\varepsilon$ and perturbing $\lambda$ slightly if necessary, we may assume that $\lambda(w) > 0$ for each $w \in B^k$.) For each $w \in B^k$, if $\mu(w.0) \ge \mu(w.1)$, we set $\lambda_1(w.0) = \frac{1}{10} \cdot \lambda(w)$ and $\lambda_1(w.1) = \frac{9}{10} \cdot \lambda(w)$; otherwise we set $\lambda_1(w.0) = \frac{9}{10} \cdot \lambda(w)$ and $\lambda_1(w.1) = \frac{1}{10} \cdot \lambda(w)$. This gives us a well-defined $\lambda_1 \in \mathcal{P}_{k+1}$ that assigns rational values to each string in $B^{k+1}$. We now select $m$ large enough so that $\varepsilon_1 := 2^{-m}$ is small enough that $W_1 := \Pi_{k+1}^{-1}(B(\lambda_1, \varepsilon_1)) \subset W$ and that for any $\lambda'$ in $B(\lambda_1, \varepsilon_1)$, for each $w \in B^k$: if $\mu(w.0) \ge \mu(w.1)$, then $\lambda'(w.1) > \frac{3}{4} \cdot \lambda'(w)$, and if $\mu(w.0) < \mu(w.1)$, then $\lambda'(w.0) > \frac{3}{4} \cdot \lambda'(w)$. Then for every source in $W_1$ and every $w \in B^k$, the conditional probability of the bit that $\mu$ considers no more likely exceeds $3/4$, while $\mu$ assigns that bit probability at most $1/2$: so $(\mu, W_1)$ considers $k$ to be bad.

This process can be iterated. In particular, if we are given $W \in \mathcal{W}$ of the form $\Pi_k^{-1}(B(\lambda, \varepsilon))$, we can run the process once to construct $W_1 \in \mathcal{W}$ with $W_1 \subset W$ such that $(\mu, W_1)$ considers $k$ to be bad; running the process again (with $W_1$ in place of $W$) yields a $W_2 \in \mathcal{W}$ with $W_2 \subset W_1$ such that $(\mu, W_2)$ considers $k+1$ to be bad; and so on. So given any $W \in \mathcal{W}$ of the form $\Pi_k^{-1}(B(\lambda, \varepsilon))$, we can run the process $k + n$ times to yield $W^* := W_{k+n} \in \mathcal{W}$ such that $W^* \subset W$ and $(\mu, W^*)$ considers at least $n$ numbers to be super-bad, so that $W^* \cap A_n = \emptyset$. So $A_n$ is nowhere dense in $\mathcal{P}$. $\Box$

Corollary. The set $\{\lambda \in \mathcal{P} \mid \exists \mu \in \mathcal{F}_c \text{ such that } \mu \text{ weakly NC-learns } \lambda\}$ is meagre in $\mathcal{P}$.

It of course follows that the sets of sources (strongly) NC-learnable by a given forecaster are likewise meagre—and that the set of all sources collectively (strongly) NC-learnable by forecasting machines is likewise meagre.
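As a small illustration of the bookkeeping in the proof above (a sketch only; the $1/4$ threshold follows the reconstruction given there, and the measures are assumed to be given exactly as maps from strings to the weights of the corresponding basic open sets), one can check whether a pair $(\mu, \lambda)$ considers a number bad or super-bad:

    from itertools import product

    def conditional(m, w: str, s: str):
        """m(s | w) = m(w + s) / m(w); assumes m(w) > 0 (full support)."""
        return m(w + s) / m(w)

    def considers_bad(mu, lam, k: int, eps=0.25) -> bool:
        """(mu, lam) considers k bad: for EVERY w of length k, the two next-bit
        forecasts differ by at least eps, for both s = 0 and s = 1."""
        for bits in product("01", repeat=k):
            w = "".join(bits)
            for s in "01":
                if abs(conditional(mu, w, s) - conditional(lam, w, s)) < eps:
                    return False
        return True

    def considers_super_bad(mu, lam, k: int) -> bool:
        """More than half of the m <= k (taking m = 1, ..., k) are bad."""
        return sum(considers_bad(mu, lam, m) for m in range(1, k + 1)) > k / 2

    # Example: the fair coin versus a heavily 1-biased i.i.d. source.
    fair = lambda w: 0.5 ** len(w)
    biased = lambda w: (0.9 ** w.count("1")) * (0.1 ** w.count("0"))
    print(considers_bad(fair, biased, 2), considers_super_bad(fair, biased, 3))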
Remark. The set of sources not even weakly NC-learnable by a given forecaster $\mu$ is a co-meagre subset of $\mathcal{P}$, and so is uncountable. A variant on the proof of the above proposition shows that, even if the continuum hypothesis fails, this set has the cardinality of the continuum. Fix a metric on $\mathcal{P}$ compatible with the weak topology, such as the Prokhorov metric. (For details see, e.g., Billingsley (1999, 72 f.).) And let us amend the iterative procedure of the proof of the preceding proposition so that the diameter of $W_{k+1}$ relative to this metric is no more than half of the diameter of $W_k$. Then if we are given $W \in \mathcal{W}$ and repeatedly apply our revised iterative procedure, we will construct a sequence $W_1, W_2, \ldots$ of nested $\mathcal{W}$-sets such that $\bigcap_{k=1}^{\infty} W_k$ contains a single source, which is not even weakly NC-learnable by $\mu$. There are continuum-many distinct sequences of this kind that we could construct in this way—and these determine continuum-many distinct sources not even weakly NC-learnable by $\mu$.

Proposition 4.10. Let $\mu$ be a forecaster. Then there is a second forecaster $\mu^\dagger$ such that the sets of sources weakly NC-learned by $\mu$ and by $\mu^\dagger$ are disjoint. If $\mu$ is computable, we can take $\mu^\dagger$ to be likewise computable.

Proof.
Let $\mu$ be given. We construct a map $\nu: B^* \to [0,1]$ inductively as follows:
(a) $\nu(\emptyset) = 1$;
(b) if $\nu(w)$ is given, we define $\nu(w.0)$ and $\nu(w.1)$ as follows:
(i) if $\mu(0 \mid w) \le \mu(1 \mid w)$, then $\nu(w.0) = \frac{3}{4} \cdot \nu(w)$ and $\nu(w.1) = \frac{1}{4} \cdot \nu(w)$;
(ii) if $\mu(0 \mid w) > \mu(1 \mid w)$, then $\nu(w.0) = \frac{1}{4} \cdot \nu(w)$ and $\nu(w.1) = \frac{3}{4} \cdot \nu(w)$.
Clearly, for any $w \in B^*$, $\nu(w) = \nu(w.0) + \nu(w.1)$. So by the Carathéodory Extension Theorem, $\nu$ extends to a unique Borel probability measure on $C$, which we take as our $\mu^\dagger$.

For any non-empty $w \in B^*$, $|\mu(1 \mid w) - \mu^\dagger(1 \mid w)| \ge 1/4$. So for any $\lambda \in \mathcal{P}$, any $\sigma \in C$, and $n \in \mathbb{N}$ we have
\[ \max\bigl\{\, |\mu(1 \mid \sigma[n]) - \lambda(1 \mid \sigma[n])| \,,\, |\mu^\dagger(1 \mid \sigma[n]) - \lambda(1 \mid \sigma[n])| \,\bigr\} \ge 1/8. \]
So there can be no $\lambda \in \mathcal{P}$ such that for every $\sigma$ in a set of $\lambda$-measure one, there is a set $K$ of natural numbers of asymptotic density 1 such that, for sufficiently large $n \in K$, $\mu(1 \mid \sigma[n])$ and $\mu^\dagger(1 \mid \sigma[n])$ are both arbitrarily close to $\lambda(1 \mid \sigma[n])$—i.e., there is no source $\lambda$ that is weakly NC-learned by both $\mu$ and $\mu^\dagger$. $\Box$

Of course, it follows that $\mu$ and $\mu^\dagger$ also (strongly) NC-learn disjoint sets of sources. So we have no-free-lunch theorems for (weak) NC-learning: each forecaster, computable or not, (weakly) NC-learns an uncountable and dense but meagre set of sources; and for each (computable) forecaster there is another that (weakly) NC-learns a disjoint set of sources.

Remark. Lehrer and Smorodinsky (1996, Corollary 6) show that if $\mu \in \mathcal{F}$ NC-learns $\lambda \in \mathcal{P}$, then any nontrivial mixture of $\mu$ with any $\nu \in \mathcal{P}$ weakly NC-learns $\lambda$. Ryabko and Hutter (2007, Proposition 10) show that this result is sharp: they give an example of measures $\mu$, $\nu$, and $\lambda$ where $\mu$ NC-learns $\lambda$ but any non-trivial mixture of $\mu$ and $\nu$ merely weakly NC-learns $\lambda$. So there is no prospect of using the strategy of the proof of Proposition 4.7 above to prove an analogous result for NC-learning.
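A minimal sketch of the evil-twin construction in the proof of Proposition 4.10 (not from the paper; it assumes $\mu$ is given exactly as a map from strings to the weights $\mu(B_w)$, and uses the $3/4$–$1/4$ split as reconstructed above):

    from fractions import Fraction

    def evil_twin(mu):
        """Return nu = mu-dagger: at each node, nu places weight 3/4 on the bit that
        mu considers no more likely than the other, and 1/4 on the remaining bit."""
        def nu(w: str) -> Fraction:
            weight, prefix = Fraction(1), ""
            for bit in w:
                mu_prefers_one = mu(prefix + "0") <= mu(prefix + "1")
                if (bit == "1") == mu_prefers_one:
                    weight *= Fraction(1, 4)   # the bit favoured by mu gets weight 1/4
                else:
                    weight *= Fraction(3, 4)   # the bit disfavoured by mu gets weight 3/4
                prefix += bit
            return weight
        return nu

    # The evil twin of the fair coin always forecasts the next bit to be 0 with probability 3/4.
    fair = lambda w: Fraction(1, 2) ** len(w)
    dagger = evil_twin(fair)
    print(dagger("0"), dagger("00"), dagger("1"))   # 3/4, 9/16, 1/4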
4.3. Forecasting of Computable Sources
Let us now specialize to the problem of (strong, weak) NC-learning for computable forecasters facing data streams generated by computable sources; we write $\mathcal{P}_c$ for the set of computable sources. For $\mu \in \mathcal{F}_c$, we denote by $NC(\mu)$ the set of computable sources that are NC-learned by $\mu$, and we use $\mathbf{NC}$ to denote
\[ \{\, S \subset \mathcal{P}_c \mid \exists \mu \in \mathcal{F}_c \text{ with } S \subseteq NC(\mu) \,\}. \]
Let us likewise use $NC_s(\mu)$ and $NC_w(\mu)$ to denote the sets of computable sources strongly NC-learned and weakly NC-learned by the forecasting machine $\mu$, and use $\mathbf{NC}_s$ and $\mathbf{NC}_w$ to denote the classes of subsets of $\mathcal{P}_c$ that can be strongly/weakly NC-learned by some forecasting machine. It is immediate from the definitions that $\mathbf{NC}_s \subseteq \mathbf{NC}$ and that $\mathbf{NC} \subseteq \mathbf{NC}_w$. A variant on the proof of Proposition 3.9 shows that the latter containment is proper.

Proposition 4.11. $\mathbf{NC} \subsetneq \mathbf{NC}_w$.

Proof.
Let $V$ be the subset of $\mathcal{P}$ consisting of $\delta$-function measures concentrated on computable binary sequences in which 0's have vanishing asymptotic density. Let $\mu \in \mathcal{F}_c$ be the measure that, on input of any $w \in B^n$, considers the chance of seeing a 0 next to be $2^{-(n+1)}$. We have $V \subseteq NC_w(\mu)$, so $V \in \mathbf{NC}_w$. But suppose that $V \subseteq NC(\nu)$ for some $\nu \in \mathcal{F}_c$. We define $\sigma \in C$ as follows: $\sigma$ is of the form $1^{n_1}.0.1^{n_2}.0.1^{n_3}.0\ldots$, where each $n_j$ is chosen to be the smallest $n$ larger than $2^j$ such that $\nu(1 \mid 1^{n_1}.0.1^{n_2}.0.\ldots.1^{n_{j-1}}.0.1^{n}) > .9$ (such an $n_j$ must exist, since by assumption $\nu$ NC-learns each delta-function measure concentrated on a sequence containing only finitely many 0's). The delta-function measure concentrated on $\sigma$ is in $V$. But it is not in $NC(\nu)$, since when fed $\sigma$, there are infinitely many occasions on which $\nu$ issues forecast probabilities for seeing a 0 next of less than .1, when the true chance is 1. So $V \notin \mathbf{NC}$. $\Box$

Proposition 4.12. For any $\mu \in \mathcal{F}_c$, the following are dense subsets of $\mathcal{P}_c$:
(a) $NC_s(\mu)$.
(b) The complement of $NC_s(\mu)$ in $\mathcal{P}_c$.
(c) $NC(\mu)$.
(d) The complement of $NC(\mu)$ in $\mathcal{P}_c$.
(e) $NC_w(\mu)$.
(f) The complement of $NC_w(\mu)$ in $\mathcal{P}_c$.

Proof.
The claim concerning (a) follows via straightforward adaptation of the proof of Proposition 4.4, while that concerning (f) follows from Proposition 4.15 below. The other sets listed are supersets of the set in (a) or of the set in (f). $\Box$
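The diagonal construction in the proof of Proposition 4.11 above can also be sketched in code. The following illustration (not from the paper) uses the Laplace-rule forecaster as a stand-in for $\nu$ and the $0.9$ threshold as reconstructed above; it builds an initial segment of a sequence $\sigma$ on which the forecaster keeps being badly wrong about the inserted 0's:

    from fractions import Fraction

    def laplace_next_one(w: str) -> Fraction:
        """Stand-in for nu's forecast that the next bit is a 1."""
        return Fraction(w.count("1") + 1, len(w) + 2)

    def diagonal_prefix(forecaster, rounds: int) -> str:
        """Build 1^{n_1} 0 1^{n_2} 0 ...: in round j, append 1's until the block is
        longer than 2**j and the forecaster assigns probability > 0.9 to a further 1,
        then append a 0 (which the forecaster expected with probability < 0.1)."""
        sigma = ""
        for j in range(1, rounds + 1):
            block = 0
            while block <= 2 ** j or forecaster(sigma) <= Fraction(9, 10):
                sigma += "1"
                block += 1
            sigma += "0"
        return sigma

    print(diagonal_prefix(laplace_next_one, 3))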
For any of our senses of probabilistic learning, and for any computable forecaster, that forecaster succeeds in learning a countable infinity of computable sources and fails to learn a countable infinity of computable sources. So we have parity between the learnable and the unlearnable at the level of cardinality. And this parity again persists at the level of topology: a version of Sierpiński's Theorem tells us that, up to homeomorphism, there is only one countable metrizable topological space without isolated points—see Neumann (1985, §…). But, as in Section 3, the notion of effective meagreness can be brought to bear, this time with the elements of the basis $\mathcal{W}$ for $\mathcal{P}$ of Section 4.2 playing the role that the basic open sets $B_w$ played in our discussion of effectively meagre subsets of $C$. And, as in the case of next-value learning, we find that, for our species of next-chance learning, this notion allows us to isolate a sense in which failure is incomparably more common than success.

Recall that elements of $\mathcal{W}$ are specified by specifying a natural number $k$, a rational-valued measure $\mu \in \mathcal{P}_k$ (which is determined in turn by specifying the values that it assigns to each $w \in B^k$), and a rational $\varepsilon > 0$. So the elements of $\mathcal{W}$ can be effectively represented by binary strings. In the following definitions we take such a coding scheme to be fixed.

Definition. Let $A$ be a subset of $\mathcal{P}$ and let $f: \mathcal{W} \to \mathcal{W}$ be a computable function. Then $A$ is effectively nowhere dense via $f$ if for each $W \in \mathcal{W}$:
(i) $f(W) \subset W$;
(ii) $A \cap f(W) = \emptyset$.

Definition. A subset $A$ of $\mathcal{P}$ is effectively meagre if there is a computable $F: \mathbb{N} \times \mathcal{W} \to \mathcal{W}$ such that:
(i) for each $n \in \mathbb{N}$ there is an $A_n$ that is effectively nowhere dense via $F(n, \cdot): \mathcal{W} \to \mathcal{W}$;
(ii) $A = \bigcup A_n$.
The complement of an effectively meagre subset of $\mathcal{P}$ is effectively co-meagre.

The proofs of Propositions 3.12 and 3.13 are easily adapted to yield:
Proposition 4.13. The set of effectively meagre subsets of $\mathcal{P}$ is closed under the following operations:
(i) taking subsets;
(ii) taking finite unions;
(iii) taking effective unions.

Proposition 4.14. For any $W \in \mathcal{W}$, the set $W \cap \mathcal{P}_c$ is not effectively meagre.

So it is again natural to consider the elements of effectively meagre subsets of $\mathcal{P}$ to be incomparably less common than the elements of effectively co-meagre subsets of $\mathcal{P}$.

Proposition 4.15. For any $\mu \in \mathcal{F}_c$, $NC_w(\mu)$ is an effectively meagre subset of $\mathcal{P}$.

Proof.
A straightforward adaptation of the proof of Proposition 4.9, appealing to the fact that, when $\mu$ is computable, the map $(n, W) \mapsto W^*$ (with $W^* \subset W$ and $W^* \cap A_n = \emptyset$) used there is computable. $\Box$

So there is a natural sense in which, for any computable forecaster $\mu$, among computable sources, those weakly NC-learnable by $\mu$ are incomparably less common than those not weakly NC-learnable by $\mu$. And, a fortiori, those computable sources (strongly) NC-learnable by $\mu$ are incomparably less common than those not (strongly) NC-learnable by $\mu$. Learning in this setting is formidably difficult. And hard choices must be made: the proof of Proposition 4.10 above shows that each $\mu \in \mathcal{F}_c$ has an evil twin $\mu^\dagger \in \mathcal{F}_c$ such that the two weakly NC-learn disjoint sets of measures.

Remark. In Remark 3.3 above, we saw that liberalizing our notion of NV-learning of computable sequences by allowing merely partial recursive extrapolators made an interesting difference: while every set in the class $\mathbf{NV}_c$ (for computable extrapolators) is effectively meagre, this is not true of every set in the corresponding class for partial recursive extrapolators (let alone of every set in $\mathbf{NV}$). So it is natural to consider enlarging the class of next-chance learners beyond the set of computable forecasters. For definiteness, let us consider the possibility of generalizing our notion of strong NC-learning. Let $X$ be a class of objects that includes $\mathcal{F}_c$ as a subset and for which the definition of strong NC-learning above makes sense when we quantify over members of $X$ rather than measures in $\mathcal{F}_c$. For $\mu \in X$, let $NC_s^X(\mu)$ be the set of computable sources learned in this generalized sense by $\mu$, and let $S \subset \mathcal{P}_c$ count as a member of $\mathbf{NC}_s^X$ just in case there is some $\mu \in X$ that learns every member of $S$ in this sense. A trivial example: $\mathcal{P}_c \in \mathbf{NC}_s^{\mathcal{F}}$, since $\mathcal{F}$, unlike $\mathcal{F}_c$, includes a measure $\mu$ that is a linear superposition of all the computable measures (so that each computable measure is absolutely continuous with respect to $\mu$). We are in search of more interesting examples.

Recall that the Carathéodory Extension Theorem allows us to identify probability measures on $C$ with maps $\mu: B^* \to [0,1]$ satisfying $\mu(\emptyset) = 1$ and $\mu(w) = \mu(w.0) + \mu(w.1)$ for all $w \in B^*$, so that $\mu \in \mathcal{P}$ is computable if and only if there exists a computable $F: B^* \times \mathbb{N} \to \mathbb{Q}$ such that $|\mu(w) - F(w, n)| < 2^{-n}$ for all $w \in B^*$ and $n \in \mathbb{N}$.

One way to loosen our restriction to computable measures would be to countenance computable semi-measures: computable maps $\mu: B^* \to [0,1]$ satisfying $\mu(\emptyset) \le 1$ and $\mu(w) \ge \mu(w.0) + \mu(w.1)$ for all $w \in B^*$. But this gets us nowhere. Normalizing a computable semi-measure yields a computable measure—and any source strongly learned by the semi-measure will be learned by its normalization. (See the discussion of Solomonoff normalization in Li and Vitányi (2019, §…).)

Another way to loosen our restrictions would be to allow merely lower semi-computable measures, where $\mu: B^* \to [0,1]$ is lower semi-computable if and only if there exists a partial recursive $\varphi: B^* \times \mathbb{N} \to [0,1]$ with $\mu(w) = \lim_{\ell \to \infty} \varphi(w, \ell)$ and $\varphi(w, k) \le \varphi(w, k+1)$ for all $w \in B^*$ and $k \in \mathbb{N}$. But this again leads nowhere: every lower semi-computable measure on $C$ is computable (Li and Vitányi (2019, Lemma 4.5.1)).

However, if we allow arbitrary lower semi-computable semi-measures to count as next-chance learners, then we make spectacular gains: Solomonoff (1978) constructed lower semi-computable semi-measures that strongly NC-learn every computable source. So if we let $Sol$ be the class of lower semi-computable semi-measures, then $\mathbf{NC}_s^{Sol}$ is the power set of $\mathcal{P}_c$. But this vast gain in learning power comes at a price: although Solomonoff's learners are lower semi-computable as maps that associate probabilities with binary strings, they are merely limit computable as maps that take as input a binary string and give as output the conditional probability for the next bit to be a 0 or a 1 (the ratio of two lower semi-computable numbers need not be lower semi-computable). (See Leike and Hutter (2015) and Sterkenburg (2019).) Indeed, Sterkenburg (2019) shows that if $X$ contains only objects that induce semi-computable conditional probability maps, then $\mathcal{P}_c \notin \mathbf{NC}_s^X$. Question: is it possible to generalize Proposition 4.15 to show that for such $X$, every set in $\mathbf{NC}_s^X$ is effectively meagre?

5. Discussion
Over the course of the last century, it became widely understood that successful inductive learning is possible only against a background of biases that favour some hypotheses over others. (See, e.g., Jeffreys (1933, 524 f.), Kuhn (1963, 3 ff.), and Chomsky (1965, §§…).) No-free-lunch results substantiate this insight. (That is one of their uses. Of the results canvassed in Section 1 above, Putnam's was designed to expose a serious flaw in objective Bayesian approaches in the tradition of Carnap (1945), while that of Shalev-Shwartz and Ben-David was devised to provide an elegant motivation for the definition of VC-dimension.) If we don't presuppose anything about the binary sequence being revealed to us, then we face a formidably difficult learning problem: no matter what approach to learning we adopt, among the situations that we might face, those in which we fail are incomparably more common than those in which we succeed. And no approach dominates all rivals in its range of success: for any approach, there are others that succeed in situations in which the given one fails; indeed, for any approach, there is another that succeeds in a disjoint set of situations. To adopt an approach to learning is to make a bet about what the world is like.

No-free-lunch results place upper bounds on our reasonable ambitions. Suppose that one is interested in the question: Why should someone interested in arriving at the truth proceed inductively (expecting the future to be like the past) rather than counter-inductively? Consider how this question looks in the simplest of our contexts, in which an agent being shown a binary sequence bit by bit aims to eventually be able to correctly predict each new bit on the basis of the bits seen so far. Here each method of learning succeeds on a countable dense subspace of the space of binary sequences. And all such subspaces are isomorphic (Sierpiński's Theorem again). So unless we impose more structure on our problem, we have parity between the set of possibilities in which an inductive extrapolator $m$ is successful and the set of possibilities in which a counter-inductive extrapolator $m^\dagger$ is successful. (Some formal learning theorists take the view that extrapolators prone to mind-changes are to be eschewed (thereby adding further structure to our sort of problem) and establish, in some contexts, a link between counter-inductive behaviour and mind changes—see, e.g., Kelly (2011) and Lin (2018).)

Of course, the results developed above presuppose that we are operating in an austere setting—one in which we countenance arbitrary (computable) data streams or data streams generated by sampling from arbitrary (computable) probability measures. In more tightly constrained settings, learning becomes tractable—e.g., if one knows that the data stream is generated by a Bernoulli measure, then it is a straightforward task to use the data to successfully estimate the relevant parameter. But this observation illustrates rather than undercuts the perspective of the preceding paragraphs, making the point that although a universal learning algorithm is an impossibility, learning becomes possible when sufficiently strong presuppositions are in play. Of course, one would ultimately like to know more about where the boundaries lie of the class of learning problems in which failure is typical and of the class of learning problems in which success is typically achievable. (See Fortnow et al. (1998) for some results of this kind for the problem of identification of sequences.)
The results developed above are absolute in the sense that they do not presuppose the choice of a privileged measure on Cantor space or on the space of probability measures on Cantor space. But most of them do depend on the choice of topology. For the results concerning (weak) learning of sequences by extrapolators, this is not very worrying. In the vast majority of applications in statistics, economics, and computer science, the space of binary sequences is equipped with the product topology. And with good reason: this topology can be thought of as the topology of point-wise convergence, and it can be motivated by thinking of binary sequences as encoding real numbers in the usual way. The situation is not quite as straightforward with the space of probability measures on Cantor space. Certainly, the weak topology is extremely natural—but it is only one of several natural options. So it is natural to wonder whether the intractability of our learning problems would hold under other reasonable choices of topology.

References
Angluin, Dana and Carl Smith (1983) "Inductive Inference: Theory and Methods." ACM Computing Surveys 15: 237–269.
Báez-Duarte, Luis (1970) "C(X)* and Kolmogorov's Consistency Theorem for Cantor Spaces." Studies in Applied Mathematics 49: 401–403.
Bārzdiņš, Jānis (1972) "Prognostication of Automata and Functions." In C.V. Frieman (ed.), Information Processing '71, North-Holland, volume 1, 81–84.
Bārzdiņš, Jānis and Rūsiņš Freivalds (1972) "On the Prediction of General Recursive Functions." Soviet Mathematics Doklady 13: 1224–1228.
Billingsley, Patrick (1999) Convergence of Probability Measures, second edition. Wiley.
Blackwell, David and Lester Dubins (1962) "Merging of Opinions with Increasing Information." The Annals of Mathematical Statistics 33: 882–886.
Blum, Lenore and Manuel Blum (1975) "Toward a Mathematical Theory of Inductive Inference." Information and Control 28: 125–155.
Carnap, Rudolf (1945) "On Inductive Logic." Philosophy of Science 12: 72–97.
Case, John and Carl Smith (1983) "Comparison of Identification Criteria for Machine Inductive Inference." Theoretical Computer Science 25: 193–220.
Chomsky, Noam (1965) Aspects of the Theory of Syntax. MIT Press.
Dasgupta, Abhijit (2014) Set Theory: With an Introduction to Real Point Sets. Springer.
Dekel, Eddie and Yossi Feinberg (2006) "Non-Bayesian Testing of a Stochastic Prediction." The Review of Economic Studies 73: 893–906.
Dubins, Lester and David Freedman (1964) "Measurable Sets of Measures." Pacific Journal of Mathematics 14: 1211–1222.
Fortnow, Lance, Rūsiņš Freivalds, William Gasarch, Martin Kummer, Stuart Kurtz, Carl Smith, and Frank Stephan (1998) "On the Relative Sizes of Learnable Sets." Theoretical Computer Science.
Freedman, David (1963) "On the Asymptotic Behavior of Bayes' Estimates in the Discrete Case." The Annals of Mathematical Statistics 34: 1386–1403.
Gold, E. Mark (1967) "Language Identification in the Limit." Information and Control 10: 447–474.
Hempel, Carl G. (1966) Philosophy of Natural Science. Prentice-Hall.
Ho, Yu-Chi and David Pepyne (2002) "Simple Explanation of the No-Free-Lunch Theorem and its Implications." Journal of Optimization Theory and Applications.
Hutter, Marcus (2007) "On Universal Prediction and Bayesian Confirmation." Theoretical Computer Science 384: 33–48.
Jeffreys, Harold (1933) "Probability, Statistics, and the Theory of Errors." Proceedings of the Royal Society of London. Series A.
Journal of the London Mathematical Society 85: 472–490.
Kalai, Ehud and Ehud Lehrer (1994) "Weak and Strong Merging of Opinions." Journal of Mathematical Economics 23: 73–86.
Kechris, Alexander (1995) Classical Descriptive Set Theory. Springer.
Kelly, Kevin (2011) "Simplicity, Truth, and Probability." In Prasanta Bandyopadhyay and Malcolm Forster (eds.), Philosophy of Statistics. Elsevier, 983–1024.
Koumoullis, George (1996) "Baire Category in Spaces of Measures." Advances in Mathematics.
Kuhn, Thomas (1963) The Structure of Scientific Revolutions. University of Chicago Press.
Lehrer, Ehud and Rann Smorodinsky (1996) "Merging and Learning." In Thomas Ferguson, Lloyd Shapley, and James MacQueen (eds.), Statistics, Probability and Game Theory: Papers in Honor of David Blackwell. Institute of Mathematical Statistics, 147–168.
Leike, Jan and Marcus Hutter (2015) "On the Computability of Solomonoff Induction and Knowledge-Seeking." In Kamalika Chaudhuri, Claudio Gentile, and Sandra Zilles (eds.), International Conference on Algorithmic Learning Theory, Springer, 364–378.
Li, Ming and Paul Vitányi (2019) An Introduction to Kolmogorov Complexity and its Applications, fourth edition. Springer.
Lin, Hanti (2018) "Modes of Convergence to the Truth: Steps Toward a Better Epistemology of Induction." Forthcoming in The Review of Symbolic Logic.
Lisagor, L.R. (1981) "The Banach-Mazur Game." Mathematics of the USSR-Sbornik.
Gottfried Wilhelm Leibniz: Philosophical Papers and Letters, second edition. D. Reidel.
Marczewski, Edward and Roman Sikorski (1949) "Remarks on Measure and Category." Colloquium Mathematicum 2: 13–19.
Mehlhorn, Kurt (1973) "On the Size of Sets of Computable Functions." In Ronald Book, Allan Borodin, Forbes Lewis, Amar Mukhopadhyay, Arnold Rosenberg, Raymond Strong, and Jeffrey Ullman (eds.), IEEE Computer Society Publications Office, 190–196.
Neumann, Peter (1985) "Automorphisms of the Rational World." Journal of the London Mathematical Society 2: 439–448.
Nies, André (2009) Computability and Randomness. Oxford University Press.
Noguchi, Yuichi (2015) "Merging with a Set of Probability Measures: A Characterization." Theoretical Economics 10: 411–444.
Oxtoby, John (1957) "The Banach-Mazur Game and Banach Category Theorem." In Melvin Dresher, Albert Tucker, and Philip Wolfe (eds.), Contributions to the Theory of Games, volume III. Princeton University Press, 159–163.
Oxtoby, John (1980) Measure and Category: A Survey of the Analogies between Topological and Measure Spaces, second edition. Springer.
Parthasarathy, Kalyanapuram (1967) Probability Measures on Metric Spaces. Academic Press.
Podnieks, Kārlis (1974) "Comparing Various Concepts of Function Prediction." Theory of Algorithms and Programs.
Putnam, Hilary (1963a) "'Degree of Confirmation' and Inductive Logic." In Paul Arthur Schilpp (ed.), The Philosophy of Rudolf Carnap. Open Court, 761–783.
Putnam, Hilary (1963b) Probability and Confirmation. US Information Agency.
Reimann, Jan (2008) "Effectively Closed Sets of Measures and Randomness." Annals of Pure and Applied Logic.
Ryabko, Daniil and Marcus Hutter (2007) "On Sequence Prediction for Arbitrary Measures." In Proceedings of the IEEE International Symposium on Information Theory, IEEE, 2346–2350.
Sandroni, Alvaro and Rann Smorodinsky (1999) "The Speed of Rational Learning." International Journal of Game Theory 28: 199–210.
Shalev-Shwartz, Shai and Shai Ben-David (2014) Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Solomonoff, Ray (1964) "A Formal Theory of Inductive Inference. I." Information and Control 7: 1–22.
Solomonoff, Ray (1978) "Complexity-Based Induction Systems: Comparisons and Convergence Theorems." IEEE Transactions on Information Theory 24: 422–432.
Sterkenburg, Tom (2019) "Putnam's Diagonal Argument and the Impossibility of a Universal Learning Machine." Erkenntnis 84: 633–656.
Szpilrajn, Edward (1934) "Remarques sur les fonctions complètement additives d'ensemble et sur les ensembles jouissant de la propriété de Baire." Fundamenta Mathematicae.
von Luxburg, Ulrike and Bernhard Schölkopf (2011) "Statistical Learning Theory: Models, Concepts, and Results." In Inductive Logic. Elsevier, 651–706.
Wolpert, David (2002) "The Supervised Learning No-Free-Lunch Theorems." In Rajkumar Roy, Mario Köppen, Seppo Ovaska, Takeshi Furuhashi, and Frank Hoffmann (eds.), Soft Computing and Industry. Springer, 25–42.
Wolpert, David and William Macready (1997) "No Free Lunch Theorems for Optimization." IEEE Transactions on Evolutionary Computation 1: 67–82.
Zenil, Hector (ed.) (2013) A Computable Universe: Understanding and Exploring Nature as Computation. World Scientific.
Zindulka, Ondřej (1999) "Killing Residual Measures." Journal of Applied Analysis 5: 223–238.