Identification of Probabilities
Paul M.B. Vitányi and Nick Chater
Abstract
Within psychology, neuroscience and artificial intelligence, there has been increasing interest in the proposal that the brain builds probabilistic models of sensory and linguistic input: that is, to infer a probabilistic model from a sample. The practical problems of such inference are substantial: the brain has limited data and restricted computational resources. But there is a more fundamental question: is the problem of inferring a probabilistic model from a sample possible even in principle? We explore this question and find some surprisingly positive and general results. First, for a broad class of probability distributions characterised by computability restrictions, we specify a learning algorithm that will almost surely identify a probability distribution in the limit given a finite i.i.d. sample of sufficient but unknown length. This is similarly shown to hold for sequences generated by a broad class of Markov chains, subject to computability assumptions. The technical tool is the strong law of large numbers. Second, for a large class of dependent sequences, we specify an algorithm which identifies in the limit a computable measure for which the sequence is typical, in the sense of Martin-Löf (there may be more than one such measure). The technical tool is the theory of Kolmogorov complexity. We analyse the associated predictions in both cases. We also briefly consider special cases, including language learning, and wider theoretical implications for psychology.

Keywords: learning, Bayesian brain, identification, computable probability, Markov chain, computable measure, typicality, strong law of large numbers, Martin-Löf randomness, Kolmogorov complexity
I. INTRODUCTION
Bayesian models in psychology and neuroscience postulate that the brain learns a generative probabilistic model of a set of perceptual or linguistic data ([12], [46], [51], [63]). Learning is therefore often
Vitányi is with the National Research Institute for Mathematics and Computer Science in the Netherlands (CWI) and the University of Amsterdam. Address: CWI, Science Park 123, 1098 XG, Amsterdam, The Netherlands. Email: [email protected]. Chater is with the Behavioural Science Group. Address: Warwick Business School, University of Warwick, Coventry, CV4 7AL, UK. Email: [email protected]. Chater was supported by ERC grant 295917-RATIONALITY, the ESRC Network for Integrated Behavioural Science [grant number ES/K002201/1], the Leverhulme Trust [grant number RP2012-V-022], and Research Councils UK Grant EP/K039830/1.
viewed as an inverse problem. Some aspect of the world is presumed to contain a probabilistic model, from which data is sampled; the brain receives a sample of such data, e.g., at its sensory surfaces, and has the task of inferring the probabilistic model. For example, the brain has to infer an underlying probability distribution from a sample from that distribution.

This theoretical viewpoint is implicit in a wide range of Bayesian models in cognitive science, which capture experimental data across many domains, from perception, to categorization, language, motor control, and reasoning (e.g., [11]). It is, moreover, embodied in a wide range of computational models of unsupervised learning in machine learning, computational linguistics, and computer vision (e.g., [1], [44], [65]). Finally, the view that the brain recovers probabilistic models from sensory data is both theoretically prevalent and has received considerable empirical support in neuroscience ([36]).

The idea that the brain may be able to recover a probabilistic process from a sample of data from that process is an attractive one. For example, a recovered probabilistic model might potentially be used to explain past input or to predict new input. Moreover, sampling from the recovered probabilistic model could be used in the generation of new data from that probabilistic process, from creating mental images [59] to producing language [13]. Thus, from a Bayesian standpoint, one should expect that the ability to perceive should go alongside the ability to create mental images; and the ability to understand language should go alongside the ability to produce language. The Bayesian approach is thus part of the broader psychological tradition of analysis-by-synthesis, for which there is considerable behavioural and neuroscientific evidence in perceptual and linguistic domains ([48], [65]).

Yet, despite its many attractions, the proposal that the brain recovers probabilistic processes from samples of data faces both practical and theoretical challenges. The practical challenges include the fact that the available data may be limited (e.g., children learn the probabilistic model of a highly complex language using only millions of words). Moreover, the brain faces severe computational constraints: even the limited amount of data encountered will be encoded imperfectly and may rapidly be lost ([17], [27]). The brain has limited processing resources to search and test the vast space of possible probabilistic models that might generate the data available.

In this paper we explore the conditions under which exactly inferring a probabilistic process from a stream of data is possible even in principle, with no restrictions on computational resources like time or storage, or on the availability of data. If it turns out that there is no algorithm that can learn a probabilistic structure from sensory or linguistic experience when no computational or data restrictions are imposed, then this negative result will still hold when more realistic settings are examined.

Our analysis differs from previous approaches to these issues by assuming that the probabilistic process to be inferred is, in a way that will be made precise later, computable.
Roughly speaking, the assumption is that the data to be analysed is generated by a process that can be modelled by a computer (e.g., a Turing machine or a conventional digital computer) combined with a source of randomness (for example, a fair coin that can generate a limitless stream of random 0s and 1s that could be fed into the computer). There are three reasons to suppose that this focus on computable processes is interesting and not overly restrictive. First, some influential theorists have argued that all physical processes are computable in this, or stricter, senses (e.g., [20]). Second, most cognitive scientists assume that the brain is restricted to computable processes, and hence can only represent computable processes (e.g., [55]). According to this assumption, if it turns out that some aspects of the physical world are uncomputable, these will trivially be unlearnable simply because they cannot be represented; and, conversely, all aspects of learning of relevance to psychology, i.e., all aspects of the world that the brain can successfully learn, will be within the scope of our analysis. Third, all existing models of learning in psychology, statistics and machine learning are computable (and, indeed, are actually implemented on digital computers) and fall within the scope of the present results.
A. Background: Pessimism about learnability
Within philosophy of science, cognitive science, and formal learning theory, a variety of considerations appear to suggest that negative results are likely. For example, in the philosophy of science it is often observed that theory is underdetermined by data ([21], [53]): that is, an infinite number of theories is compatible with any finite amount of data, however large. After all, these theories can all agree on any finite data set, but diverge concerning any of the infinitely large set of possible data that has yet to be encountered. This might appear to rule out identifying the correct theory—and hence, a fortiori, identifying a correct probability distribution.

Cognitive science inherits such considerations, to the extent that the learning problems faced by the brain are analogous to those of inferring scientific theories (e.g., [26]). But cognitive scientists have also amplified these concerns, particularly in the context of language acquisition. Consider, for example, the problem of acquiring language from positive evidence alone, i.e., from hearing sentences of the language, but with no feedback concerning whether the learner's own utterances are grammatical or not (so-called negative evidence). It is often assumed that this is, to a good approximation, the situation faced by the child. This is because some and perhaps all children receive little useful feedback on their own utterances and ignore such feedback even when it is given ([7]). Yet, even without negative evidence, children nonetheless learn their native language successfully. For example, an important textbook on language acquisition [19] repeatedly emphasises that the child cannot learn restrictions on grammatical rules from experience—and that these must therefore somehow arise from innate constraints. For example, the English sentences which team do you want to beat, which team do you wanna beat, and which team do you want to win are all acceptable, which would seem naturally to imply that *which team do you wanna win is also a grammatical sentence. As indicated by the asterisk, however, this sentence is typically rejected as ungrammatical by native speakers. According to classical linguistic theory (e.g., [15]), the contraction to wanna is not possible because it is blocked by a "gap" indicating a missing subject—a constraint that has sometimes been presumed to follow from an innate universal grammar [14].

The problem with learning purely from positive evidence is that an overgeneral hypothesis, which does not include such restrictions, will be consistent with new data; given that languages are shot through with exceptions and restrictions of all kinds, this appears to provide a powerful motivation for linguistic nativism [14]. But this line of argument cannot be quite right, because many exceptions are entirely capricious and could not possibly follow from innate linguistic principles. For example, the grammatical acceptability of I like singing, I like to sing, and
I enjoy singing would seem to imply, wrongly, the acceptability of *I enjoy to sing. But the difference between the distributional behaviour of the verbs like and enjoy cannot stem from any innate grammatical principles. The fact that children are able to learn restrictions of this type, and the fact that they are so ubiquitous throughout language, has even led some scholars to speak of the logical problem of language acquisition ([3], [30]).

Similarly, in learning the meaning of words, it is not clear how, without negative evidence, the child can successfully retreat from overgeneralization. If the child initially proposes that, for example, dog refers to any animal, or that mummy refers to any adult female, then further examples will not falsify this conjecture. In word learning and categorization, and in language acquisition, researchers have suggested that one potential justification for overturning an overgeneral hypothesis is that absence-of-evidence can sometimes be evidence-of-absence ([32], [28]). That is, the absence of people using the word dog when referring to cats or mice, and the absence of
Mummy being used to refer to other female friends or family members, might lead the child to doubt their liberal use of these terms. But, of course, this line of reasoning is not straightforward—for example, when learning any category that may apply in an infinite number of situations, the overwhelming majority of these will not have been encountered. It is not immediately clear how the child can tell the difference between benign, and genuinely suspicious, absence of evidence. The present results show that there is an algorithm that, under fairly broad conditions, can deal successfully with overgeneralization with probability 1, given sufficient data and computation time.

Previous results in the formal analysis of language learnability have reached more pessimistic conclusions, using different assumptions ([25], [33]). For example, as quoted in [49], the pioneer of formal learning theory E. M. Gold points out that "the problem with [learning only from] text is that if you guess too large a language, the sample will never tell you you're wrong" ([25], p. 461). This is true if we allow very few assumptions about the structure of the text—and indeed negative results in this area frequently depend on demonstrating the existence of texts (i.e., samples of the language) with rather unnatural behavior precisely designed to mislead any putative learner. We shall see below that realistic, though still quite mild, assumptions are sufficient to yield the opposite conclusion: that probability distributions, including probability distributions over languages, can be identified from positive instances alone.
B. Preview and examples
Consider, first, the case of independent, identical draws from a probability distribution. In many areas of psychology, the learning task is viewed as abstracting some pattern from a series of independent trials rather than picking up sequential regularities (although the i.i.d. assumption is not necessarily explicit). The i.i.d. case is relevant to problems as diverse as classical conditioning ([56], where a joint distribution between conditioned and unconditioned stimuli must be acquired), category learning ([60], where a joint distribution of category instances and labels is the target), and artificial grammar learning or artificial language learning ([54], [57], where a probability distribution over strings of letters or sounds is to be learned). Similarly, the i.i.d. assumption is often implicit in learning algorithms in cognitive science and machine learning, such as, for example, many Bayesian and neural network models in perception, learning and categorization (e.g., [1]).

Learning such potentially complex patterns from examples may seem challenging. Yet even analysing perhaps the simplest case, learning the probability distribution of a biased coin, is not straightforward. For concreteness, consider flipping a coin, with probability p of coming up heads. Suppose that we can flip the coin endlessly, and can, at every point as the sequence of data emerges, guess the value of p; we can change our mind as often as we like. It is natural to wonder whether there is some procedure for guessing such that, after some point, we stick to our guess—and that this guess is, either certainly or with high probability, correct. So, for example, if the coin is a fair coin, such that p = 0.5, will we eventually lock on to the conjecture that the coin is fair and, after some point, never change this conjecture however much data we receive?

The answer is by no means obvious, even for such a simple case. After all, the difference between the number of heads and tails will fluctuate, and can grow arbitrarily large—and such fluctuations might persuade us, wrongly, that the coin is biased in favour of, or against, heads. How sure can we be that, eventually, we will successfully identify the precise bias of a coin that is biased, e.g., where p = 3/… or p = 1/…?

Or, to step up the level of complexity very considerably, consider the problem of inferring a stochastic phrase structure grammar from an indefinitely large sample of i.i.d. sentences generated from that grammar. Or suppose the input is a sequence of images generated by draws from a probabilistic image model such as a Markov random field—can a perceiver learn to precisely identify the probabilistic model of the image, given sufficient data?

As we shall see in Section III, below, remarkably, it turns out that, with fairly mild restrictions (a restricted form of computability), with probability 1, it is possible to infer in the limit the correct probability distribution exactly, given a sufficiently large finite supply of i.i.d. samples. Moreover, it is possible to specify a computable algorithm that will reliably find this probability distribution. A similar result holds for ergodic Markov chains, which broadens its application considerably.

This result is unexpectedly strong, given mild restrictions on computability (which we describe in detail below). In particular, it shows that there is no logical problem concerning the possibility of learning languages, or other patterns, which contain exceptions, from positive evidence alone.
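To make the coin example concrete, the following Python sketch simulates the kind of guessing procedure analysed below (and in Appendix D): the learner runs through an enumerated list of candidate biases and, after each flip, guesses the first candidate whose value lies within a shrinking fluctuation band √((ln n)/n) of the observed frequency of heads. The candidate list, the band, and all names here are illustrative assumptions, not part of the formal development; the point is only that the guess almost surely stabilises on the true bias when that bias appears in the list.

import math
import random

def identify_coin_bias(flips, candidates):
    """Yield, after each flip, the first candidate bias consistent with
    the observed frequency of heads up to a fluctuation band sqrt(ln n / n).
    Illustrative sketch only; 'candidates' plays the role of the enumerated list."""
    heads = 0
    for n, flip in enumerate(flips, start=1):
        heads += flip  # flip is 1 for heads, 0 for tails
        freq = heads / n
        band = math.sqrt(math.log(n + 1) / n)
        guess = next((q for q in candidates if abs(q - freq) <= band), candidates[0])
        yield guess

# Usage: a fair coin (p = 0.5) and a short candidate list containing it.
random.seed(0)
flips = (1 if random.random() < 0.5 else 0 for _ in range(100000))
candidates = [1/3, 1/2, 2/3, 3/4]
guesses = list(identify_coin_bias(flips, candidates))
print(guesses[99], guesses[999], guesses[-1])  # early guesses may vary; later ones stick at 0.5

Any candidate unequal to the true bias is eventually excluded forever (the band shrinks to 0 while the sample frequency converges to p), so the guess eventually stops changing; but, as in the theorems below, we never know at which flip that has happened.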
As noted above, it has been influentially argued in linguistics and the study of language acquisition that exceptions (examples that are not possible) cannot be learned purely by observing their non-occurrence, because there are, after all, infinitely many linguistic forms which are possible but have not been observed (e.g., [19]).

(A stochastic phrase structure grammar is a conventional phrase structure grammar, with probabilities associated with each of the rewrite rules. For example, a noun phrase might sometimes expand to give a determiner followed by a noun, while sometimes expanding to give a single proper noun; and individual grammatical categories, such as proper nouns, map probabilistically onto specific proper nouns.)

A second, more general, case concerns a sample generated by a computable dependent probabilistic process (this will be made precise below). Unlike the i.i.d. or Markov case, different such processes could have generated this sample; but it turns out that, given a finite sample that is long enough and that is guaranteed to be the initial segment of an infinite typical output of one of those computable dependent probabilistic processes, it is possible to infer a single process exactly (out of a number of such processes) according to which that sample is an initial segment of an infinite typical sample. We shall discuss these issues in Section IV.

Throughout this paper, we focus on learning probabilities themselves, rather than particular representations of probabilities. If there is at least one computer program representing a function, there are, of course, infinitely many such programs (representing the data in slightly different ways, incorporating additional null operations, and so on). The same is true for programs representing probability distributions. For some purposes, these differences in representation may be crucial. For example, psychologists and linguists may be interested in which of an infinite number of equivalent grammars—from the point of view of the sentences allowed—is represented by the brain. But, from the point of view of the problem of learning, we must treat them as equivalent. Indeed, it is clear that no learning method of probabilities alone could ever distinguish between models which generate precisely the same probability distribution over possible observations.

Our discussion begins with an introduction of our formal framework, in the next section. We then turn to the case of i.i.d. draws from a computable mass function, and to runs of a computable ergodic Markov chain, using the strong law of large numbers as the main technical tool.
The next section, Computable Measures, considers learning material with computable sequential dependencies; here the main technical tool is Kolmogorov complexity theory. We then briefly consider whether these results have implications for the problem of predicting future data, based on past data, before we draw brief conclusions. The mathematical details and detailed proofs are relegated to the Appendices.

II. THE FORMAL FRAMEWORK
We follow in the general theoretical tradition of formal learning theory, where we abstract away from specific representational questions, and focus on the underlying abstract structure of the learning problem. One can associate the natural numbers with a lexicographic length-increasing ordering of finite strings over a finite alphabet. A natural number corresponds to the string of which it is the position in the thus established order. Since a language is a set of sentences (finite strings over a finite alphabet), it can be viewed as a subset of the natural numbers. (In the same way, natural numbers could be associated with images or instances of a concept.) The learnability of a language under various computational assumptions is the subject of an immensely influential approach in [24] and especially [25], and of the review [33]. But surely in the real world the chance of one sentence of a language being used differs from that of another one. For example, in general short sentences have a larger chance of turning up than very long sentences. Thus, the elements of a given language are distributed in a certain way. There arises the problem of identifying or approximating this distribution.

Our model is formulated as follows: we are given a sufficiently long finite sequence of data consisting of elements drawn from the set (language) according to a certain probability, and the learner has to identify this probability. In general, however much data has been encountered, there is no point at which the learner can announce a particular probability as correct with certainty. Weakening the learning model, the learner might learn to identify the correct probability in the limit. That is, perhaps the learner might make a sequence of guesses, finally locking on to the correct probability and sticking to it forever—even though the learner can never know for sure that it has identified the correct probability successfully. We shall consider identification in the limit (following, for example, [25], [33], [49]). Since this is not enough, we additionally restrict the type of probability.

In conventional statistics, probabilistic models are typically idealized as having continuous-valued parameters; and hence there is an uncountable number of possible probabilities. In general it is impossible for a learner to make a sequence of guesses that precisely locks on to the correct values of continuous parameters. In the realm of algorithmic information theory, in particular in Solomonoff induction [61] and here, we reason as follows. The possible strategies of learners are computable in the sense of Turing [64], that is, they are computable functions. The set of these is discrete and thus countable. The hypotheses that can be learned are therefore countable, and in particular the set of probabilities from which the learner chooses must be computable. Indeed, this argument can be interpreted as showing that the fundamental problem is one of representation: the overwhelming majority of real-valued parameters cannot be represented by any computable strategy; and hence a fortiori cannot possibly be learned.

Our starting point is that it is only of interest to consider the identifiability of computable hypotheses—because hypotheses that are not computable cannot be represented, let alone learned. Making this precise requires specifying what it means for a probability distribution to be computable. Moreover, it turns out that computability is not enough; it is also necessary that the considered set of computable probabilities is a computably enumerable (c.e.)
or co-computably enumerable (co-c.e.) set; these notions are explained in Appendix A. Informally, a subset of a set is c.e. if there is a computer which enumerates all the elements of the subset but no element outside the subset (but in the set). For example, the computable probability mass functions (or computable measures) for which we know algorithms computing them can be computably enumerated in lexicographic order of the algorithms. Hence they satisfy Theorem 1 (or Theorem 2). A subset is co-c.e. if all elements outside the subset (but in the set) can be enumerated by a computer. In our case the set comprises all computable probability mass functions, respectively, all computable measures. Since by Lemma 1 in Appendix A this set is not c.e., a subset that is c.e. (or co-c.e.) is a proper subset; that is, it does not contain all computable probability mass functions, respectively, all computable measures.

In the exposition below, we consider two cases. In case 1 the data are drawn independent identically distributed (i.i.d.) from a subset of the natural numbers according to a probability mass function in a c.e. or co-c.e. set of computable probability mass functions, or consist of a run of a member of a c.e. or co-c.e. set of computable ergodic Markov chains. For this case, there is, as we have noted, a learning algorithm that will almost surely identify a probability distribution in the limit. This is the topic of Section III, below.

In case 2 the elements of the infinite sequence are dependent and the data sequence is typical for a measure from a c.e. or co-c.e. set of computable measures. For this more general case, we prove a weaker, though still surprising, result: that there is an algorithm which identifies in the limit a computable measure for which that sequence is typical (in the sense introduced by Martin-Löf). These results are the focus of Section IV, below.
A. Preliminaries
Let N, Q, R, and R^+ denote the natural numbers, the rational numbers, the real numbers, and the nonnegative real numbers, respectively. We say that we identify a function f in the limit if we have an algorithm which produces an infinite sequence f_1, f_2, . . . of functions and f_i = f for all but finitely many i. This corresponds to the notion of "identification in the limit" in [25], [33], [49], [66]. In this notion, at every step an object is produced, and after a finite number of steps the target object is produced at every step. However, we do not know this finite number. It is as if you ask directions and the answer is "at the last intersection turn right," but you do not know which intersection is last. The functions f we want to identify in the limit are probability mass functions, Markov chains, or measures.
DEFINITION 1: Let L ⊆ N and its associated probability mass function p be a function p : L → R^+ satisfying ∑_{x∈L} p(x) = 1. A Markov chain is an extension as in Definition 2. A measure µ is a function µ : L^∗ → R^+ satisfying the measure equalities in Appendix C.

B. Related work
In [2] (citing previous more restricted work) a target probability mass function was identified in the limit when the data are drawn i.i.d. in the following setting. Let the target probability mass function p be an element of a list q_1, q_2, . . . subject to the following conditions: (i) every q_i : N → R^+ is a probability mass function; (ii) we exhibit a computable total function C(i, x, ε) = r such that |q_i(x) − r| ≤ ε, where r, ε > 0 are rational numbers. That is, there exists a rational-number approximation for all probability mass functions in the list up to arbitrary precision, and we give a single algorithm which for each such function exhibits such an approximation. The technical means used are the law of the iterated logarithm and the Kolmogorov-Smirnov test. However, the list q_1, q_2, . . . cannot contain all computable probability mass functions, because of a diagonal argument (Lemma 1).

In [4] computability questions are apparently ignored. The Conclusion states "If the true density [and hence a probability mass function] is finitely complex [it is computable] then it is exactly discovered for all sufficiently large sample sizes." The tool that is used is estimation according to min_q ( L(q) + log(1/∏_{i=1}^n q(X_i)) ). Here q is a probability mass function, L(q) is the length of its code, and q(X_i) is the q-probability of the i-th random variable X_i. To be able to minimize over the set of computable q's, one has to know the L(q)'s. If the set of candidate distributions is countably infinite, then we can never know when the minimum is reached—hence at best we then have identification in the limit. If L(q) is identified with the Kolmogorov complexity K(q), as in Section IV of this reference, then it is incomputable, as already observed by Kolmogorov in [39] (for the plain Kolmogorov complexity; the case of the prefix Kolmogorov complexity K(q) is the same). Computable L(q) (given q) cannot be computably enumerated; if they were, this would constitute a computable enumeration of computable q's, which is impossible by Lemma 1. To obtain the minimum we require a computable enumeration of the L(q)'s in the estimation formula. The results hold (contrary to what is claimed in the Conclusion of [4] and other parts of the text) not for the set of computable probability mass functions, since they are not c.e. The sentence "you know but you don't know you know" on the second page of [4] does not hold for an arbitrary computable probability mass function.

In reaction to an earlier version of this paper with too strong claims, as described in Appendix E, in [6] it is shown that it is impossible to identify an arbitrary computable probability mass function (or measure) in the limit given an infinite sequence of elements from its support (which sequence is guaranteed to be typical for some computable measure in the measure case).
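For concreteness, the following Python sketch shows the two-part estimator discussed above, min_q (L(q) + log 1/∏_i q(X_i)), in the unproblematic situation in which the candidate list is finite and the code lengths L(q) are simply given; the candidate set, the code lengths and the function names are assumptions made only for illustration, and the sketch does not address the enumeration issues raised in the text.

import math

def two_part_score(code_length, pmf, sample):
    """L(q) + log2(1 / prod_i q(x_i)): code length of q plus the length
    of the data encoded with q, both in bits."""
    return code_length + sum(-math.log2(pmf.get(x, 1e-12)) for x in sample)

def mdl_estimate(candidates, sample):
    """Return the candidate minimising the two-part score.
    'candidates' is a list of (code_length, pmf) pairs, both assumed known."""
    return min(candidates, key=lambda c: two_part_score(c[0], c[1], sample))

# Usage with two hypothetical candidates over the alphabet {0, 1}.
candidates = [(3.0, {0: 0.5, 1: 0.5}), (5.0, {0: 0.9, 1: 0.1})]
sample = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
print(mdl_estimate(candidates, sample)[1])  # selects the biased candidate for this sample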
C. Results
The set of halting algorithms for computable probabilities (or measures) is not c.e. (Lemma 1 in Appendix A). This complicates the algorithms and the analysis of the results. In Section III there is a computable probability mass function (the target) on a set of natural numbers. We are given a sufficiently long finite sequence of elements of this set that are drawn i.i.d., and we are asked to identify the target. An algorithm is presented which identifies the target in the limit almost surely, provided the target is an element of a c.e. or co-c.e. set of halting algorithms for computable probability mass functions (Theorem 1). This also underpins the result announced in [31] (Theorem 1 in the Appendix of that reference, and the appeals to it in its main text), with the following modification: "computable probabilities" needs to be replaced by "c.e. and co-c.e. sets of computable probabilities". If the target is an element of a c.e. or co-c.e. set of computable ergodic Markov chains, then there is an algorithm which takes as input a sequence of states of a run of the Markov chain and almost surely outputs the target (Corollary 1). The technical tool is in both cases the strong law of large numbers. In Section IV the underlying set of natural numbers is again infinite, and the elements of the sequence are allowed to be dependent. We are given a guarantee that the sequence is typical (Definition 4) for at least one measure from a c.e. or co-c.e. set of halting algorithms for computable measures. There is an algorithm which identifies in the limit a computable measure for which the data sequence is typical (Theorem 2). The technical tool is the Martin-Löf theory of sequential tests [45] based on Kolmogorov complexity. In Section V we consider the associated predictions, and in Section VI we give conclusions. In Appendix A we review the computability notions used, in Appendix B we review notions of Kolmogorov complexity, and in Appendix C we review the measure and computability notions used. We defer the proofs of the theorems to Appendix D. In Appendix E we give the tortuous genesis of the results.
III. COMPUTABLE PROBABILITY MASS FUNCTIONS AND I.I.D. DRAWING
To approximate a probability in the i.i.d. setting is a well-known and easy example to illustrate our problem. One does this by an algorithm computing the probability p(a) in the limit for all a ∈ L ⊆ N, given an infinite sequence x_1, x_2, . . . of data drawn i.i.d. from L according to p. Namely, for n = 1, 2, . . . , for every a ∈ L occurring in x_1, x_2, . . . , x_n, set p_n(a) equal to the frequency of occurrences of a in x_1, x_2, . . . , x_n. Note that the different values of p_n sum to precisely 1 for every n = 1, 2, . . . . The output is a sequence p_1, p_2, . . . of probability mass functions such that lim_{n→∞} p_n = p almost surely, by the strong law of large numbers (see Claim 1). The probability mass functions considered here consist of all probability mass functions on L—computable or not. The probability mass function p is thus represented by an approximation algorithm.

In this paper we deal only with computable probability mass functions. If p is computable then it can be represented by a halting algorithm which computes it, as defined in Appendix A. Most known probability mass functions are computable provided their parameters are computable. In order that it is computable, we only require that the probability mass function is finitely describable and that there is a computable process producing it [64].

One issue is how short the code for p is; a second issue is the computability properties of the code for p; a third issue is how much of the data sequence is used in the learning process. The approximation of p above results in a sequence of codes of probabilities p_1, p_2, . . . which are lists of the sample frequencies in an initial finite segment of the data sequence. The code length of the list of frequencies representing p_n usually grows to infinity as the length n of the segment grows to infinity. The learning process involved uses all of the data sequence, and the result is an encoding of the sample frequencies in the data sequence in the limit. The code for p is usually infinite. This holds as well if p is computable. Such an approximation contrasts with identification in the following.
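The approximation procedure just described is simply empirical-frequency estimation. A minimal Python sketch, assuming a stream of observations from a countable set, is given below; p_n is represented as a dictionary of sample frequencies, whose size (the "code length") can grow without bound as new elements appear.

from collections import Counter

def frequency_approximation(stream):
    """Yield p_1, p_2, ... where p_n(a) is the relative frequency of a in x_1..x_n.
    Each p_n sums to 1; p_n converges to p almost surely by the strong law of large numbers."""
    counts = Counter()
    for n, x in enumerate(stream, start=1):
        counts[x] += 1
        yield {a: c / n for a, c in counts.items()}

# Usage on a short hypothetical sample drawn from L = {1, 2, 3}.
sample = [1, 2, 1, 1, 3, 1, 2, 1]
for p_n in frequency_approximation(sample):
    pass
print(p_n)  # {1: 0.625, 2: 0.25, 3: 0.125}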
THEOREM 1: I.I.D. COMPUTABLE PROBABILITY IDENTIFICATION.
Let L be a set of natural numbers and p be a probability mass function on L. This p is described by an element of a c.e. or co-c.e. set of halting algorithms for computable probability mass functions. There is an algorithm identifying p in the limit almost surely from an infinite sequence x_1, x_2, . . . of elements of L drawn i.i.d. according to p. The code for p via an appropriate Turing machine is finite. The learning process uses only a finite initial segment of the data sequence and takes finite time.

We do not know how large the finite items in the theorem are. The proof of the theorem is deferred to Appendix D. The intuition is as follows. By assumption the target probability mass function is a member of a list A of halting algorithms for computable probability mass functions. By the strong law of large numbers we can approximate the target probability mass function by the sample means. Since the members of A are linearly ordered, we can after each new sample compute the least member which agrees best, according to a certain criterion, with the samples produced thus far. At some stage this least element does not change any more.
EXAMPLE 1: Since the c.e. and co-c.e. sets strictly contain the computable sets, Theorem 1 is strictly stronger than the result in [2] referred to in Section II-B. It is also strictly stronger than [4], which does not give identification in the limit for classes of computable functions.

Define the primitive computable probability mass functions as the set of probability mass functions for which it is decidable that they are constructed from primitive computable functions. Since this set is computable, it is c.e. The theorem shows that identification in the limit is possible for members of this set. Define the time-bounded probability mass functions, for any fixed computable time bound, as the set of elements for which it is decidable that they are computable probability mass functions satisfying this time bound. Since this set is computable, it is c.e. Again, the theorem shows that identification in the limit is possible for elements from this set.

Another example is as follows. Let L = {a_1, a_2, . . . , a_n} be a finite set. The primitive recursive functions f_1, f_2, . . . are c.e. Hence the probability mass functions p_1, p_2, . . . on L defined by p_i(a_j) = f_i(j) / ∑_{h=1}^n f_i(h) are also c.e. Let us call these probability mass functions simple. By Theorem 1 they can be identified in the limit. ♦

The class of probability mass functions to which the present result applies is very broad. Suppose, for example, that we frame the problem of language acquisition in the following terms: a corpus is created by i.i.d. sampling from some primitive recursive language generation mechanism (for example, a stochastic phrase structure grammar [9] with rational probabilities, or an equivalent, but more cognitively motivated, formalism such as tree-adjoining grammar [34] or combinatory categorial grammar [62]). Then the present result implies that there is a learning algorithm that identifies in the limit, with probability 1, the probability mass function according to which these sentences are generated. That is, the algorithm described here will search possible programs which correspond to generators of grammars, and will eventually find, and never change from, a stochastic grammar that precisely captures the probability mass function that generated the linguistic data. Of course, there may, in general, within any reasonably rich stochastic grammar formalism, be many ways of representing the probability distribution over possible sentences (just as there are many computer programs that code for the same function). No learning process can distinguish between these, precisely because they are, by assumption, precisely equivalent in their predictions. Hence, an appropriate goal of learning can only be to find the underlying probability mass function, rather than attempting the impossible task of inferring the particular representation of that function.

The result applies, of course, not just to language but to learning structure in perceptual input, such as visual images. Suppose that a set of visual images is created by i.i.d. sampling from a Markov random field with rational parameters [43]; then there will be a learning algorithm which identifies in the limit, with probability 1, the probability distribution over these images. The result applies, also, to the unsupervised learning of environmental structure from data, for example by connectionist learning methods [1] or by Bayesian learning methods ([12], [47], [63]).
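The proof intuition sketched after Theorem 1 (choose, after each sample, the least-indexed candidate that best agrees with the sample frequencies) can be written down directly when the candidate list is given explicitly. The Python sketch below does this for a finite alphabet; the agreement criterion (maximum deviation from the sample frequencies, compared against a shrinking threshold) and all names are illustrative assumptions standing in for the precise criterion of Appendix D.

import math
import random

def identify_in_the_limit(stream, candidates, alphabet):
    """Yield, after each observation, the index of the least candidate pmf whose
    probabilities all lie within sqrt(ln n / n) of the sample frequencies
    (defaulting to index 0 if none fits). Illustrative stand-in for Theorem 1."""
    counts = {a: 0 for a in alphabet}
    n = 0
    for x in stream:
        n += 1
        counts[x] += 1
        band = math.sqrt(math.log(n + 1) / n)
        chosen = 0
        for i, q in enumerate(candidates):
            if all(abs(q[a] - counts[a] / n) <= band for a in alphabet):
                chosen = i
                break
        yield chosen

# Usage: data generated from candidate 1; the yielded index eventually sticks at 1.
random.seed(1)
candidates = [{'a': 0.5, 'b': 0.5}, {'a': 0.8, 'b': 0.2}, {'a': 0.2, 'b': 0.8}]
stream = (random.choices(['a', 'b'], weights=[0.8, 0.2])[0] for _ in range(50000))
print(list(identify_in_the_limit(stream, candidates, ['a', 'b']))[-1])  # 1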
A. Markov chains
I.i.d. draws from a probability mass function are a special case of a run of a discrete Markov chain. We investigate which Markov chains have an equivalent of the strong law of large numbers. Theorem 1 then holds mutatis mutandis for these Markov chains. First we need a few definitions.
DEFINITION 2: A sequence of random variables (X_t)_{t=0}^∞ with outcomes in a finite or countable state space S ⊆ N is a discrete time-homogeneous Markov chain if for every ordered pair i, j of states the quantity q_{i,j} = Pr(X_{t+1} = j | X_t = i), called the transition probability from state i to state j, is independent of t. If M is such a Markov chain then its associated transition matrix Q is defined as Q := (q_{i,j})_{i,j∈N}. The matrix Q is non-negative and its row sums are all unity. It is infinite-dimensional when the number of states is infinite.

In the sequel we simply speak of "Markov chains" and assume they satisfy Definition 2.
DEFINITION 3: A Markov chain M is ergodic if it has a stationary distribution π = (π_x)_{x∈S} satisfying πQ = π, and for every distribution σ ≠ π we have σQ ≠ σ. This stationary distribution π satisfies π_x > 0 for all x ∈ S and ∑_{x∈S} π_x = 1. With X_t being the state of the Markov chain at epoch t, starting from X_0 = x_0 ∈ S, we have

lim_{n→∞} (1/n) ∑_{t=1}^n X_t = E_π[X] = ∑_{x∈S} π_x · x,   (III.1)

approximating theoretical means by sample means. An ergodic Markov chain is computable if its transition probabilities and stationary distribution are computable.
COROLLARY 1: IDENTIFICATION OF COMPUTABLE ERGODIC MARKOV CHAINS.
Consider a c.e. or co-c.e. set of halting algorithms for computable ergodic Markov chains. Let M be an element of this set. There is an algorithm identifying M in the limit almost surely from an infinite sequence x_1, x_2, . . . of states of M produced by a run of M. The code for M via an appropriate Turing machine is finite. The learning process uses only a finite initial segment of the data sequence and takes finite time.
EXAMPLE 2: Let M be an ergodic Markov chain with a finite set S of states. There exists a unique distribution π over S with strictly positive probabilities such that lim_{s→∞} q^s_{i,j} = π_j for all states i and j. In this case we have that π_0 Q^t → π pointwise as t → ∞, and the limit is independent of the initial distribution π_0. The stationary distribution π is the unique vector satisfying πQ = π with ∑_i π_i = 1. (Necessary and sufficient conditions for ergodicity are that the chain should be irreducible, that is, for each pair of states i, j there is an s ∈ N such that q^s_{i,j} > 0 (state j can be reached from state i in a finite number of steps); and aperiodic, that is, gcd{s : q^s_{i,j} > 0} = 1 for all i, j ∈ S.)

The equation πQ = π is a system of N linear equations in N unknowns (the entries π_j). We can solve for the unknowns by elimination of variables: in the first equation express one variable in terms of the others; substitute the expression into the remaining equations; repeat this process until the last equation; solve it and then back-substitute until the total solution is found. Since π is unique, the system of linear equations has a unique solution. If the original entries of Q are computable, then this process keeps the entries of π computable as well. Therefore, if the transition probabilities of the Markov chain are computable, then the stationary distribution π is a computable probability mass function. We now invoke the Ergodic Theorem approximating theoretical means by sample means [23], [40] as in (III.1). ♦
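As a concrete illustration of the computation described in Example 2, the short Python sketch below solves πQ = π together with ∑_i π_i = 1 for a small hypothetical transition matrix, by exactly the kind of linear elimination mentioned above (here over the rationals, so that computable entries stay exactly representable). The specific matrix is an assumption chosen only for illustration.

from fractions import Fraction

def stationary_distribution(Q):
    """Solve pi Q = pi together with sum(pi) = 1 by Gaussian elimination
    over the rationals; Q is a row-stochastic matrix of Fractions."""
    n = len(Q)
    # Equations: sum_j (Q[j][i] - delta_ij) pi_j = 0 for i = 0..n-2, plus sum_j pi_j = 1.
    A = [[Q[j][i] - (1 if i == j else 0) for j in range(n)] for i in range(n)]
    A[n - 1] = [Fraction(1)] * n
    b = [Fraction(0)] * (n - 1) + [Fraction(1)]
    for col in range(n):                              # forward elimination
        piv = next(r for r in range(col, n) if A[r][col] != 0)
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            A[r] = [x - f * y for x, y in zip(A[r], A[col])]
            b[r] -= f * b[col]
    pi = [Fraction(0)] * n                            # back substitution
    for r in range(n - 1, -1, -1):
        pi[r] = (b[r] - sum(A[r][c] * pi[c] for c in range(r + 1, n))) / A[r][r]
    return pi

# Usage: a hypothetical 3-state ergodic chain with rational transition probabilities.
Q = [[Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)],
     [Fraction(1, 3), Fraction(1, 3), Fraction(1, 3)],
     [Fraction(0), Fraction(1, 2), Fraction(1, 2)]]
print(stationary_distribution(Q))  # [1/4, 3/8, 3/8], an exact rational stationary distribution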
IV. COMPUTABLE MEASURES
In the i.i.d. case we dealt with a process where the future is independent of the present and the past; in the Markov case we relaxed this independence so that the immediate future is determined by the present but not by the past of too long ago. What can be shown if we drop the assumption of independence altogether? Then we go to measures as defined in Appendix C. As far as the authors are aware, for general measures there exists neither an approximation as in Section III nor an analog of the strong law of large numbers. However, there is a notion of typicality of an infinite data sequence for a computable measure in the Martin-Löf theory of sequential tests [45] based on Kolmogorov complexity, and this is what we use.

Let L ⊆ N and let µ be a measure on L^∞ in a c.e. or co-c.e. set of halting algorithms for computable measures. In this paper, instead of the common notation µ(Γ_x), we use the simpler notation µ(x). We are given a sequence in L^∞ which is typical (Definition 4 in Appendix C) for µ. The constituent elements of the sequence are possibly dependent. The set of typical infinite sequences of a computable measure µ has µ-measure one, and each typical sequence passes all computable tests for µ-randomness in the sense of Martin-Löf. This probability model is much more general than i.i.d. drawing according to a probability mass function. It includes stationary processes, ergodic processes, Markov processes of any order, and many other models. In particular, this probability model includes many of the models used in mathematical psychology and cognitive science.
THEOREM 2: COMPUTABLE MEASURE IDENTIFICATION.
Let L be a set of natural numbers. We are given an infinite sequence of elements from L, and this sequence is guaranteed to be typical for at least one measure in a c.e. or co-c.e. set of halting algorithms for computable measures. There is an algorithm which identifies in the limit (certainly) a computable measure in the c.e. or co-c.e. set above for which the sequence is typical. The code for this measure is an appropriate Turing machine and is finite. The learning process takes finite time and uses only a finite initial segment of the data sequence.

The proof is deferred to Appendix D. We give an outline of the proof of Theorem 2. Let B be a list of a c.e. or co-c.e. set of halting algorithms for computable measures. Assume that each measure occurs infinitely many times in B. For a measure µ in the list B define σ(j) = log 1/µ(x_1 . . . x_j) − K(x_1 . . . x_j). By (A.2) in Appendix C, the data sequence x_1, x_2, . . . is typical for µ iff sup_j σ(j) = σ < ∞. By assumption there exists a measure in B for which the data sequence is typical. Let µ_h be such a measure, with corresponding value σ_h. Since halting algorithms for µ_h occur infinitely often in the list B, there is a halting algorithm µ_{h′} in the list B with σ_{h′} = σ_h and σ_h < h′. This means that there exists a measure µ_k in B for which the data sequence x_1, x_2, . . . is typical and σ_k < k, with k least.
EXAMPLE 3: Let us look at some applications. Define the primitive recursive measures as the set of objects for which it is decidable that they are measures constructed from primitive recursive functions. Let L be a finite set of cardinality l, and let f_1, f_2, . . . be a c.e. set of primitive recursive functions with domain L^∗. Computably enumerate the strings x ∈ L^∗ in lexicographic length-increasing order. Then every string can be viewed as the integer giving its position in this order. Let ε denote the empty word, that is, the string of length 0. (Confusion with the notation ε for a small quantity is avoided by the context.) Define µ_i(ε) = f_i(ε)/f_i(ε) = 1, and inductively, for x ∈ L^∗ and a ∈ L, define µ_i(xa) = µ_i(x) f_i(xa) / ∑_{b∈L} f_i(xb). Then µ_i(x) = ∑_{a∈L} µ_i(xa) for all x ∈ L^∗. Therefore µ_i is a measure. Call the c.e. set µ_1, µ_2, . . . the simple measures. The theorem shows that identification in the limit is possible for the set of simple measures. ♦

(Theorem 2 and Theorem 1 are incomparable, although it is tempting to think the latter is a corollary of the former. The infinite sequences considered in Theorem 2 are typical for some computable measure. Restricted to i.i.d. measures (the case of Theorem 1), such sequences are a proper subset of those resulting from i.i.d. draws from the corresponding probability mass function. This is the reason why the result of Theorem 2 is "certain" and the result of Theorem 1 is "almost surely".)
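Returning to the construction in Example 3, the following Python sketch builds such a normalized measure from a given positive function f on strings. It assumes the reading of the recursion given above (each value µ(x) is split among the one-symbol extensions of x in proportion to f); the particular f and alphabet are toy assumptions for illustration only.

from functools import lru_cache

ALPHABET = ('a', 'b')

def f(x):
    # A toy positive function on strings standing in for a primitive recursive f_i.
    return 1 + x.count('a')

@lru_cache(maxsize=None)
def mu(x):
    """Measure induced by f: mu('') = 1 and mu(xa) = mu(x) * f(xa) / sum_b f(xb)."""
    if x == '':
        return 1.0
    prefix, last = x[:-1], x[-1]
    total = sum(f(prefix + b) for b in ALPHABET)
    return mu(prefix) * f(prefix + last) / total

# The measure equality mu(x) = sum_a mu(xa) holds by construction:
print(mu('ab'), sum(mu('ab' + c) for c in ALPHABET))  # both approximately 0.2667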
V. PREDICTION
In Section III the data are drawn i.i.d. according to an appropriate probability mass function p on the elements of L. Given p, we can predict the probability p(a | x_1, . . . , x_n) that the next draw results in an element a when the previous draws resulted in x_1, . . . , x_n. (The resulting measure on L^∞ is called an i.i.d. measure.) Once we have identified p, prediction is possible (actually after a finite but unknown running time of the identifying algorithm). The same holds for ergodic Markov chains (Corollary 1). This is reassuring for cognitive scientists and neuroscientists who see prediction as fundamental to cognition ([16], [22], [29], [35]).

For general measures as in Section IV, allowing dependent data, the situation is quite different. We can meet the so-called black swan phenomenon of [50]. Let us give a simple example. The data sequence a, a, . . . is typical (Definition 4) for the measure µ_1 defined by µ_1(x) = 1 for every data sequence x consisting of a finite or infinite string of a's and µ_1(x) = 0 otherwise. But a, a, . . . is also typical for a measure µ_2 which divides its probability between two kinds of sequences: µ_2(x) > 0 for every string x consisting of a finite or infinite string of a's, µ_2(x) > 0 for every string x consisting of an initial fixed number n_0 of a's followed by a finite or infinite string of b's, and µ_2(x) = 0 otherwise. Then µ_1 and µ_2 give different predictions after an initial n_0-length sequence of a's. Given a data sequence consisting initially of only a's, a sensible algorithm will predict a as the most likely next symbol. However, if the initial data sequence consists of n_0 symbols a, then for µ_1 the next symbol is a with probability 1, whereas for µ_2 the next symbol may be either a or b, each with positive probability. Therefore, while the i.i.d. case allows us to predict reliably, in the dependent case there is in general no reliable predictor for the next symbol. In [5], however, Blackwell and Dubins show that under certain conditions the predictions of two measures merge asymptotically almost surely.

VI. CONCLUSION
Many psychological theories see learning from data, whether sensory or linguistic, as a central function of the brain. Such learning faces great practical difficulties—the space of possible structures is very large and difficult to search, the computational power of the brain is limited, and the amount of available data may be limited. But it is not clear under what circumstances such learning is possible even with unlimited data and computational resources. Here we have shown that, under surprisingly general conditions, some positive results about identification in the limit in such contexts can be established.

Using an infinite sequence of elements (or a finite sequence of large enough but unknown length) from a set of natural numbers, algorithms are exhibited that identify in the limit the probability distribution associated with this set. This happens in two cases. (i) The underlying set is countable and the target distribution is a probability mass function (i.i.d. measure) in a c.e. or co-c.e. set of computable probability mass functions. The elements of the sequence are drawn i.i.d. according to this probability (Theorem 1). This result is extended to computable ergodic Markov chains (Corollary 1). (ii) The underlying set is countable and the infinite sequence is possibly dependent and is typical for a computable measure in a c.e. or co-c.e. set of computable measures (Theorem 2).

In the i.i.d. case and the ergodic Markov chain case the target is identified in the limit almost surely, and in the dependent case the target computable measure is identified in the limit surely—however it is not unique, but one out of a set of satisfactory computable measures. In the i.i.d. case and the Markov case we use the strong law of large numbers. For the dependent case we use typicality according to the theory developed by Martin-Löf in [45], embedded in the theory of Kolmogorov complexity.

In the i.i.d., the Markovian, and the dependent settings, eventually we guess an index of the target (or of one target out of some possible targets in the measure case) and stick to this guess forever. This last guess is correct. However, we do not know when the guess becomes permanent. We use only a finite initial segment of the data sequence, of unknown length. The target for which the guess is correct is described by an appropriate Turing machine computing the probability mass function, Markov chain, or measure, respectively.

These results concerning algorithms for identification in the limit consider what one might term the "outer limits" of what is learnable, by abstracting away from the computational restrictions and the finite amount of data available to human learners. Nonetheless, such general results may be informative when attempting to understand what is learnable in more restricted settings. Most straightforwardly, that which is not learnable in the unrestricted case will, a fortiori, not be learnable when computational or data restrictions are added. It is also possible that some of the proof techniques used in the present context can be adapted to analyse more restricted, and hence more cognitively realistic, settings.
APPENDIX

A. Computability
We can interpret a pair of integers such as (a, b) as the rational a/b. A real function f with rational argument is lower semicomputable if it is defined by a rational-valued computable function φ(x, k), with x a rational number and k a nonnegative integer, such that φ(x, k+1) ≥ φ(x, k) for every k and lim_{k→∞} φ(x, k) = f(x). This means that f can be computably approximated arbitrarily closely from below (see [42], p. 35). A function f is upper semicomputable if −f is lower semicomputable. If a real function is both lower semicomputable and upper semicomputable then it is computable. A function f : N → R^+ is a probability mass function if ∑_x f(x) = 1. It is customary to write p(x) for f(x) if the function involved is a probability mass function.

A set A ⊆ N is computably enumerable (c.e.) when we can compute an enumeration a_1, a_2, . . . with a_i ∈ A (i ≥ 1). A c.e. set is also called recursively enumerable (r.e.). A co-c.e. set B ⊆ N is a set whose complement N \ B is c.e. (A set is c.e. iff it is at level Σ_1 of the arithmetic hierarchy, and it is co-c.e. iff it is at level Π_1.) If a set is both c.e. and co-c.e. then it is computable. A halting algorithm for a computable function f : N → R is an algorithm which, given an argument x and any rational ε > 0, computes a total computable rational function f̂ : N × Q → Q such that |f(x) − f̂(x, ε)| ≤ ε.
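To make these definitions concrete, here is a small Python illustration (a toy under stated assumptions, not part of the formal development): the partial sums of the exponential series give a rational-valued φ(x, k) that increases to e^x for nonnegative rational x, showing that this function is lower semicomputable, and a simple tail bound turns the approximation into a halting algorithm that outputs a rational within ε of the true value; the tail bound used assumes 0 ≤ x ≤ 1.

from fractions import Fraction

def exp_lower(x, k):
    """phi(x, k): the k-th partial sum of the series for e^x at rational x >= 0.
    Nondecreasing in k and converging to e^x from below (lower semicomputability)."""
    x = Fraction(x)
    term, total = Fraction(1), Fraction(1)
    for i in range(1, k + 1):
        term *= x / i
        total += term
    return total

def exp_halting(x, eps):
    """A 'halting algorithm' in the sense above: return a rational within eps of e^x,
    for rational 0 <= x <= 1 and rational eps > 0, using the bound tail <= 2 * next term."""
    x, eps = Fraction(x), Fraction(eps)
    k, term, total = 0, Fraction(1), Fraction(1)
    while 2 * term > eps:
        k += 1
        term *= x / k
        total += term
    return total

print(float(exp_halting(Fraction(1, 2), Fraction(1, 1000))))  # about 1.6487 (e^0.5)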
EXAMPLE 4: We give an example of the relation between co-c.e. sets and identification in the limit. Consider a c.e. set A of objects and the co-c.e. set B such that N \ B = A. We call the members of B the good objects and the members of A the bad objects. We do not know in what order the bad objects are enumerated or repeated; however, we do know that the remaining items are the good objects. These good objects, with possible repetitions, form the enumeration of B. It takes unknown time to enumerate an initial segment of B, but we are sure this happens eventually. Hence to identify the k-th element in the enumeration of B requires identification of the 1st, . . . , (k−1)-st elements. This constitutes identification in the limit. ♦
EXAMPLE 5: It is known that the overwhelming majority of real numbers are not computable. If a real number a is lower semicomputable but not computable, then we can computably find nonnegative integers a_1, a_2, . . . and b_1, b_2, . . . such that a_n/b_n ≤ a_{n+1}/b_{n+1} and lim_{n→∞} a_n/b_n = a. If a is the probability of success in a trial then this gives an example of a lower semicomputable probability mass function which is not computable. ♦

Suppose we are concerned with all and only computable probability mass functions. There are countably many since there are only countably many computable functions. But can we computably enumerate them?
LEMMA 1: (i) Let L ⊆ N be infinite. The computable positive probability mass functions on L are not c.e.

(ii) Let L ⊆ N with |L| ≥ 2. The computable positive measures on L^∞ are not c.e.

PROOF. (i) Assume to the contrary that the lemma is false and the computable enumeration is p_1, p_2, . . . . Compute a probability mass function p with p(a_i) ≠ p_i(a_i), where a_i ∈ L is the i-th element of L, as follows. If i is odd then p(a_i) := p_i(a_i) + p_i(a_i)p_{i+1}(a_{i+1}) and p(a_{i+1}) := p_{i+1}(a_{i+1}) − p_i(a_i)p_{i+1}(a_{i+1}). By construction p is a computable positive probability mass function but different from any p_i in the enumeration p_1, p_2, . . . .

(ii) The set L^∗ is c.e. Hence the set of cylinders in L^∞ is c.e. Therefore (ii) reduces to (i). •
REMARK 1: Every probability mass function is positive on some support L ≠ ∅ and 0 otherwise. Hence Lemma 1 holds for all probability mass functions. ✸

B. Kolmogorov Complexity
We need the theory of Kolmogorov complexity [42] (originally in [39]; the prefix version we use here is originally from [41]). A prefix Turing machine is a Turing machine with a one-way read-only input tape with a distinguished tape cell called the origin, a finite number of two-way read-write working tapes on which the computation takes place, an auxiliary tape on which the auxiliary string y ∈ {0, 1}^∗ is written, and a one-way write-only output tape. At the start of the computation the input tape is infinitely inscribed from the origin onwards, and the input head is on the origin. The machine operates with a binary alphabet. If the machine halts then the input head has scanned a segment of the input tape from the origin onwards. We call this initial segment the program.

By the construction above, for every auxiliary y ∈ {0, 1}^∗, the set of programs is a prefix code: no program is a proper prefix of any other program. Consider a standard enumeration of all prefix Turing machines T_1, T_2, . . . . Let U denote a prefix Turing machine such that for every z, y ∈ {0, 1}^∗ and i ≥ 1 we have U(i, z, y) = T_i(z, y). That is, for each finite binary program z, auxiliary y, and machine index i ≥ 1, we have that U's execution on inputs i and z, y results in the same output as that obtained by executing T_i on input z, y. We call such a U a universal prefix Turing machine.

However, there are more ways a prefix Turing machine can simulate other prefix Turing machines. For example, let U′ be such that U′(i, zz, y) = T_i(z, y) for all i and z, y, and U′(p) = 0 for p not of the form i, zz, y for some i, z, y. Then U′ is also universal. To distinguish machines like U with nonredundant input from other universal machines, Kolmogorov [39] called them optimal.

Fix an optimal machine, say U. Define the conditional prefix Kolmogorov complexity K(x|y) for all x, y ∈ {0, 1}^∗ by K(x|y) = min_p {|p| : p ∈ {0, 1}^∗ and U(p, y) = x}. (Here U has two arguments rather than three. We consider the first argument to encode the first two arguments of the previous three.) For the same U, define the time-bounded conditional prefix Kolmogorov complexity K^t(x|y) = min_p {|p| : p ∈ {0, 1}^∗ and U(p, y) = x in t steps}. To obtain the unconditional versions of the prefix Kolmogorov complexities, set y = ε, where ε is the empty word (the word with no letters). It can be shown that K(x|y) is incomputable [39]. Clearly K^t(x|y) is computable if t < ∞. Moreover, K^{t′}(x|y) ≤ K^t(x|y) for every t′ ≥ t, and lim_{t→∞} K^t(x|y) = K(x|y).

C. Measures and Computability
C. Measures and Computability

Let L ⊆ N. Given a finite sequence x = x_1, x_2, . . . , x_n of elements of L, we consider the set of infinite sequences starting with x. The set of all such sequences is written as Γ_x, the cylinder of x. We associate a probability µ(Γ_x) with the event that an element of Γ_x occurs. Here we simplify the notation µ(Γ_x) and write µ(x). The closure of the cylinders under intersection, complement, and countable union gives a set of subsets of L^∞. The probabilities associated with these subsets are derived from the probabilities of the cylinders in standard ways [37]. A measure µ satisfies the following equalities:

µ(ε) = 1 and µ(x) = Σ_{a ∈ L} µ(xa).   (A.1)

Let x_1, x_2, . . . be an infinite sequence of elements of L. The sequence is typical for a computable measure µ if it passes all computable sequential tests (known and unknown alike) for randomness with respect to µ. These tests are formalized by Martin-Löf [45]. One of the highlights of the theory of Martin-Löf is that the sequence passes all these tests iff it passes a single computable universal test [42, Corollary 4.5.2 on p. 315]; see also [45].
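As a concrete toy instance of a computable measure (our own choice, not an example from the text), the sketch below specifies the i.i.d. uniform measure on L = {0, 1} by its cylinder values and checks the conditions (A.1) on all short cylinders.

from fractions import Fraction
from itertools import product

L = [0, 1]  # a toy finite alphabet; the text allows any L ⊆ N

def mu(x):
    # Cylinder probability of the i.i.d. uniform measure on L:
    # mu(x_1 ... x_n) = (1/|L|)^n, and mu of the empty sequence is 1.
    return Fraction(1, len(L)) ** len(x)

# Check the measure conditions (A.1): mu(eps) = 1 and mu(x) = sum over a in L of mu(xa).
assert mu(()) == 1
for n in range(4):
    for x in product(L, repeat=n):
        assert mu(x) == sum(mu(x + (a,)) for a in L)
print("conditions (A.1) hold on all cylinders up to length 4")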
DEFINITION 4: Let x_1, x_2, . . . be an infinite sequence of elements of L ⊆ N. The sequence is typical, or random, for a computable measure µ iff

sup_n { log 1/µ(x_1 . . . x_n) − K(x_1 . . . x_n) } < ∞.   (A.2)

The set of infinite sequences that are typical with respect to a measure µ has µ-measure one. The theory and properties of such sequences for computable measures are extensively treated in [42, Chapter 4]. There the term K(x_1 . . . x_n) in (A.2) is given as K(x_1 . . . x_n | µ). However, since µ is computable we have K(µ) < ∞ and therefore K(x_1 . . . x_n | µ) ≤ K(x_1 . . . x_n) + O(1).
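The quantity in (A.2), the randomness deficiency, cannot be computed because K is incomputable. Purely as an illustration of its behaviour, the sketch below replaces K by the length of a zlib-compressed encoding (an upper bound on K up to additive constants) and evaluates the resulting proxy for a Bernoulli measure; the measure, the proxy, and the sample sizes are our own choices.

import math
import random
import zlib

def deficiency_proxy(bits, p):
    # Heuristic stand-in for log 1/mu(x_1...x_n) - K(x_1...x_n) in (A.2), where mu is the
    # i.i.d. Bernoulli(p) measure on {0, 1} and the zlib-compressed length (in bits) is
    # used as a crude upper bound on K. K itself is incomputable, so this is illustration only.
    log_inv_mu = -sum(math.log2(p if b == 1 else 1 - p) for b in bits)
    k_upper = 8 * len(zlib.compress(bytes(bits)))
    return log_inv_mu - k_upper

random.seed(0)
typical = [1 if random.random() < 0.5 else 0 for _ in range(4000)]  # sampled from mu
atypical = [1] * 4000                                               # all ones
print(deficiency_proxy(typical, 0.5))   # small or negative: the sample looks typical for mu
print(deficiency_proxy(atypical, 0.5))  # large: the all-ones sequence is not typical for Bernoulli(1/2)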
EXAMPLE 6: Let k be a positive integer and fix an a ∈ {1, . . . , k}. Define the measure µ_k by µ_k(ε) = 1, by µ_k(x_1 . . . x_n) = 1/k for n ≥ 1 with x_i = a for every 1 ≤ i ≤ n, and by distributing the remaining probability mass 1 − 1/k over the cylinders of the strings that contain an element different from a. Then K(a . . . a) (a sequence of n elements a) equals K(a, n) + O(1) = O(log n + log k). (A sequence of n elements a is described by n in O(log n) bits and a in O(log k) bits.) By (A.2) we have sup_{n∈N} { log 1/µ_k(a . . . a) − K(a . . . a) } < ∞. Therefore the infinite sequence a, a, . . . is typical for every µ_k. However, an infinite sequence y_1, y_2, . . . with y_i ∈ {1, . . . , k} (i ≥ 1) and y_i ≠ y_{i+1} for some i is not typical for µ_k, since sup_{n∈N} { log 1/µ_k(y_1 y_2 . . . y_n) − K(y_1 y_2 . . . y_n) } = ∞. ♦

Since k can be any positive integer, the example shows that an infinite sequence of data can be typical for more than one measure. Hence our task is not to identify a single computable measure according to which the data sequence was generated as a typical sequence, but to identify a computable measure that could have generated the data sequence as a typical sequence.

D. Proofs of the Theorems
PROOF OF THEOREM 1: I.I.D. COMPUTABLE PROBABILITY IDENTIFICATION. Let L ⊆ N, and let X_1, X_2, . . . be a sequence of mutually independent random variables, each of which is a copy of a single random variable X with probability mass function P(X = a) = p(a) for a ∈ L. Without loss of generality p(a) > 0 for all a ∈ L. Let a(x_1, x_2, . . . , x_n) denote the number of times x_i = a (1 ≤ i ≤ n) for some fixed a ∈ L.
CLAIM 1: If the outcomes of the random variables X_1, X_2, . . . are x_1, x_2, . . ., then almost surely for all a ∈ L we have

lim_{n→∞} ( p(a) − a(x_1, x_2, . . . , x_n)/n ) = 0.   (A.3)

PROOF. The strong law of large numbers (originally in [38], see also [37] and [8]) states that if we perform the same experiment a large number of times, then almost surely the number of successes divided by the number of trials goes to the expected value, provided the mean exists; see the theorem on top of page 260 in [23]. To determine the probability of an a ∈ L we consider the random variable X_a with just two outcomes {a, ā}. This X_a is a Bernoulli process (q_a, 1 − q_a), where q_a = p(a) is the probability of a and 1 − q_a = Σ_{b ∈ L\{a}} p(b) is the probability of ā. If we set ā = min(L \ {a}), then the mean µ_a of X_a is µ_a = a q_a + ā(1 − q_a) ≤ max{a, ā} < ∞. Thus, every a ∈ L incurs a random variable X_a with a finite mean. Therefore the relative frequency of successes a(x_1, x_2, . . . , x_n)/n converges almost surely to q_a as n → ∞. The claim follows. •

Let A be a list of a c.e. or co-c.e. set of halting algorithms for computable probability mass functions. If q ∈ A and q = p then for every ε > 0 and a ∈ L we have |p(a) − q(a)| ≤ ε. By Claim 1, almost surely

lim_{n→∞} max_{a∈L} ( q(a) − a(x_1, x_2, . . . , x_n)/n ) = 0.   (A.4)

If q ∈ A and q ≠ p then there is an a ∈ L and a constant δ > 0 such that |p(a) − q(a)| > δ. Again by Claim 1, almost surely

lim_{n→∞} max_{a∈L} | q(a) − a(x_1, x_2, . . . , x_n)/n | > δ.   (A.5)

In the proof of the strong law of large numbers [23, p. 204] it is shown that if we draw x_1, x_2, . . . i.i.d. from a set L ⊆ N according to a probability mass function p, then almost surely the size of the fluctuations in going to the limit (A.4) satisfies |np(a) − a(x_1, x_2, . . . , x_n)| / √(np(a)p(ā)) < √(λ lg n) for every λ > 2 and all sufficiently large n, for all a ∈ L. Here lg denotes the natural logarithm. Since p(a)p(ā) ≤ 1/4 and we may take λ = 4, we have |p(a) − a(x_1, x_2, . . . , x_n)/n| < √((lg n)/n) for all but finitely many n.

Let q ∈ A. For q ≠ p there is an a ∈ L such that, by (A.5) and the fluctuations in going to that limit, we have |q(a) − a(x_1, x_2, . . . , x_n)/n| > δ − √((lg n)/n) for all but finitely many n. Since δ > 0 is a constant, we have √((lg n)/n) < δ/2 for all but finitely many n. Hence |q(a) − a(x_1, x_2, . . . , x_n)/n| > √((lg n)/n) for all but finitely many n.

Let A = q_1, q_2, . . . and p = q_k with k least. We give an algorithm with as output a sequence of indexes i_1, i_2, . . . such that all but finitely many indexes are k. If L = {a_1, a_2, . . .} is infinite then the algorithm will only use a finite subset of it. Hence we need to define this finite subset and show that the remaining elements can be ignored. Let A_n = {a ∈ L : a(x_1, x_2, . . . , x_n) > 0}. In case a ∈ L but a ∉ A_n we still have |q_k(a) − a(x_1, x_2, . . . , x_n)/n| ≤ √((lg n)/n) for all but finitely many n. Now define the following sets. For each q_i ∈ A let B_{i,n} = {a_1, . . . , a_m} with m least such that Σ_{j=m+1}^{∞} q_i(a_j) = 1 − Σ_{j=1}^{m} q_i(a_j) < √(1/n). Therefore, if a ∈ L \ B_{i,n} then q_i(a) < √(1/n). In contrast to the possibly infinite L, the sets A_n and B_{i,n} are finite for all n and i. Define L_{i,n} = A_n ∪ B_{i,n}.
Since L_{i,n} ⊆ L we have for every a ∈ L_{i,n} that |q_k(a) − a(x_1, x_2, . . . , x_n)/n| ≤ √((lg n)/n) for all but finitely many n. However, for q_i ≠ q_k there is an a ∈ L_{i,n}, but no a ∈ L \ L_{i,n}, such that |q_i(a) − a(x_1, x_2, . . . , x_n)/n| > √((lg n)/n) for all but finitely many n. This leads to the following algorithm, with I the set of indexes of the elements in A:

for n := 1, 2, . . .
    I := ∅;
    for i := 1, 2, . . . , n
        if max_{a ∈ L_{i,n}} |q_i(a) − a(x_1, x_2, . . . , x_n)/n| < √((lg n)/n) then I := I ∪ {i};
    i_n := min I

With probability 1, for every i < k we have i ∉ I for all but finitely many n, while k ∈ I for all but finitely many n. (Note that for every n = 1, 2, . . . the main term in the above algorithm is computable even if L is infinite.) The theorem is proven. •
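The following sketch runs the algorithm just described on toy data, restricted for readability to a finite alphabet, so that the finite sets A_n and B_{i,n} are not needed and L_{i,n} is simply L. The candidate list, the sample, and the function names are our own illustrative assumptions.

import math
import random

# Toy candidate list A = q_1, q_2, q_3 of computable probability mass functions on L = {0, 1, 2}.
candidates = [
    {0: 0.2, 1: 0.3, 2: 0.5},
    {0: 1/3, 1: 1/3, 2: 1/3},
    {0: 0.1, 1: 0.6, 2: 0.3},
]

def identify(sample_prefix):
    # One round of the algorithm in the proof: with n = len(sample_prefix), return the least
    # index i (1-based) such that max over a of |q_i(a) - a(x_1..x_n)/n| < sqrt(lg n / n),
    # or None if no candidate passes the test yet.
    n = len(sample_prefix)
    if n < 2:
        return None
    threshold = math.sqrt(math.log(n) / n)   # lg is the natural logarithm, as in the text
    freq = {a: sample_prefix.count(a) / n for a in candidates[0]}
    passing = [i for i, q in enumerate(candidates, start=1)
               if max(abs(q[a] - freq[a]) for a in q) < threshold]
    return min(passing) if passing else None

# Draw an i.i.d. sample from q_2 (the uniform distribution) and watch the guesses settle.
random.seed(1)
data = random.choices([0, 1, 2], weights=[1, 1, 1], k=20000)
for n in (100, 1000, 20000):
    print(n, identify(data[:n]))  # early guesses may differ, but almost surely i_n = 2 for all large n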
PROOF OF THEOREM 2: COMPUTABLE MEASURE IDENTIFICATION.
For the Kolmogorov complexity notions see Appendix B. For the theory of computable measures, see Appendix C. In particular we use the criterion of Definition 4 in Appendix C to show that an infinite sequence is typical in Martin-Löf's sense. The given data sequence x_1, x_2, . . . is by assumption typical for some computable measure µ in a c.e. or co-c.e. set of computable measures and hence satisfies (A.2) with respect to µ. We stress that the data sequence is possibly typical for different computable measures. Therefore we cannot speak of the single true computable measure, but only of a computable measure for which the data is typical. Let B be an enumeration of halting algorithms for a c.e. or co-c.e. set of computable measures such that each element occurs infinitely many times in the list. If the enumeration is such that each element occurs only finitely many times, then it can be changed into one where each element occurs infinitely many times: for instance, by repeating the first element after every position in the original enumeration, repeating the second element of the original enumeration after every second position in the resulting enumeration, and so on.
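The re-listing trick can be written as a short generator. The interleaving below, which replays an ever-growing prefix of the original enumeration, is one standard way of achieving the effect (our choice; the text describes a slightly different but equivalent interleaving).

from itertools import count, islice

def recur_forever(enumeration):
    # Turn an enumeration e_1, e_2, ... into one in which every element occurs infinitely
    # often, by emitting the prefixes e_1; e_1, e_2; e_1, e_2, e_3; and so on.
    seen = []
    source = iter(enumeration)
    for _ in count():
        seen.append(next(source))
        for item in seen:   # replay the whole prefix, so e_k recurs in every later round
            yield item

print(list(islice(recur_forever(count(1)), 15)))
# [1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5]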
CLAIM 2: There is an algorithm with as input an enumeration B = µ_1, µ_2, . . . and as output a sequence of indexes i_1, i_2, . . . such that for every large enough n we have i_n = k, with µ_k a computable measure for which the data sequence is typical.

PROOF. Define for µ_i in B

σ_i(j) = log 1/µ_i(x_1 . . . x_j) − K(x_1 . . . x_j).

Since K is upper semicomputable and µ_i is computable, the function σ_i(j) is lower semicomputable for each j. Define the n-th value in the lower semicomputation of σ_i(j) as σ_i^n(j). By (A.2), the data sequence x_1, x_2, . . . is typical for µ_i if sup_{j≥1} σ_i(j) = σ_i < ∞. In this case, since σ_i(j) is lower semicomputable, max_{1≤j≤n} σ_i^n(j) ≤ σ_i for all n. In contrast, if the data sequence is not typical for µ_i then sup_{j≥1} σ_i(j) = ∞, and hence max_{1≤j≤n} σ_i^n(j) → ∞ as n → ∞.

By assumption there exists a measure in B for which the data sequence is typical. Let µ_h be such a measure. Since halting algorithms for µ_h occur infinitely often in the enumeration B, there is a halting algorithm µ_{h′} in the enumeration B with σ_{h′} = σ_h and σ_h < h′. Therefore, there exists a measure µ_k in B for which the data sequence x_1, x_2, . . . is typical and σ_k < k, with k least. The algorithm to determine k is as follows.

for n := 1, 2, . . .
    if i ≤ n is least such that max_{1≤j≤n} σ_i^n(j) < i then output i_n := i else output i_n := 1.

Eventually max_{1≤j≤n} σ_k^n(j) < k for all large enough n, and k is the least index of an element in B for which this holds. Hence there exists an n_0 such that i_n = k for all n ≥ n_0. •

For large enough n we have by Claim 2 a test such that we can identify in the limit an index of a measure in B for which the provided data sequence is typical. Hence there is an n_0 such that i_n = k for all n ≥ n_0. We do not care what i_1, . . . , i_{n_0−1} are. This proves the theorem. •
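The sketch below illustrates the structure of the algorithm in Claim 2 on toy data. Because K is incomputable, the lower semicomputation σ_i^n(j) is replaced here by a heuristic proxy (log 1/µ_i minus a zlib upper bound on K), which reverses the direction of approximation, and the candidate measures are i.i.d. Bernoulli measures of our own choosing; the sketch therefore shows only the shape of the search, not a faithful implementation.

import math
import random
import zlib

def log_inv_mu(bits, p):
    # log 1/mu(x_1...x_j) for the i.i.d. Bernoulli(p) measure, a toy computable measure.
    return -sum(math.log2(p if b == 1 else 1 - p) for b in bits)

def sigma_proxy(bits, p):
    # Stand-in for sigma_i(j) = log 1/mu_i(x_1...x_j) - K(x_1...x_j): the incomputable K is
    # replaced by a zlib upper bound, so this only illustrates the shape of the test.
    return log_inv_mu(bits, p) - 8 * len(zlib.compress(bytes(bits)))

# Toy enumeration B of candidate measures: Bernoulli(p) for a few parameters p.
B = [0.95, 0.25, 0.5, 0.75, 0.05]

def identify_measure(data, n):
    # One round of the algorithm in Claim 2: output the least i <= n such that
    # max over j <= n of sigma_i(j) stays below the bound i; output 1 if no candidate qualifies.
    for i, p in enumerate(B[:n], start=1):
        if max(sigma_proxy(data[:j], p) for j in range(1, n + 1)) < i:
            return i
    return 1

random.seed(2)
data = [1 if random.random() < 0.25 else 0 for _ in range(2000)]  # typical for Bernoulli(0.25)
for n in (5, 200, 2000):
    print(n, identify_measure(data, n))  # early guesses may be wrong; the output settles on index 2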
E. Genesis of the Result

At the request of a referee we give a brief account of the genesis of the result. In version arXiv:1208.5003 we assumed that we were dealing with all computable probabilities and the necessary extensions to measures. The first part of the technical results dealt with i.i.d. drawing and ergodic Markov chains. Here a main ingredient was to appeal to the known result that computable semiprobability mass functions (those summing to at most 1) are computably enumerable in a linear list. By some tricks we sought to computably extract the probabilities proper from among them and use the law of large numbers. For the more difficult dependent case we resorted to measures. Here we used the known result that the computable semimeasures (where the equality signs in the measure conditions are replaced by inequality signs ≤) are likewise computably enumerable in a linear list. Again we sought to computably extract the measures proper from this list and use a (known) criterion saying that the measures for which the provided infinite sequence of examples is random (typical) keep a certain quantity finite. The proof in arXiv:1208.5003 entailed separating the bounded sequences of this quantity from the unbounded ones. This took a long time and effort. Subsequently, in [6] it was shown that the approach of arXiv:1208.5003 was in error: the authors showed by a very technical argument that identification of all computable probabilities and computable measures from infinite sequences of examples is impossible. Extensive email contact with one of the authors, Laurent Bienvenu, showed that the essential point was the extraction of the probabilities and measures from the above computable enumerations of all computable semiprobabilities and computable semimeasures. It turned out that we required computable enumerations or co-computable enumerations of computable probabilities and computable measures at the outset. This was done in arXiv:1311.7385. That is, the identification does not hold for all computable probabilities and computable measures, as in the too-large claims of arXiv:1208.5003, but for the subclass given by computable enumerations or co-computable enumerations of them. Furthermore, the very difficult argument separating bounded sequences from unbounded ones (in the dependent case) was replaced by a simple one reminiscent of the h-index in citation science: a bounded infinite sequence has an (unknown) bound, but if the measures involved are enumerated with repetitions then eventually the index of one of them (there are infinitely many) will pass this bound.

ACKNOWLEDGEMENT
We thank Laurent Bienvenu for pointing out an error in an earlier version and for elucidating comments. Drafts of this paper have proceeded since 2012, in various states of correctness, through arXiv:1208.5003 to arXiv:1311.7385.
REFERENCES

[1] Ackley, D. H., Hinton, G. E., & Sejnowski, T. J. (1985). A learning algorithm for Boltzmann machines. Cognitive Science, 147–169.
[2] Angluin, D. (1988). Identifying languages from stochastic examples. Technical report, Yale University, Dept. of Computer Science, New Haven, Conn., USA.
[3] Baker, C. L. & McCarthy, J. J. (1981). The logical problem of language acquisition. Cambridge, MA: MIT Press.
[4] Barron, A. R. & Cover, T. M. (1991). Minimum complexity density estimation. IEEE Transactions on Information Theory, 1034–1054.
[5] Blackwell, D. & Dubins, L. (1962). Merging of opinions with increasing information. The Annals of Mathematical Statistics, 882–886.
[6] Bienvenu, L., Monin, B. & Shen, A. (2014). Algorithmic identification of probabilities is hard. Proc. Algorithmic Learning Theory, Springer Lecture Notes in Artificial Intelligence, Vol. 8776, 85–95.
[7] Bowerman, M. (1988). The 'No Negative Evidence' problem: How do children avoid constructing an overly general grammar? In J. Hawkins (Ed.), Explaining Language Universals (pp. 73–101). Oxford: Blackwell.
[8] Cantelli, F. P. (1917). Sulla probabilità come limite della frequenza. Rendiconti della R. Accademia dei Lincei, Classe di scienze fisiche, matematiche e naturali, Serie a, 39–45.
[9] Charniak, E. (1996). Statistical language learning. Cambridge, MA: MIT Press.
[10] Chater, N., Clark, A., Goldsmith, J., & Perfors, A. (2015, in press). Empiricist Approaches to Language Learning. Oxford, UK: Oxford University Press.
[11] Chater, N., & Oaksford, M. (2008). The probabilistic mind: Prospects for Bayesian cognitive science. Oxford: Oxford University Press.
[12] Chater, N., Tenenbaum, J. B., & Yuille, A. (2006). Probabilistic models of cognition: Conceptual foundations. Trends in Cognitive Sciences, 287–291.
[13] Chater, N., & Vitányi, P. M. B. (2007). Ideal learning of natural language: Positive results about learning from positive evidence. Journal of Mathematical Psychology, 135–163.
[14] Chomsky, N. (1980). Rules and representations. Behavioral and Brain Sciences, 1–15.
[15] Chomsky, N. (1982). Some concepts and consequences of the theory of government and binding. Cambridge, MA: MIT Press.
[16] Clark, A. (2013). Whatever next? Predictive brains, situated agents, and the future of cognitive science. Behavioral and Brain Sciences, 181–204.
[17] Christiansen, M. & Chater, N. (2015, in press). The Now-or-Never Bottleneck: A fundamental constraint on language. Behavioral and Brain Sciences.
[18] Clark, A., & Lappin, S. (2010). Linguistic Nativism and the Poverty of the Stimulus. Hoboken, NJ: John Wiley and Sons.
[19] Crain, S., & Lillo-Martin, D. C. (1999). An introduction to linguistic theory and language acquisition. Malden, MA: Blackwell.
[20] Deutsch, D. (1985). Quantum theory, the Church-Turing principle and the universal quantum computer. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 97–117.
[21] Duhem, P. (1914/1954). The Aim and Structure of Physical Theory, translated from the 2nd edition by P. W. Wiener; originally published as La Théorie Physique: Son Objet et sa Structure (Paris: Marcel Rivière & Cie.). Princeton, NJ: Princeton University Press.
[22] Elman, J. L. (1990). Finding structure in time. Cognitive Science, 179–211.
[23] Feller, W. (1968). An Introduction to Probability Theory and Its Applications, Volume 1 (Third edition). New York: Wiley.
[24] Gold, E. M. (1965). Limiting recursion. Journal of Symbolic Logic, 28–48.
[25] Gold, E. M. (1967). Language identification in the limit. Information and Control, 447–474.
[26] Gopnik, A., Meltzoff, A. N., & Kuhl, P. K. (1999). The scientist in the crib: Minds, brains, and how children learn. New York: William Morrow & Co.
[27] Haber, R. N. (1983). The impending demise of the icon: A critique of the concept of iconic storage in visual information processing. Behavioral and Brain Sciences, 1–54.
[28] Hahn, U., & Oaksford, M. (2008). Inference from absence in language and thought. In M. Oaksford & N. Chater (Eds.), The probabilistic mind: Prospects for Bayesian cognitive science (pp. 121–142). Oxford: Oxford University Press.
[29] Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature Neuroscience, 304–309.
[30] Hornstein, N., & Lightfoot, D. (1981). Explanation in linguistics: The logical problem of language acquisition. London: Longman.
[31] Hsu, A., Chater, N., & Vitányi, P. M. B. (2011). The probabilistic analysis of language acquisition: Theoretical, computational, and experimental analysis. Cognition, 380–390.
[32] Hsu, A. S., Horng, A., Griffiths, T. L., & Chater, N. (2016). When absence of evidence is evidence of absence: Rational inferences from absent data. Cognitive Science, 1–13.
[33] Jain, S., Osherson, D. N., Royer, J. S., & Sharma, A. (1999). Systems that Learn (Second edition). Cambridge, MA: MIT Press.
[34] Joshi, A. K., & Schabes, Y. (1997). Tree-adjoining grammars. In Handbook of formal languages (pp. 69–123). Berlin: Springer.
[35] Kilner, J. M., Friston, K. J., & Frith, C. D. (2007). Predictive coding: an account of the mirror neuron system. Cognitive Processing, 159–166.
[36] Knill, D. C., & Pouget, A. (2004). The Bayesian brain: the role of uncertainty in neural coding and computation. Trends in Neurosciences, 712–719.
[37] Kolmogorov, A. N. (1933). Grundbegriffe der Wahrscheinlichkeitsrechnung. Berlin: Springer-Verlag.
[38] Kolmogorov, A. N. (1930). Sur la loi forte des grandes nombres. C. R. Acad. Sci. Paris, 910–912.
[39] Kolmogorov, A. N. (1965). Three approaches to the quantitative definition of information. Problems of Information Transmission, 1–7.
[40] Lange, K. (2005). Applied Probability (corrected 2nd printing). New York: Springer.
[41] Levin, L. A. (1974). Laws of information conservation (non-growth) and aspects of the foundation of probability theory. Problems of Information Transmission, 206–210.
[42] Li, M. & Vitányi, P. M. B. (2008). An Introduction to Kolmogorov Complexity and Its Applications (Third edition). New York: Springer-Verlag.
[43] Li, S. Z. (2012). Markov random field modeling in computer vision. Berlin: Springer.
[44] Manning, C. & Klein, D. (2003). Natural language parsing. Advances in Neural Information Processing Systems 15: Proceedings of the 2002 Conference. Cambridge, MA: MIT Press.
[45] Martin-Löf, P. (1966). The definition of random sequences. Information and Control, 602–619.
[46] Oaksford, M., & Chater, N. (2007). Bayesian rationality: The probabilistic approach to human reasoning. Oxford: Oxford University Press.
[47] Pearl, J. (2014). Probabilistic reasoning in intelligent systems: networks of plausible inference. Burlington, MA: Morgan Kaufmann.
[48] Pickering, M. J., & Garrod, S. (2013). An integrated theory of language production and comprehension. Behavioral and Brain Sciences, 329–347.
[49] Pinker, S. (1979). Formal models of language learning. Cognition, 217–283.
[50] Popper, K. R. (1959). The Logic of Scientific Discovery. London: Hutchinson.
[51] Pouget, A., Beck, J. M., Ma, W. J., & Latham, P. E. (2013). Probabilistic brains: knowns and unknowns. Nature Neuroscience, 1170–1178.
[52] Pullum, G. K., & Scholz, B. C. (2002). Empirical assessment of stimulus poverty arguments. The Linguistic Review, 9–50.
[53] Quine, W. V. O. (1951). Two dogmas of empiricism. Reprinted in From a Logical Point of View, 2nd edition (pp. 20–46). Cambridge, MA: Harvard University Press.
[54] Reber, A. S. (1989). Implicit learning and tacit knowledge. Journal of Experimental Psychology: General, 118(3), 219.
[55] Rescorla, M. (2015). The computational theory of mind. In E. N. Zalta (Ed.), The Stanford Encyclopedia of Philosophy (Winter 2015 Edition). URL = <http://plato.stanford.edu/archives/win2015/entries/computational-mind/>.
[56] Rescorla, R. A. & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and nonreinforcement. In A. H. Black & W. F. Prokasy (Eds.), Classical Conditioning II: Current Theory and Research (pp. 64–99). New York: Appleton-Century.
[57] Saffran, J. R., Aslin, R. N. & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 1926–1928.
[58] Shannon, C. E. (1951). Prediction and entropy of printed English. Bell System Technical Journal, 50–64.
[59] Shepard, R. N. (1984). Ecological constraints on internal representation: resonant kinematics of perceiving, imagining, thinking, and dreaming. Psychological Review, 417–447.
[60] Shepard, R. N., Hovland, C. I., & Jenkins, H. M. (1961). Learning and memorization of classifications. Psychological Monographs: General and Applied, 1–42.
[61] Solomonoff, R. J. (1964). A formal theory of inductive inference, Part 1 and Part 2. Information and Control, 1–22, 224–254.
[62] Steedman, M. (2000). The syntactic process. Cambridge: MIT Press.
[63] Tenenbaum, J. B., Kemp, C., Griffiths, T. L., & Goodman, N. D. (2011). How to grow a mind: Statistics, structure, and abstraction. Science, 1279–1285.
[64] Turing, A. M. (1936). On computable numbers, with an application to the Entscheidungsproblem. Proceedings of the London Mathematical Society, 230–265.
[65] Yuille, A., & Kersten, D. (2006). Vision as Bayesian inference: analysis by synthesis? Trends in Cognitive Sciences, 301–308.
[66] Zeugmann, T. & Zilles, S. (2008). Learning recursive functions: a survey. Theoretical Computer Science, 4–56.