No Free Lunch versus Occam's Razor in Supervised Learning
Tor Lattimore and Marcus Hutter
Research School of Computer Science, Australian National University, and ETH Zürich and NICTA
{tor.lattimore, marcus.hutter}@anu.edu.au
15 November 2011
Abstract
The No Free Lunch theorems are often used to argue that domain specific knowledge is required to design successful algorithms. We use algorithmic information theory to argue the case for a universal bias allowing an algorithm to succeed in all interesting problem domains. Additionally, we give a new algorithm for off-line classification, inspired by Solomonoff induction, with good performance on all structured (compressible) problems under reasonable assumptions. This includes a proof of the efficacy of the well-known heuristic of randomly selecting training data in the hope of reducing the misclassification rate.
Keywords: Supervised learning; Kolmogorov complexity; no free lunch; Occam's razor.

1 Introduction
The No Free Lunch (NFL) theorems, stated and proven in various settings and domains [Sch94, Wol01, WM97], show that no algorithm performs better than any other when their performance is averaged uniformly over all possible problems of a particular type. These are often cited to argue that algorithms must be designed for a particular domain or style of problem, and that there is no such thing as a general purpose algorithm.

On the other hand, Solomonoff induction [Sol64a, Sol64b] and the more general AIXI model [Hut04] appear to universally solve the sequence prediction and reinforcement learning problems respectively. The key to the apparent contradiction is that Solomonoff induction and AIXI do not assume that each problem is equally likely. Instead they apply a bias towards more structured problems. This bias is universal in the sense that no class of structured problems is favoured over another. This approach is philosophically well justified by Occam's razor.

The two classic domains for NFL theorems are optimisation and classification. In this paper we examine classification and only remark that the case for optimisation is more complex. This difference is due to the active nature of optimisation, where actions affect future observations.

Previously, some authors have argued that the NFL theorems do not disprove the existence of universal algorithms for two reasons.

1. That taking a uniform average is not philosophically the right thing to do, as argued informally in [GCP05].
2. Carroll and Seppi in [CS07] note that the NFL theorem measures performance as misclassification rate, whereas in practice the utility of a misclassification in one direction may be more costly than another.

We restrict our consideration to the task of minimising the misclassification rate while arguing more formally for a non-uniform prior inspired by Occam's razor and formalised by Kolmogorov complexity. We also show that there exist algorithms (unfortunately only computable in the limit) with very good properties on all structured classification problems.

The paper is structured as follows. First, the required notation is introduced (Section 2). We then state the original NFL theorem, give a brief introduction to Kolmogorov complexity, and show that if a non-uniform prior inspired by Occam's razor is used, then there exists a free lunch (Section 3). Finally, we give a new algorithm inspired by Solomonoff induction with very attractive properties in the classification problem (Section 4). Such results were discussed, less formally, long before by Watanabe in 1969 [WD69].

2 Preliminaries
Here we introduce the required notation and define the problem setup for the No Free Lunch theorems.
Strings.
A finite string $x$ over alphabet $X$ is a finite sequence $x_1 x_2 x_3 \cdots x_{n-1} x_n$ with $x_i \in X$. An infinite string $x$ over alphabet $X$ is an infinite sequence $x_1 x_2 x_3 \cdots$. Alphabets are usually countable or finite, while in this paper they will almost always be binary. For finite strings we have a length function defined by $\ell(x) := n$ for $x = x_1 x_2 \cdots x_n$. The empty string of length 0 is denoted by $\epsilon$. The set $X^n$ is the set of all strings of length $n$. The set $X^*$ is the set of all finite strings. The set $X^\infty$ is the set of all infinite strings. Let $x$ be a string (finite or infinite); then substrings are denoted $x_{s:t} := x_s x_{s+1} \cdots x_{t-1} x_t$ where $s \le t$. A useful shorthand is $x_{<t} := x_{1:t-1}$.

Classification. Informally, a classification problem is the task of matching features to class labels. For example, recognizing handwriting, where the features are images and the class labels are letters. In supervised learning it is (usually) unreasonable to expect this to be possible without any examples of correct classifications. This can be solved by providing a list of feature/class label pairs representing the true classification of each feature. It is hoped that these examples can be used to generalize and correctly classify other features. The following definitions formalize classification problems, algorithms capable of solving them, as well as the loss incurred by an algorithm when applied to a problem, or set of problems. The setting is that of transductive learning as in [DEyM04].

Definition 1 (Classification Problem). Let $X$ and $Y$ be finite sets representing the feature space and class labels respectively. A classification problem over $X, Y$ is defined by a function $f : X \to Y$ where $f(x)$ is the true class label of feature $x$.

In the handwriting example, $X$ might be the set of all images of a particular size and $Y$ would be the set of letters/numbers, as well as a special symbol for images that correspond to no letter/number.

Definition 2 (Classification Algorithm). Let $f$ be a classification problem and $X_m \subseteq X$ be the training features on which $f$ will be known. We write $f_{X_m}$ to represent the function $f_{X_m} : X_m \to Y$ with $f_{X_m}(x) := f(x)$ for all $x \in X_m$. A classification algorithm is a function $A$, where $A(f_{X_m}, x)$ is its guess for the class label of feature $x \in X_u := X - X_m$ when given training data $f_{X_m}$. Note we implicitly assume that $X$ and $Y$ are known to the algorithm.

Definition 3 (Loss function). The loss of algorithm $A$, when applied to classification problem $f$, with training data $X_m$, is measured by counting the proportion of misclassifications in the testing data $X_u$,
$$L_A(f, X_m) := \frac{1}{|X_u|} \sum_{x \in X_u} [[A(f_{X_m}, x) \neq f(x)]]$$
where $[[\cdot]]$ is the indicator function defined by $[[expr]] = 1$ if $expr$ is true and 0 otherwise.

We are interested in the expected loss of an algorithm on the set of all problems, where expectation is taken with respect to some distribution $P$.

Definition 4 (Expected loss). Let $\mathcal{M}$ be the set of all functions from $X$ to $Y$ and $P$ be a probability distribution on $\mathcal{M}$. If $X_m$ is the training data then the expected loss of algorithm $A$ is
$$L_A(P, X_m) := \sum_{f \in \mathcal{M}} P(f) L_A(f, X_m).$$

We now use the above notation to give a version of the No Free Lunch Theorem of which Wolpert's is a generalization.

Theorem 5 (No Free Lunch). Let $P$ be the uniform distribution on $\mathcal{M}$. Then the following holds for any algorithm $A$ and training data $X_m \subseteq X$,
$$L_A(P, X_m) = (|Y| - 1)/|Y|. \qquad (1)$$
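The content of Theorem 5 can be checked by brute force on a toy instance. The sketch below is ours, not from the paper (the function names and the choice of algorithm are hypothetical): it enumerates every labeling $f : X \to Y$ of a four-element feature space, applies one arbitrary fixed algorithm, and recovers the uniform average loss $(|Y|-1)/|Y|$ of (1).

```python
from itertools import product

def nfl_average_loss(X, Y, X_m, algorithm):
    """Average loss of `algorithm` over ALL problems f: X -> Y under the
    uniform prior P, with the training features X_m revealed (Definition 4)."""
    X_u = [x for x in X if x not in X_m]
    total = 0.0
    for labels in product(Y, repeat=len(X)):        # enumerate every f in M
        f = dict(zip(X, labels))
        train = {x: f[x] for x in X_m}              # training data f_{X_m}
        errors = sum(algorithm(train, x) != f[x] for x in X_u)
        total += errors / len(X_u)                  # loss L_A(f, X_m)
    return total / len(Y) ** len(X)                 # uniform average over M

def majority_vote(train, x):
    """One arbitrary algorithm: predict the most frequent training label."""
    labels = list(train.values())
    return max(set(labels), key=labels.count)

X, Y = ["00", "01", "10", "11"], [0, 1]
print(nfl_average_loss(X, Y, X_m=["00", "01"], algorithm=majority_vote))
print((len(Y) - 1) / len(Y))   # both lines print 0.5, matching (1)
```

Replacing `majority_vote` with any other deterministic rule leaves the average unchanged, which is exactly the point of the theorem.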
The key to the proof is the following observation. Let $x \in X_u$; then for all $y \in Y$, $P(f(x) = y \mid f_{X_m}) = P(f(x) = y) = 1/|Y|$. This means no information can be inferred from the training data, which suggests no algorithm can be better than random.

Occam's razor/Kolmogorov complexity. The theorem above is often used to argue that no general purpose algorithm exists and that focus should be placed on learning in specific domains. The problem with the result is the underlying assumption that $P$ is uniform, which implies that training data provides no evidence about the true class labels of the test data. For example, if we have classified the sky as blue for the last 1,000 years then a uniform assumption on the possible sky colours over time would indicate that it is just as likely to be green tomorrow as blue, a result that goes against all our intuition.

How then do we choose a more reasonable prior? Fortunately, this question has already been answered heuristically by experimental scientists who must endlessly choose between one of a number of competing hypotheses. Given any experiment, it is easy to construct a hypothesis that fits the data by using a lookup table. However, such hypotheses tend to have poor predictive power compared to a simple alternative that also matches the data. This is known as the principle of parsimony, or Occam's razor, and suggests that simple hypotheses should be given a greater weight than more complex ones.

Until recently, Occam's razor was only an informal heuristic. This changed when Solomonoff, Kolmogorov and Chaitin independently developed the field of algorithmic information theory, which allows for a formal definition of Occam's razor. We give a brief overview here, while a more detailed introduction can be found in [LV08]. An in-depth study of the philosophy behind Occam's razor and its formalisation by Kolmogorov complexity can be found in [KLV97, RH11]. While we believe Kolmogorov complexity is the most foundational formalisation of Occam's razor, there have been other approaches such as MML [WB68] and MDL [Grü07]. These other techniques have the advantage of being computable (given a computable prior) and so lend themselves to good practical applications.

The idea of Kolmogorov complexity is to assign to each binary string an integer-valued complexity that represents the length of its shortest description. Strings with short descriptions are considered simple, while strings with long descriptions are complex. For example, the string consisting of 1,000,000 1's can easily be described as "one million ones". On the other hand, to describe a string generated by tossing a coin 1,000,000 times would likely require a description about 1,000,000 bits long. The key to formalising this intuition is to choose a universal Turing machine as the language of descriptions.

Definition 6 (Kolmogorov Complexity). Let $U$ be a universal Turing machine and $x \in \mathbb{B}^*$ be a finite binary string. Then define the plain Kolmogorov complexity $C(x)$ to be the length of the shortest program (description) $p$ such that $U(p) = x$,
$$C(x) := \min_{p \in \mathbb{B}^*} \{\ell(p) : U(p) = x\}.$$

It is easy to show that $C$ depends on the choice of universal Turing machine $U$ only up to a constant independent of $x$, and so it is standard to choose an arbitrary reference universal Turing machine.
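Although $C$ itself is incomputable, the intuition behind it can be illustrated with an off-the-shelf compressor, whose output length gives a crude upper bound on description length. The snippet below is a proxy of ours for the two strings discussed above, not a computation of $C$.

```python
import os
import zlib

simple = b"1" * 1_000_000             # "one million ones"
random_bits = os.urandom(125_000)      # 1,000,000 coin flips, packed into bytes

# Compressed size is an upper bound (plus compressor overhead) on description length.
print(len(zlib.compress(simple, 9)))       # roughly 1 KB: very compressible, "simple"
print(len(zlib.compress(random_bits, 9)))  # roughly 125 KB: incompressible, "complex"
```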
For technical reasons it is difficult to use $C$ as a prior, so Solomonoff introduced monotone machines to construct the Solomonoff prior $M$. A monotone Turing machine has one read-only input tape, which may only be read from left to right, and one write-only output tape, which may only be written to from left to right. It has any number of working tapes. Let $T$ be such a machine and write $T(p) = x$ to mean that after reading $p$, $x$ is on the output tape. The machines are called monotone because if $p$ is a prefix of $q$ then $T(p)$ is a prefix of $T(q)$. It is possible to show there exists a universal monotone Turing machine $U$, and this is used to define the monotone complexity $Km$ and Solomonoff's prior $M$.

Definition 7 (Monotone Complexity). Let $U$ be the reference universal monotone Turing machine; then define $Km$, $M$ and $KM$ as follows,
$$Km(x) := \min\{\ell(p) : U(p) = x{*}\} \qquad M(x) := \sum_{U(p) = x*} 2^{-\ell(p)} \qquad KM(x) := -\log M(x)$$
where $U(p) = x{*}$ means that when given input $p$, $U$ outputs $x$ possibly followed by more bits.

Some facts/notes follow.
1. For any $n$, $\sum_{x \in \mathbb{B}^n} M(x) \le 1$.
2. $Km$, $M$ and $KM$ are incomputable.
3. $0 < KM(x) \approx Km(x) \approx C(x) < \ell(x) + O(1)$.

(The approximation $C(x) \approx Km(x)$ is only accurate to $\log \ell(x)$, while $KM \approx Km$ is almost always very close [Gác83, Gác08]. This is a little surprising since the sum in the definition of $M$ contains $2^{-Km}$. It shows that there are only comparatively few short programs for any $x$.)

To illustrate why $M$ gives greater weight to simple $x$, suppose $x$ is simple; then there exists a relatively short monotone program $p$ computing it. Therefore $Km(x)$ is small and so $2^{-Km(x)} \approx M(x)$ is relatively large.

Since $M$ is a semi-measure rather than a proper measure, it is not appropriate to use it in place of $P$ when computing expected loss. However it can be normalized to a proper measure $M_{norm}$, defined inductively by
$$M_{norm}(\epsilon) := 1 \qquad M_{norm}(xb) := M_{norm}(x) \frac{M(xb)}{M(x0) + M(x1)},$$
which satisfies $\sum_{x \in \mathbb{B}^n} M_{norm}(x) = 1$.

We will also need to define $M$/$KM$ with side information: $M(y; x)$ is defined like $M(y)$ where $x$ is provided on a spare tape of the universal Turing machine. Now define $KM(y; x) := -\log M(y; x)$. This allows us to define the complexity of a function in terms of its output relative to its input.

Definition 8 (Complexity of a function). Let $X = \{x_1, \cdots, x_n\} \subseteq \mathbb{B}^k$ and $f : X \to \mathbb{B}$; then define the complexity of $f$, $KM(f; X)$, by
$$KM(f; X) := KM(f(x_1) f(x_2) \cdots f(x_n); x_1, x_2, \cdots, x_n).$$

An example is useful to illustrate why this is a good measure of the complexity of $f$.

Example 9. Let $X \subseteq \mathbb{B}^n$ for some $n$, $Y = \mathbb{B}$ and $f : X \to Y$ be defined by $f(x) = [[x_n = 1]]$. Now for a complex $X$, the string $f(x_1) f(x_2) \cdots$ might be difficult to describe, but there is a very short program that can output $f(x_1) f(x_2) \cdots$ when given $x_1, x_2, \cdots$ as input. This gives the expected result that $KM(f; X)$ is very small.
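The point of Example 9 is that $KM(f; X)$ is bounded, up to a machine-dependent constant, by the length of any program that prints the labels when the features are supplied as input. A small sketch of ours, using the length of a Python expression as a stand-in for description length on the reference machine:

```python
import random

random.seed(1)
# 500 "complex" 16-bit features
xs = ["".join(random.choice("01") for _ in range(16)) for _ in range(500)]

labels = "".join(x[-1] for x in xs)        # f(x) = [[x_n = 1]] applied to every feature

# A complete description of the labels *given* the features is this one-liner,
# whose length does not grow with |X|; hence KM(f; X) stays small.
description = "''.join(x[-1] for x in xs)"

print(len(description), "characters of description")     # constant in |X|
print(len(labels), "label bits, determined entirely by xs")
```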
Free lunch using Solomonoff prior. We are now ready to use $M_{norm}$ as a prior on a problem family. The following proposition shows that when problems are chosen according to the Solomonoff prior there is a (possibly small) free lunch. Before the proposition, we remark on problems with maximal complexity, $KM(f; X) \approx |X|$. In this case $f$ exhibits no structure allowing it to be compressed, which turns out to be equivalent to being random in every intuitive sense [ML66]. We do not believe such problems are any more interesting than trying to predict random coin flips. Further, the NFL theorems can be used to show that no algorithm can learn the class of random problems, by noting that almost all problems are random. Thus a bias towards random problems is not much of a bias (from uniform) at all, and so at most leads to a decreasingly small free lunch as the number of problems increases.

Proposition 1 (Free lunch under Solomonoff prior). Let $Y = \mathbb{B}$ and fix a $k \in \mathbb{N}$. Now let $X = \mathbb{B}^n$ and $X_m \subset X$ such that $|X_m| = 2^n - k$. For sufficiently large $n$ there exists an algorithm $A$ such that
$$L_A(M_{norm}, X_m) < 1/2.$$

Lemma 10. Let $N \subset \mathcal{M}$; then there exists an algorithm $A_N$ such that
$$\sum_{f \in N} P(f) L_{A_N}(f, X_m) \le \frac{1}{2} \sum_{f \in N} P(f).$$

Proof. Let $A_i$ with $i \in \{0, 1\}$ be the algorithm always choosing $i$. Note that
$$\sum_{f \in N} P(f) L_{A_0}(f, X_m) = \sum_{f \in N} P(f) \left(1 - L_{A_1}(f, X_m)\right).$$
The result follows easily.

Proof of Proposition 1. Let $\mathcal{M}_1$ be the set of all $f \in \mathcal{M}$ with $f(y) = 1$ for all $y \in X_m$ and $\mathcal{M}_0 = \mathcal{M} - \mathcal{M}_1$. Now construct an $A$ by
$$A(f_{X_m}, x) = \begin{cases} 1 & \text{if } f \in \mathcal{M}_1 \\ A_{\mathcal{M}_0}(f_{X_m}, x) & \text{otherwise} \end{cases}$$
where $A_{\mathcal{M}_0}$ is given by Lemma 10 applied to $N = \mathcal{M}_0$. Let $f_1 \in \mathcal{M}_1$ be the constant valued function such that $f_1(x) = 1$ for all $x$; then
$$L_A(M_{norm}, X_m) = \sum_{f \in \mathcal{M}} M_{norm}(f) L_A(f, X_m) \qquad (2)$$
$$= \sum_{f \in \mathcal{M}_0} M_{norm}(f) L_A(f, X_m) + \sum_{f \in \mathcal{M}_1} M_{norm}(f) L_A(f, X_m) \qquad (3)$$
$$\le \frac{1}{2} \sum_{f \in \mathcal{M}_0} M_{norm}(f) + \sum_{f \in \mathcal{M}_1} M_{norm}(f) L_A(f, X_m) \qquad (4)$$
$$\le \frac{1}{2} \sum_{f \in \mathcal{M}_0} M_{norm}(f) + \sum_{f \in \mathcal{M}_1 - \{f_1\}} M_{norm}(f) \qquad (5)$$
$$< \frac{1}{2}(1 - \delta) + \sum_{f \in \mathcal{M}_1 - \{f_1\}} M_{norm}(f) < \frac{1}{2} \qquad (6)$$
where (2) is definitional, (3) follows by splitting the sum into $\mathcal{M}_0$ and $\mathcal{M}_1$, (4) by the previous lemma, and (5) since loss is bounded by 1 and the loss incurred on $f_1$ is 0. The first inequality of (6) follows since it can be shown that there exists a $\delta > 0$, independent of $n$, with $M_{norm}(f_1) > \delta$. The second holds for sufficiently large $n$ because $\max_{f \in \mathcal{M}_1 - \{f_1\}} M_{norm}(f) \to 0$ as $n \to \infty$, while $|\mathcal{M}_1|$ is independent of $n$.

The proposition is unfortunately extremely weak. It is more interesting to know exactly what conditions are required to do much better than random. In the next section we present an algorithm with good performance on all well structured problems when given "good" training data. Without good training data, even assuming a Solomonoff prior, we believe it is unlikely that the best algorithm will perform well.

Note that while it appears intuitively likely that any non-uniform distribution such as $M_{norm}$ might offer a free lunch, this is in fact not true. It is shown in [SVW01] that there exist non-uniform distributions where the loss over a problem family is independent of the algorithm. These distributions satisfy certain symmetry conditions not satisfied by $M_{norm}$, which allows Proposition 1 to hold.

Solomonoff induction is well known to solve the online prediction problem, where the true value of each classification is known after each guess. In our setup, the true classification is only known for the training data, after which the algorithm no longer receives feedback. While Solomonoff induction can be used to bound the number of total errors while predicting deterministic sequences, it gives no indication of when these errors may occur. For this reason we present a complexity-inspired algorithm with better properties for the offline classification problem.

Before the algorithm we present a little more notation. As usual, let $X = \{x_1, x_2, \cdots, x_n\} \subseteq \mathbb{B}^k$, $Y = \mathbb{B}$ and let $X_m \subseteq X$ be the training data. Now define an indicator function $\chi$ by $\chi_i := [[x_i \in X_m]]$.

Definition 11. Let $f \in Y^X$ be a classification problem. The algorithm $A^*$ is defined in two steps,
$$\tilde{f} := \arg\min_{\tilde{f} \in Y^X} \left\{ KM(\tilde{f}; X) : \chi_i = 1 \implies \tilde{f}(x_i) = f(x_i) \right\}$$
$$A^*(f_{X_m}, x_i) := \tilde{f}(x_i).$$
Essentially $A^*$ chooses for its model the simplest $\tilde{f}$ consistent with the training data and uses this for classifying unseen data.
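The discussion later notes that $KM$ can be approximated by standard compression algorithms, which suggests a computable stand-in for $A^*$. Below is a minimal sketch of ours along those lines (all function names are hypothetical): it brute-forces every labeling consistent with the training data and keeps the one whose label string compresses best under zlib, a much cruder proxy than the conditional complexity $KM(\tilde f; X)$ of Definition 11, and only feasible for tiny test sets.

```python
from itertools import product
import random
import zlib

def compressed_len(labels):
    """Crude complexity proxy: zlib-compressed size of the label string."""
    return len(zlib.compress("".join(map(str, labels)).encode(), 9))

def a_star_approx(X, train):
    """X: ordered feature list; train: dict feature -> label on X_m.
    Returns the training-consistent labeling of X with minimal compressed length."""
    free = [x for x in X if x not in train]            # unseen features X_u
    best, best_len = None, None
    for guess in product((0, 1), repeat=len(free)):    # all consistent completions
        completion = dict(zip(free, guess))
        labels = [train.get(x, completion.get(x)) for x in X]
        size = compressed_len(labels)
        if best_len is None or size < best_len:
            best, best_len = dict(zip(X, labels)), size
    return best

# Toy problem in the spirit of Example 12 below: f(x) = first bit of x.
X = [format(i, "06b") for i in range(64)]
f = {x: int(x[0]) for x in X}
random.seed(3)
train = {x: f[x] for x in random.sample(X, 52)}        # randomly chosen training data
f_tilde = a_star_approx(X, train)
errors = sum(f_tilde[x] != f[x] for x in X if x not in train)
print(errors, "errors on", len(X) - len(train), "test points")  # typically 0 or close
```

Whether this proxy recovers the true labeling depends entirely on how well zlib captures the structure of $f$; no guarantee analogous to Theorem 14 applies to it.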
Note that the definition above only uses the values $y_i = f(x_i)$ where $\chi_i = 1$, and so it does not depend on unseen labels. If $KM(f; X)$ is "small" then the function we wish to learn is simple, so we should expect to be able to perform good classification, even given a relatively small amount of training data. This turns out to be true, but only with a good choice of training data. It is well known that training data should be "broad enough", and this is backed up by the example below and by Theorem 14, which give an excellent justification for random training data based on good theoretical (Theorem 14) and philosophical (AIT) underpinnings. The following example demonstrates the effect of bad training data on the performance of $A^*$.

[Figure 1: A simple problem. The graph plots $f(x)$ against $x$.]

Example 12. Let $X = \{00000, 00001, 00010, \cdots, 11110, 11111\}$ and $f(x)$ be defined to be the first bit of $x$, as in Figure 1. Now suppose $\chi_{1:16} = 1\cdots1$ (so the algorithm is only allowed to see the true class labels of $x_1$ through $x_{16}$). In this case, the simplest $\tilde{f}$ consistent with the first 16 data points, all of which are zeros, is likely to be $\tilde{f}(x) = 0$ for all $x \in X$, and so $A^*$ will fail on every piece of testing data! On the other hand, if $\chi$ is generated by tossing a coin for each feature, for example 001010011101101…, then $\tilde{f}$ will very likely be equal to $f$ and so $A^*$ will make no errors. Even if $\chi$ is zero about the critical point in the middle ($\chi_{16} = \chi_{17} = 0$) then $\tilde{f}$ should still match $f$ mostly around the left and right and will only be unsure near the middle.

Note, the above is not precisely true, since for small strings the dependence of $KM$ on the universal monotone Turing machine can be fairly large. However, if we increase the size of the example so that $|X|$ is large, the machine dependence becomes negligible.

Definition 13 (Entropy). Let $\theta \in [0, 1]$ and define
$$H(\theta) := \begin{cases} -[\theta \log \theta + (1 - \theta) \log(1 - \theta)] & \text{if } \theta \neq 0 \text{ and } \theta \neq 1 \\ 0 & \text{otherwise.} \end{cases}$$

Theorem 14. Let $\theta \in (0, 1)$ be the proportion of data to be given for training; then:
1. There exists a $\chi \in \mathbb{B}^\infty$ (training set) such that for all $n \in \mathbb{N}$, $\theta n - c_1 < \#_1(\chi_{1:n}) < \theta n + c_1$ and $n H(\theta) - c_2 < KM(\chi_{1:n})$ for some $c_1, c_2 \in \mathbb{R}^+$, where $\#_1(\chi_{1:n})$ denotes the number of 1's in $\chi_{1:n}$.
2. For $n = |X|$, the loss of algorithm $A^*$ when using training data determined by $\chi$ is bounded by
$$L_{A^*}(f, X_m) < \frac{KM(f; X) + KM(X) + c_2 + c_3}{n \left(1 - \theta - c_1/n\right) \log\left(1 - \theta + c_1/n\right)^{-1}}$$
where $c_3$ is some constant independent of all inputs.

This theorem shows that $A^*$ will do well on all problems satisfying $KM(f; X) = o(n)$ when given good (but not necessarily a lot of) training data. Before the proof, some remarks.
1. The bound is a little messy, but for small $\theta$, large $n$ and simple $X$ we get $L_{A^*}(f, X_m) \lessapprox KM(f; X)/(n\theta)$ (see the numerical sketch following these remarks).
2. The loss bound is extremely bad for large $\theta$. We consider this unimportant since we only really care if $\theta$ is small. Also, note that if $\theta$ is large then the number of points we have to classify is small and so we still make only a few mistakes.
3. The constants $c_1$, $c_2$ and $c_3$ are relatively small (around 100-500). They represent the length of the shortest programs computing simple transformations or encodings. This is dependent on the universal Turing machine used to define the Solomonoff distribution, but for a natural universal Turing machine we expect it to be fairly small [Hut04, sec.2.2.2].
4. The "special" $\chi$ is not actually that special at all. In fact, it can be generated easily with probability 1 by tossing a coin with bias $\theta$ infinitely often. More formally, it is a $\mu$ Martin-Löf random string where $\mu(1 \mid x) = \theta$ for all $x$. Such strings form a $\mu$-measure 1 set in $\mathbb{B}^\infty$.
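To get a feel for the numbers, the sketch below (ours, not from the paper) evaluates the right-hand side of the bound in Theorem 14 for hypothetical complexity values and compares it with the small-$\theta$ approximation of remark 1. Logarithms are taken base 2, and the constants are placeholders since the true machine-dependent values are unknown.

```python
import math

def theorem14_bound(km_f_given_x, km_x, n, theta, c1=100.0, c2=100.0, c3=100.0):
    """Right-hand side of the loss bound in Theorem 14 (logs base 2).
    c1, c2, c3 are unknown machine-dependent constants; values here are placeholders."""
    numerator = km_f_given_x + km_x + c2 + c3
    denominator = n * (1 - theta - c1 / n) * math.log2(1 / (1 - theta + c1 / n))
    return numerator / denominator

n, theta = 10**6, 0.05
km_f, km_x = 1000.0, 100.0       # hypothetical KM(f;X) and KM(X): simple f, simple X
print(theorem14_bound(km_f, km_x, n, theta))   # about 0.019
print(km_f / (n * theta))                      # remark 1 approximation: 0.02
```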
Proof of Theorem 14. The first part is a basic result in algorithmic information theory [LV08, p.318], obtained essentially by choosing $\chi$ to be Martin-Löf random with respect to a Bernoulli process parameterised by $\theta$. From now on, let $\bar{\theta} = \#_1(\chi_{1:n})/n$. For simplicity we write $x := x_1 x_2 \cdots x_n$, $y := f(x_1) f(x_2) \cdots f(x_n)$, and $\tilde{y} := \tilde{f}(x_1) \tilde{f}(x_2) \cdots \tilde{f}(x_n)$. Define the indicator $\psi$ by $\psi_i := [[\chi_i = 0 \wedge y_i = \tilde{y}_i]]$. Now note that there exists $c_3 \in \mathbb{R}$ such that
$$KM(\chi_{1:n}) < KM(\psi_{1:n}; y, \tilde{y}) + KM(y; x) + KM(\tilde{y}; x) + KM(x) + c_3. \qquad (7)$$
This follows since we can easily use $y$, $\tilde{y}$ and $\psi_{1:n}$ to recover $\chi_{1:n}$: $\chi_i = 1$ if and only if $y_i = \tilde{y}_i$ and $\psi_i = 0$. The constant $c_3$ is the length of the reconstruction program. Now $KM(\tilde{y}; x) \le KM(y; x)$ follows directly from the definition of $\tilde{f}$. We now compute an upper bound on $KM(\psi_{1:n}; y, \tilde{y})$. Let $\alpha := L_{A^*}(f, X_m)$ be the proportion of the testing data on which $A^*$ makes an error. The following is easy to verify:
1. $\#_1(\psi_{1:n}) = (1 - \alpha)(1 - \bar{\theta}) n$
2. $\#_0(\psi_{1:n}) = (1 - (1 - \alpha)(1 - \bar{\theta})) n$
3. $y_i \neq \tilde{y}_i \implies \psi_i = 0$
4. $\#_1(y \oplus \tilde{y}) = \alpha (1 - \bar{\theta}) n$, where $\oplus$ is the exclusive or function.
We can use point 3 above to trivially encode $\psi_i$ when $\tilde{y}_i \neq y_i$. Aside from these, there remain exactly $\bar{\theta} n + (1 - \alpha)(1 - \bar{\theta}) n$ positions, of which $(1 - \alpha)(1 - \bar{\theta}) n$ carry a 1, so $\psi_{1:n}$ can be encoded given $y$ and $\tilde{y}$ in approximately $n J(\bar{\theta}, \alpha)$ bits, which we substitute into (7),
$$n H(\bar{\theta}) - c_2 \le KM(\chi_{1:n}) \le KM(\psi_{1:n}; y, \tilde{y}) + KM(y; x) + KM(\tilde{y}; x) + KM(x) + c_3 \qquad (8)$$
$$\le KM(y; x) + KM(x) + n J(\bar{\theta}, \alpha) + c_3$$
where $J(\bar{\theta}, \alpha) := \left[\bar{\theta} + (1 - \bar{\theta})(1 - \alpha)\right] H\!\left(\bar{\theta} / \left[\bar{\theta} + (1 - \bar{\theta})(1 - \alpha)\right]\right)$. An easy technical result (Lemma 16 in the appendix) shows that for $\bar{\theta} \in (0, 1)$,
$$0 \le \alpha (1 - \bar{\theta}) \log \frac{1}{1 - \bar{\theta}} \le H(\bar{\theta}) - J(\bar{\theta}, \alpha).$$
Therefore $n \alpha (1 - \bar{\theta}) \log \frac{1}{1 - \bar{\theta}} \le KM(y; x) + KM(x) + c_2 + c_3$. The result follows by rearranging and using part 1 of the theorem.

Since the features are known, it is unexpected for the bound to depend on their complexity $KM(X)$. It is therefore not surprising that this dependence can be removed, at a small cost and with a little extra effort.

Theorem 15. Under the same conditions as Theorem 14, the loss of $A^*$ is bounded by
$$L_{A^*}(f, X_m) < \frac{KM(f; X) + 2\left[\log|X| + \log\log|X|\right] + c_4}{n \left(1 - \theta - c_1/n\right) \log\left(1 - \theta + c_1/n\right)^{-1}}$$
where $c_4$ is some constant independent of the inputs.

This version will be preferred to Theorem 14 in cases where $KM(X) > 2[\log|X| + \log\log|X|]$. The proof of Theorem 15 is almost identical to that of Theorem 14.

Proof sketch. The idea is to replace equation (7) by
$$KM(\chi_{1:n}, x) < KM(\psi_{1:n}; y, \tilde{y}) + KM(y; x) + KM(\tilde{y}; x) + KM(x) + c \qquad (9)$$
and then use identities such as $K(\chi_{1:n}; x, K(x)) + K(x) \le K(\chi_{1:n}, x) + O(1)$ (symmetry of information) together with bounds on $K(\ell(x))$ to remove the dependence on $KM(X)$.

Discussion. Proposition 1 shows that if problems are distributed according to their complexity, as Occam's razor suggests they should, then a (possibly small) free lunch exists. While the assumption of simplicity still represents a bias towards certain problems, it is a universal one in the sense that no style of structured problem is more favoured than another.

In Section 4 we gave a complexity-based classification algorithm and proved the following properties:
1. It performs well on problems that exhibit some compressible structure, $KM(f; X) = o(n)$.
2. Increasing the amount of training data decreases the error.
3. It performs better when given a good (broad/randomized) selection of training data.

Theorem 14 is reminiscent of the transductive learning bounds of Vapnik and others [DEyM04, Vap82, Vap00], but holds for all Martin-Löf random training data, rather than with high probability.
This is different to the predictive result in Solomonoff induction, where results hold with probability 1 rather than for all Martin-Löf random sequences [HM07]. If we assume the training set is sampled randomly, then our bounds are comparable to those in [DEyM04].

Unfortunately, the algorithm of Section 4 is incomputable. However, Kolmogorov complexity can be approximated via standard compression algorithms, which may allow for a computable approximation of the classifier of Section 4. Such approximations have had some success in other areas of AI, including general reinforcement learning [VNH+11] and unsupervised clustering [CV05].

Occam's razor is often thought of as the principle of choosing the simplest hypothesis matching your data. Our definition of simplest is the hypothesis that minimises $KM(f; X)$ (maximises $M(f; X)$). This is perhaps not entirely natural from the informal statement of Occam's razor, since $M(x)$ contains contributions from all programs computing $x$, not just the shortest. We justify this by combining Occam's razor with Epicurus' principle of multiple explanations, which argues for all consistent hypotheses to be considered. In some ways this is the most natural interpretation, as no scientist would entirely rule out a hypothesis just because it is slightly more complex than the simplest. A more general discussion of this issue can be found in [Dow11, sec.4]. Additionally, we can argue mathematically that since $KM \approx Km$, the simplest hypothesis is very close to the mixture. Therefore the debate is more philosophical than practical in this setting. (The bounds of Section 4 would depend on the choice of complexity at most logarithmically in $|X|$, with $KM$ providing the uniformly better bound.)

An alternative approach to formalising Occam's razor has been considered in MML [WB68]. However, in the deterministic setting the probability of the data given the hypothesis satisfies $P(D \mid H) = 1$. This means the two part code reduces to the code-length of the prior, $\log(1/P(H))$, and so the hypothesis with minimum message length depends only on the choice of prior, not the complexity of coding the data. The question then is how to choose the prior, on which MML gives no general guidance. Some discussion of Occam's razor from a Kolmogorov complexity viewpoint can be found in [Hut10, KLV97, RH11], while the relation between MML and Kolmogorov complexity is explored in [WD99].

Assumptions. We assumed finite $X$, $Y$, and deterministic $f$, which is the standard transductive learning setting. Generalisations to countable spaces may still be possible using complexity approaches, but non-computable real numbers prove more difficult. One can either argue by the strong Church-Turing thesis that non-computable reals do not exist, or approximate them arbitrarily well. Stochastic $f$ are interesting and we believe a complexity-based approach will still be effective, although the theorems and proofs may turn out to be somewhat different.

Acknowledgements. We thank Wen Shao and reviewers for valuable feedback on earlier drafts, and the Australian Research Council for support under grant DP0988049.

A Technical proofs

Lemma 16 (Entropy inequality). For $\theta \in [0, 1]$ and $\alpha \in [0, 1]$,
$$0 \le \alpha (1 - \theta) \log \frac{1}{1 - \theta} \qquad (10)$$
$$\le H(\theta) - \left[\theta + (1 - \theta)(1 - \alpha)\right] H\!\left(\frac{\theta}{\theta + (1 - \theta)(1 - \alpha)}\right) \qquad (11)$$
with equality only if $\theta \in \{0, 1\}$ or $\alpha = 0$.
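Before the proof, a quick numerical sanity check of (10)-(11) over a grid of $(\theta, \alpha)$ values; the script is ours and purely illustrative.

```python
import math

def H(t):
    """Binary entropy of Definition 13 (base-2 logarithm)."""
    return 0.0 if t in (0.0, 1.0) else -(t * math.log2(t) + (1 - t) * math.log2(1 - t))

def lhs(theta, alpha):
    return alpha * (1 - theta) * math.log2(1 / (1 - theta))

def rhs(theta, alpha):
    s = theta + (1 - theta) * (1 - alpha)
    return H(theta) - s * H(theta / s)

print(all(0.0 <= lhs(t / 100, a / 100) <= rhs(t / 100, a / 100) + 1e-12
          for t in range(1, 100) for a in range(101)))   # True
```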
Proof. First, (10) is trivial. To prove (11), note that for $\alpha = 0$ or $\theta \in \{0, 1\}$, equality is obvious. Now, fixing $\theta \in (0, 1)$ and computing,
$$\frac{\partial}{\partial \alpha} \left[ H(\theta) - \left[\theta + (1 - \theta)(1 - \alpha)\right] H\!\left(\frac{\theta}{\theta + (1 - \theta)(1 - \alpha)}\right) \right] = (1 - \theta) \log \frac{1 - \alpha(1 - \theta)}{(1 - \alpha)(1 - \theta)} \ge (1 - \theta) \log(1 - \theta)^{-1}.$$
Therefore integrating both sides over $\alpha$ gives
$$\alpha (1 - \theta) \log(1 - \theta)^{-1} \le H(\theta) - \left[\theta + (1 - \theta)(1 - \alpha)\right] H\!\left(\frac{\theta}{\theta + (1 - \theta)(1 - \alpha)}\right)$$
as required.

References

[CS07] J. Carroll and K. Seppi. No-free-lunch and Bayesian optimality. In IJCNN Workshop on Meta-Learning, 2007.
[CV05] R. Cilibrasi and P. Vitanyi. Clustering by compression. IEEE Transactions on Information Theory, 51(4):1523–1545, 2005.
[DEyM04] P. Derbeko, R. El-Yaniv, and R. Meir. Error bounds for transductive learning via compression and clustering. NIPS, 16, 2004.
[Dow11] D. Dowe. MML, hybrid Bayesian network graphical models, statistical consistency, invariance and uniqueness. In Handbook of Philosophy of Statistics, volume 7, pages 901–982. Elsevier, 2011.
[Gác83] P. Gács. On the relation between descriptional complexity and algorithmic probability. Theoretical Computer Science, 22(1-2):71–93, 1983.
[Gác08] P. Gács. Expanded and improved proof of the relation between description complexity and algorithmic probability. Unpublished, 2008.
[GCP05] C. Giraud-Carrier and F. Provost. Toward a justification of meta-learning: Is the no free lunch theorem a show-stopper? In ICML Workshop on Meta-Learning, pages 9–16, 2005.
[Grü07] P. Grünwald. The Minimum Description Length Principle, volume 1 of MIT Press Books. The MIT Press, 2007.
[HM07] M. Hutter and A. Muchnik. On semimeasures predicting Martin-Löf random sequences. Theoretical Computer Science, 382(3):247–261, 2007.
[Hut04] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2004.
[Hut10] M. Hutter. A complete theory of everything (will be subjective). Algorithms, 3(4):329–350, 2010.
[KLV97] W. Kirchherr, M. Li, and P. Vitanyi. The miraculous universal distribution. The Mathematical Intelligencer, 19(4):7–15, 1997.
[LV08] M. Li and P. Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications. Springer, 3rd edition, 2008.
[ML66] P. Martin-Löf. The definition of random sequences. Information and Control, 9(6):602–619, 1966.
[RH11] S. Rathmanner and M. Hutter. A philosophical treatise of universal induction. Entropy, 13(6):1076–1136, 2011.
[Sch94] C. Schaffer. A conservation law for generalization performance. In Proceedings of the Eleventh International Conference on Machine Learning, pages 259–265. Morgan Kaufmann, 1994.
[Sol64a] R. Solomonoff. A formal theory of inductive inference, Part I. Information and Control, 7(1):1–22, 1964.
[Sol64b] R. Solomonoff. A formal theory of inductive inference, Part II. Information and Control, 7(2):224–254, 1964.
[SVW01] C. Schumacher, M. Vose, and L. Whitley. The no free lunch and problem description length. In Lee Spector and Eric D. Goodman, editors, GECCO 2001: Proceedings of the Genetic and Evolutionary Computation Conference, pages 565–570, San Francisco, 2001. Morgan Kaufmann.
[Vap82] V. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
[Vap00] V. Vapnik. The Nature of Statistical Learning Theory. Springer, Berlin, 2nd edition, 2000.
[VNH+11] J. Veness, K. S. Ng, M. Hutter, W. Uther, and D. Silver. A Monte Carlo AIXI approximation. Journal of Artificial Intelligence Research, 40:95–142, 2011.
[WB68] C. Wallace and D. Boulton. An information measure for classification. The Computer Journal, 11(2):185–194, 1968.
[WD69] S. Watanabe and S. Donovan.
Knowing and Guessing: A Quantitative Study of Inference and Information. Wiley, New York, 1969.
[WD99] C. Wallace and D. Dowe. Minimum message length and Kolmogorov complexity. The Computer Journal, 42(4):270–283, 1999.
[WM97] D. Wolpert and W. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, April 1997.
[Wol01] D. Wolpert. The supervised learning no-free-lunch theorems. In Proceedings of the 6th Online World Conference on Soft Computing in Industrial Applications, pages 25–42, 2001.