(Non-)Equivalence of Universal Priors
arXiv [cs.IT]
Ian Wood, Peter Sunehag and Marcus Hutter
Research School of Computer Science, Australian National University and ETH Zürich
{ian.wood, peter.sunehag, marcus.hutter}@anu.edu.au
15 November 2011
Abstract
Ray Solomonoff invented the notion of universal induction featuring an aptly termed "universal" prior probability function over all possible computable environments [Sol64]. The essential property of this prior was its ability to dominate all other such priors. Later, Levin introduced another construction: a mixture of all possible priors, or "universal mixture" [ZL70]. These priors are well known to be equivalent up to multiplicative constants. Here, we seek to clarify further the relationships between these three characterisations of a universal prior (Solomonoff's, universal mixtures, and universally dominant priors). We see that the constructions of Solomonoff and Levin define an identical class of priors, while the class of universally dominant priors is strictly larger. We provide some characterisation of the discrepancy.
Keywords: algorithmic information theory; universal induction; universal prior.

1 Introduction
In the study of universal induction, we consider an abstraction of the world in the form of a binary string. Any sequence from a finite set of possibilities can be expressed in this way, and that is precisely what contemporary computers are capable of analysing. An "environment" provides a measure of probability to (possibly infinite) binary strings. Typically, the class M of enumerable semimeasures is considered. Given the equivalence between M and the set of monotone Turing machines (Lemma 6), this choice reflects the expectation that the environment can be computed by (or at least approximated by) a Turing machine.

Universal induction is an ideal Bayesian induction mechanism assigning probabilities to possible continuations of a binary string. In order to do this, a prior distribution, termed a universal prior, is defined on binary strings. This prior has the property that the Bayesian mechanism converges to the true (generating) environment for any environment µ in M, given sufficient evidence.

There are three popular ways of defining a universal prior in the literature: Solomonoff's prior [Sol64, ZL70, Hut05], as a universal mixture [ZL70, Hut05, Hut07], or as a universally dominant semimeasure [Hut05, Hut07]. Briefly, a universally dominant semimeasure is one that dominates every other semimeasure in M (Definition 9), a universal mixture is a mixture of all semimeasures in M with non-zero coefficients (Definition 8), and a Solomonoff prior assigns the probability that a (chosen) monotone universal Turing machine outputs a string given random input (Definition 7). These and other relevant concepts are defined in more detail in Section 2.

Solomonoff's and the universal mixture constructions have been known for many years and they are often used interchangeably in textbooks and lecture notes. Their equivalence has been shown in the sense that they dominate each other [ZL70, Hut05, LV08].
We extend this result in Section 3, showing that they in fact define exactly the same class of priors. Further, it is trivial to see that both constructions produce universally dominant semimeasures. The converse is, however, not true: universally dominant semimeasures are a larger class. We provide a simple example to demonstrate this in Section 4. These results are relatively undemanding technically; however, given their fundamental nature, that they have not to our knowledge been published to date, and their relevance to Ray Solomonoff's famous work on universal induction, we present them here.

The following diagram summarises these inclusion relations:

[Figure 1: Inclusion relations among Universally Dominant semimeasures, Universal Mixtures and Solomonoff Priors, with arrows labelled by the corresponding Lemma, Theorem and Corollary.]

2 Notation and Definitions

We represent the sets of finite and infinite binary strings as B* and B∞ respectively. ε denotes the empty string, xb the concatenation of strings x and b, and ℓ(x) the length of a string x. A cylinder set, the set of all infinite binary strings which start with some x ∈ B*, is denoted Γ_x.

A string x is said to be a prefix of a string y if y = xz for some string z; we write x ⊑ y, and x ⊏ y if x is a proper prefix of y (i.e. z ≠ ε). We denote the maximal prefix-free subset of a set of finite strings P by ⌊P⌋. It can be obtained by successively removing elements that have a proper prefix in P. The uniform measure of a set of strings is denoted |P| := ∑_{p∈⌊P⌋} 2^{−ℓ(p)}. This is the area of continuations of elements of P considered as binary fractions.

There have been several definitions of monotone Turing machines in the literature [LV08]; however, we choose that which is now widely accepted [Sol64, ZL70, Hut05, LV08] and has the useful and intuitive property of Lemma 6.

Definition 1.
A monotone Turing machine is a computer with binary (one-way) input and output tapes, a bidirectional binary work tape (with read/write heads as appropriate) and a finite state machine to determine its actions given input and work tape values. The input tape is read-only; the output tape is write-only.

The definitions of a universal Turing machine in the literature are somewhat varied or unclear. Monotone universal Turing machines are relevant here for defining the Solomonoff prior. In the algorithmic information theory literature, most authors are concerned with the explicit construction of a single reference universal machine [Hut05, LV08, Sol64, Tur36, ZL70]. A more general definition is left to a relatively vague statement along the lines of "a Turing machine that can emulate any other Turing machine". The definition below reflects the typical construction used and is often referred to as universal by adjunction [DH10, FSW06].
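Before turning to universal machines, the combinatorial notation above (the maximal prefix-free subset ⌊P⌋ and the uniform measure |P|) can be made concrete with a short sketch. The function names are our own, not part of the paper:

```python
# Maximal prefix-free subset ⌊P⌋ of a set of finite binary strings P,
# and its uniform measure |P| = sum over p in ⌊P⌋ of 2^{-len(p)}.

def prefix_free_subset(P):
    """Keep only strings with no proper prefix in P."""
    return {p for p in P
            if not any(q != p and p.startswith(q) for q in P)}

def uniform_measure(P):
    """|P| := sum of 2^{-len(p)} over the maximal prefix-free subset."""
    return sum(2.0 ** -len(p) for p in prefix_free_subset(P))

P = {"0", "01", "011", "10"}
print(sorted(prefix_free_subset(P)))  # ['0', '10'] -- '01' and '011' extend '0'
print(uniform_measure(P))             # 2^-1 + 2^-2 = 0.75
```

The measure 0.75 is exactly the probability that an infinite sequence of unbiased coin flips begins with some element of P, the interpretation used for λ_T below.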
Definition 2 (Monotone universal Turing machine). A monotone universal Turing machine is a monotone Turing machine U for which there exist:

1. an enumeration {T_i : i ∈ ℕ} of all monotone Turing machines
2. a computable uniquely decodable self-delimiting code I : ℕ → B*

such that the programs for U that produce output coincide with the set {I(i)p : i ∈ ℕ, p ∈ B*} of concatenations of I(i) and p, and

U(I(i)p) = T_i(p)  ∀ i ∈ ℕ, p ∈ B*

One can consider, for a given monotone Turing machine and string x, the probability that the machine produces output beginning with x given unbiased coin flip input. This approach was used by Solomonoff to construct a universal prior [Sol64]. To better understand the properties of such a function, we will need the concepts of enumerability and semimeasures:

Definition 3.
A function or number φ is said to be enumerable or lower semicomputable (these terms are synonymous) if it can be approximated from below (pointwise) by a monotone increasing set {φ_i : i ∈ ℕ} of finitely computable functions/numbers, all calculable by a single Turing machine. We write φ_i ↗ φ. Finitely computable functions/numbers can be computed in finite time by a Turing machine.

Definition 4. A semimeasure is a "defective" probability measure on the σ-algebra generated by the cylinder sets in B∞. We write µ(x) for x ∈ B* as shorthand for µ(Γ_x). A probability measure must satisfy µ(ε) = 1 and µ(x) = ∑_{b∈B} µ(xb). A semimeasure allows a probability "gap": µ(ε) ≤ 1 and µ(x) ≥ ∑_{b∈B} µ(xb). M denotes the set of all enumerable semimeasures.

The following definition explicates the relationship between monotone Turing machines and enumerable semimeasures.

Definition 5 (Solomonoff semimeasure). For each monotone Turing machine T we associate a semimeasure

λ_T(x) := ∑_{p ∈ ⌊p : T(p) = x*⌋} 2^{−ℓ(p)} = |T^{−1}(x*)|

where ⌊P⌋ indicates the maximal prefix-free subset of a set of finite strings P, T(p) = x* indicates that x is a prefix of (or equal to) T(p), and ℓ(p) is the length of p. If there are no such programs, we set λ_T(x) := 0. [See [LV08], Definition 4.5.4.]

Note that this is the probability that T outputs a string starting with x given unbiased coin flip input. To see this, consider the uniform measure given by λ(Γ_p) := 2^{−ℓ(p)}. This is the probability of obtaining p from unbiased coin flips. λ_T(x) is the uniform measure of the set of programs for T that produce output starting with x, i.e. the probability of obtaining one of those programs from unbiased coin flips. Note also that, since T is monotone, this set consists of a union of disjoint cylinder sets {Γ_p : p ∈ ⌊q : T(q) = x*⌋}.
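The enumerability of λ_T can be illustrated by computing its finite-horizon lower approximations: enumerate all programs of length at most n, keep the minimal (hence prefix-free) ones whose output starts with x, and sum 2^{−ℓ(p)}. The toy machine below simply echoes its input, so λ_T(x) = 2^{−ℓ(x)} exactly; the machine and all names are illustrative assumptions, not constructions from the paper:

```python
from itertools import product

def T(p):
    """Toy monotone machine: output equals the input read so far."""
    return p

def lam_lower(x, n):
    """Lower approximation of lambda_T(x) using programs of length <= n."""
    hits = []  # minimal programs found so far (prefix-free by construction)
    for length in range(n + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            # Skip p if a shorter program already accounts for its cylinder.
            if any(p.startswith(q) for q in hits):
                continue
            if T(p).startswith(x):
                hits.append(p)
    return sum(2.0 ** -len(p) for p in hits)

print(lam_lower("01", 5))  # 0.25 = 2^-2
```

As n grows, lam_lower(x, n) increases monotonically towards λ_T(x), which is exactly the sense in which λ_T is lower semicomputable.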
By dovetailing a search for such programs and a lower approximation of the uniform measure λ, we can see that λ_T is enumerable. See Definition 4.5.4 (p. 299) and Lemma 4.5.5 (p. 300) in [LV08].

An important lemma in this discussion establishes the equivalence between the set of all monotone Turing machines and the set M of all enumerable semimeasures. It is equivalent to Theorem 4.5.2 in [LV08] (p. 301) with a small correction: λ_T(ε) = 1 for any T by construction, but µ(ε) may not be 1, so this case must be excluded.

Lemma 6. A semimeasure µ is lower semicomputable if and only if there is a monotone Turing machine T such that µ = λ_T except on Γ_ε ≡ B∞ and µ(ε) is lower semicomputable.

We are now equipped to formally define the three formulations of a universal prior:
Definition 7 (Solomonoff prior). The Solomonoff prior for a given universal monotone Turing machine U is M := λ_U. The class of all Solomonoff priors we denote U_M.

Definition 8 (Universal mixture). A universal mixture is a mixture ξ with non-zero positive weights over an enumeration {ν_i : i ∈ ℕ, ν_i ∈ M} of all enumerable semimeasures M:

ξ = ∑_{i∈ℕ} w_i ν_i,  with ℝ ∋ w_i > 0 and ∑_{i∈ℕ} w_i ≤ 1,

where we require the weight function w(·) to be a lower semicomputable function. The mixture ξ is then itself an enumerable semimeasure, i.e. ξ ∈ M. The class of all universal mixtures we denote U_ξ.

Definition 9 (Universally dominant semimeasure). A universally dominant semimeasure is an enumerable semimeasure δ for which, for every µ ∈ M, there exists a real number c_µ > 0 satisfying:

δ(x) ≥ c_µ µ(x)  ∀ x ∈ B*

The class of all universally dominant semimeasures we denote U_δ.

Dominance implies absolute continuity: every enumerable semimeasure is absolutely continuous with respect to a universally dominant enumerable semimeasure. The converse (absolute continuity implies dominance) is however not true.

3 Equivalence of Solomonoff Priors and Universal Mixtures

We show here that every Solomonoff prior M ∈ U_M can be expressed as a universal mixture (i.e. M ∈ U_ξ) and vice versa. In other words, the class of Solomonoff priors and the class of universal mixtures are identical: U_M = U_ξ.

Previously, it was known [ZL70, Hut05, LV08] that a Solomonoff prior M and a universal mixture ξ are equivalent up to multiplicative constants:

M(x) ≤ c ξ(x)  ∀ x ∈ B*  and  ξ(x) ≤ c′ M(x)  ∀ x ∈ B*

Equality cannot, however, hold at x = ε, as M(ε) is always one for a Solomonoff prior, but ξ(ε) is never one for a universal mixture ξ (as there are µ ∈ M with µ(ε) < 1).

Lemma 10.
For any monotone universal Turing machine U, the associated Solomonoff prior M can be expressed as a universal mixture; i.e. there exist an enumeration {ν_i}_{i∈ℕ} of the set of enumerable semimeasures M and a computable function w(·) : ℕ → ℝ such that

M(x) = ∑_{i∈ℕ} w_i ν_i(x)  ∀ x ∈ B* \ {ε}

with ∑_{i∈ℕ} w_i ≤ 1 and w_i > 0 ∀ i ∈ ℕ. In other words, the class of Solomonoff priors is a subset of the class of universal mixtures: U_M ⊆ U_ξ.

Proof. We note that all programs that produce output from U are uniquely of the form q = I(i)p. This allows us to split the sum in (1) below.

M(x) = ∑_{⌊q : U(q) = x*⌋} 2^{−ℓ(q)} = ∑_{i∈ℕ} ∑_{⌊p : U(I(i)p) = x*⌋} 2^{−ℓ(I(i)p)}   (1)
     = ∑_{i∈ℕ} 2^{−ℓ(I(i))} ∑_{⌊p : T_i(p) = x*⌋} 2^{−ℓ(p)} = ∑_{i∈ℕ} 2^{−ℓ(I(i))} λ_{T_i}(x)

Clearly 2^{−ℓ(I(i))} > 0 for every i. Since I is a self-delimiting code it must be prefix-free, and so satisfies Kraft's inequality:

∑_{i∈ℕ} 2^{−ℓ(I(i))} ≤ 1

By Lemma 6, the λ_{T_i} cover every enumerable semimeasure if ε is excluded from their domain, which shows that ∑_{i∈ℕ} 2^{−ℓ(I(i))} λ_{T_i}(x) is a universal mixture. This completes the proof.

Corollary 11. [ZL70] The Solomonoff prior M for a universal monotone Turing machine U is universally dominant. Thus, the class of Solomonoff priors is a subset of the class of universally dominant lower semicomputable semimeasures: U_M ⊆ U_δ.

Proof. From Lemma 10 we have that for each ν ∈ M there exists j ∈ ℕ with ν = λ_{T_j}, and for all x ∈ B*:

M(x) = ∑_{i∈ℕ} 2^{−ℓ(I(i))} λ_{T_i}(x) ≥ 2^{−ℓ(I(j))} ν(x)

as required.

Lemma 12. Every universal mixture ξ is universally dominant. Thus, the class of universal mixtures is a subset of the class of universally dominant lower semicomputable semimeasures: U_ξ ⊆ U_δ.

Proof. This follows from a similar argument to that in Corollary 11.
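The sum-splitting step in the proof of Lemma 10 can be checked numerically on a tiny two-machine "universal" machine with U(I(i)p) = T_i(p) and the prefix code I(1) = "0", I(2) = "10". The machines, code, and function names below are toy assumptions for illustration only; M(x) computed by direct program enumeration coincides with ∑_i 2^{−ℓ(I(i))} λ_{T_i}(x):

```python
from itertools import product

T = {1: lambda p: p,                                             # echo machine
     2: lambda p: "".join("1" if b == "0" else "0" for b in p)}  # complement
I = {1: "0", 2: "10"}  # prefix-free self-delimiting code, Kraft sum 3/4

def U(q):
    """U(I(i)p) = T_i(p); no output for other inputs."""
    for i, code in I.items():
        if q.startswith(code):
            return T[i](q[len(code):])
    return None

def measure(machine, x, n):
    """Uniform measure of minimal programs (length <= n) with output >= x."""
    hits, total = [], 0.0
    for length in range(n + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            if any(p.startswith(q) for q in hits):
                continue  # cylinder already counted via a shorter program
            out = machine(p)
            if out is not None and out.startswith(x):
                hits.append(p)
                total += 2.0 ** -len(p)
    return total

x = "01"
M = measure(U, x, 8)
mix = sum(2.0 ** -len(I[i]) * measure(T[i], x, 6) for i in T)
print(M, mix)  # both 0.1875
```

Here the minimal U-programs for x = "01" are "001" (via T_1) and "1010" (via T_2), giving 2^{−3} + 2^{−4} = 0.1875 on both sides of equation (1).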
Lemma 13.
For every universal mixture ξ there exists a universal monotone Turing machine and associated Solomonoff prior M such that

ξ(x) = M(x)  ∀ x ∈ B* \ {ε}

In other words, the class of universal mixtures is a subset of the class of Solomonoff priors: U_ξ ⊆ U_M.

Proof. First note that by Lemma 6 we can find (by dovetailing possible repetitions of some indices) parallel enumerations {ν_i}_{i∈ℕ} of M and {T_i : λ_{T_i} = ν_i}_{i∈ℕ} of all monotone Turing machines, and a computable weight function w(·) with

ξ = ∑_{i∈ℕ} w_i ν_i,  ∑_{i∈ℕ} w_i ≤ 1

Since each w_i is lower semicomputable, there are finitely computable functions φ(i, t) ↗ w_i with

w_i = ∑_t |φ(i, t+1) − φ(i, t)|   (2)
    = ∑_j 2^{−k_ij}   (3)

with (i, j) ↦ k_ij computable.   (4)

The K-C theorem [Lev71, Sch73, Cha75, DH10] says that for any computable sequence of pairs {(k_ij ∈ ℕ, τ_ij ∈ B*)}_{i,j∈ℕ} with ∑ 2^{−k_ij} ≤ 1, there exists a prefix Turing machine P and strings {σ_ij ∈ B*} such that

ℓ(σ_ij) = k_ij,  P(σ_ij) = τ_ij   (5)

Choosing distinct τ_ij, the existence of the prefix machine P ensures that {σ_ij} is prefix-free. We now define a monotone Turing machine U. For strings of the form σ_ij p for some i, j:

U(σ_ij p) := T_i(p)   (6)

For strings not of this form, U produces no output. U inherits monotonicity from the T_i, and since {T_i}_{i∈ℕ} enumerates all monotone Turing machines, U is universal.

The Solomonoff prior associated with U is then:

λ_U(x) = |U^{−1}(x*)|   (7)
       = ∑_{i,j} 2^{−ℓ(σ_ij)} |T_i^{−1}(x*)|   (8)
       = ∑_i (∑_j 2^{−k_ij}) λ_{T_i}(x)   (9)
       = ∑_i w_i ν_i(x)   (10)
       = ξ(x)   (11)

The main theorem for this section is now trivial:

Theorem 14.
The classes U_M of Solomonoff priors and U_ξ of universal mixtures are exactly equivalent. In other words, the two constructions define exactly the same set of priors: U_M = U_ξ.

Proof. Follows directly from Lemma 10 and Lemma 13.
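The key preparatory step in the proof of Lemma 13, rewriting each lower-semicomputable weight as a sum of powers of two (equations (2) and (3)), can be sketched as a greedy procedure that peels dyadic terms off an increasing approximation φ(t) ↗ w. This is a hypothetical illustration; the function and variable names are our own, and the finite precision cut-off is an artifact of the sketch:

```python
def dyadic_terms(phi, steps, precision=30):
    """Exponents k with sum of 2^-k <= lim phi, covering each increment."""
    ks, covered = [], 0.0
    for t in range(steps):
        target = phi(t)
        # Greedily add the largest power of two still fitting under target,
        # so the running total never exceeds any approximation of w.
        for k in range(1, precision + 1):
            while covered + 2.0 ** -k <= target:
                ks.append(k)
                covered += 2.0 ** -k
    return ks

def phi(t):
    """Increasing approximation of the weight w = 0.8125."""
    return 0.8125 * (1 - 2.0 ** -(t + 1))

ks = dyadic_terms(phi, steps=40)
print(sum(2.0 ** -k for k in ks))  # approx. 0.8125 (within 2^-30)
```

Because the partial sums never overshoot any φ(t), the resulting exponents satisfy the Kraft-style condition ∑_j 2^{−k_j} ≤ w, which is what allows the K-C theorem to realise them as lengths of a prefix-free code.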
4 Universally Dominant Semimeasures Form a Strictly Larger Class

In this section, we see that a universal mixture must have a "gap" in the semimeasure inequality of at least c · 2^{−K(ℓ(x))} M(x) for some constant c > 0 and every x, and that there are universally dominant enumerable semimeasures that fail this requirement. This shows that not all universally dominant enumerable semimeasures are universal mixtures.

Lemma 15.
For every Solomonoff prior M and associated universal monotone Turing machine U, there exists a real constant c > 0 such that

M(x) − M(x0) − M(x1) ≥ c · 2^{−K(ℓ(x))} M(x)  ∀ x ∈ B*

where the Kolmogorov complexity K(n) of an integer n is the length of the shortest prefix code for n.

Proof. First, note that M(x) − M(x0) − M(x1) measures the set of programs U^{−1}(x) for which U outputs x and no more. Consider the set

P := {q l′ p | p ∈ B*, U(p) ⊒ x}

where l′ is a shortest prefix code for ℓ(x) and q is a program such that U(q l′ p) executes U(p) until ℓ(x) bits are output, then stops.

Now, for each r = q l′ p ∈ P we have U(r) = x, since U(p) ⊒ x and q executes U(p) until ℓ(x) bits are output. Thus P ⊆ U^{−1}(x) and

|P| ≤ |U^{−1}(x)|   (12)

Also P = q l′ U^{−1}(x*) := {q l′ p | p ∈ U^{−1}(x*)}, and so

|P| = 2^{−ℓ(q l′)} |U^{−1}(x*)|   (13)

Combining (12) and (13), and noting that M(x) − M(x0) − M(x1) = |U^{−1}(x)| and M(x) = |U^{−1}(x*)|, we obtain

M(x) − M(x0) − M(x1) = |U^{−1}(x)| ≥ |P| = 2^{−ℓ(q l′)} |U^{−1}(x*)| = 2^{−ℓ(q)} 2^{−K(ℓ(x))} M(x)

Setting c := 2^{−ℓ(q)} proves the result.

Theorem 16.
Not all universally dominant enumerable semimeasures are universal mixtures: U_ξ ⊂ U_δ.

Proof. Take some universally dominant semimeasure δ, then define δ′(ε) := 1, δ′(0) = δ′(1) := 1/2, and δ′(bx) := δ(bx)/2 for b ∈ B, x ∈ B* \ {ε}. δ′ is clearly a universally dominant enumerable semimeasure with δ′(0) + δ′(1) = δ′(ε), and by Lemma 15 it is not a universal mixture: were δ′ a universal mixture, Lemma 13 would give a Solomonoff prior M agreeing with δ′ except possibly at ε and with M(ε) = 1 = δ′(ε), so M(ε) − M(0) − M(1) = 0, contradicting the strictly positive gap of Lemma 15.

5 Conclusion

One of Solomonoff's more famous contributions is the invention of a theoretically ideal universal induction mechanism. The universal prior used in this mechanism can be defined/constructed in several ways. We clarify the relationships between three different definitions of universal priors, namely universal mixtures, Solomonoff priors and universally dominant semimeasures. We show that the class of universal mixtures and the class of Solomonoff priors are exactly the same, while the class of universally dominant lower semicomputable semimeasures is a strictly larger set.

We have identified some aspects of the discrepancy between Solomonoff priors/universal mixtures and universally dominant lower semicomputable semimeasures; however, a clearer understanding and characterisation would be of interest. Since universal dominance is all that is needed to prove convergence for universal induction [Hut05, Sol78], it is interesting to ask whether the extra properties of the smaller class of Solomonoff priors have any positive consequences for universal induction.

Acknowledgements
We would like to acknowledge the contribution of an anonymous reviewer to a more elegant presentation of the proof of Lemma 13. This work was supported by ARC grant DP0988049.
References

[Cha75] G. J. Chaitin. A theory of program size formally identical to information theory. Journal of the ACM, 22(3):329–340, 1975.
[DH10] R. Downey and D. R. Hirschfeldt. Algorithmic Randomness and Complexity. Springer, Berlin, 2010.
[FSW06] S. Figueira, F. Stephan, and G. Wu. Randomness and universal machines. Journal of Complexity, 22(6):738–751, 2006.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
[Hut07] M. Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):33–48, 2007.
[Lev71] L. A. Levin. Some Theorems on the Algorithmic Approach to Probability Theory and Information Theory. PhD thesis, Moscow University, Moscow, 1971.
[LV08] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, Berlin, 3rd edition, 2008.
[Sch73] C. P. Schnorr. Process complexity and effective random tests. Journal of Computer and System Sciences, 7(4):376–388, 1973.
[Sol64] R. J. Solomonoff. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1–22 and 224–254, 1964.
[Sol78] R. J. Solomonoff. Complexity-based induction systems: Comparisons and convergence theorems. IEEE Transactions on Information Theory, IT-24:422–432, 1978.
[Tur36] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proc. London Mathematical Society, 2(42):230–265, 1936.
[ZL70] A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Mathematical Surveys, 25(6):83–124, 1970.