(Non-)Equivalence of Universal Priors
arXiv [cs.IT]
Ian Wood, Peter Sunehag and Marcus Hutter
Research School of Computer Science, Australian National University and ETH Zürich
{ian.wood, peter.sunehag, marcus.hutter}@anu.edu.au
15 November 2011
Abstract
Ray Solomonoff invented the notion of universal induction featuring an aptly termed "universal" prior probability function over all possible computable environments [Sol64]. The essential property of this prior was its ability to dominate all other such priors. Later, Levin introduced another construction: a mixture of all possible priors, or "universal mixture" [ZL70]. These priors are well known to be equivalent up to multiplicative constants. Here, we seek to clarify further the relationships between these three characterisations of a universal prior (Solomonoff's, universal mixtures, and universally dominant priors). We see that the constructions of Solomonoff and Levin define an identical class of priors, while the class of universally dominant priors is strictly larger. We provide some characterisation of the discrepancy.
Keywords: algorithmic information theory; universal induction; universal prior.

1 Introduction
In the study of universal induction, we consider an abstraction of the world in the form of a binary string. Any sequence from a finite set of possibilities can be expressed in this way, and that is precisely what contemporary computers are capable of analysing. An "environment" provides a measure of probability to (possibly infinite) binary strings. Typically, the class M of enumerable semimeasures is considered. Given the equivalence between M and the set of monotone Turing machines (Lemma 6), this choice reflects the expectation that the environment can be computed by (or at least approximated by) a Turing machine.

Universal induction is an ideal Bayesian induction mechanism assigning probabilities to possible continuations of a binary string. In order to do this, a prior distribution, termed a universal prior, is defined on binary strings. This prior has the property that the Bayesian mechanism converges to the true (generating) environment for any environment µ in M, given sufficient evidence.

There are three popular ways of defining a universal prior in the literature: Solomonoff's prior [Sol64, ZL70, Hut05], as a universal mixture [ZL70, Hut05, Hut07], or as a universally dominant semimeasure [Hut05, Hut07]. Briefly, a universally dominant semimeasure is one that dominates every other semimeasure in M (Definition 9), a universal mixture is a mixture of all semimeasures in M with non-zero coefficients (Definition 8), and a Solomonoff prior assigns the probability that a (chosen) monotone universal Turing machine outputs a string given random input (Definition 7). These and other relevant concepts are defined in more detail in Section 2.

Solomonoff's and the universal mixture constructions have been known for many years and they are often used interchangeably in textbooks and lecture notes. Their equivalence has been shown in the sense that they dominate each other [ZL70, Hut05, LV08].
We extend this result in Section 3, showing that they in fact define exactly the same class of priors. Further, it is trivial to see that both constructions produce universally dominant semimeasures. The converse is, however, not true: universally dominant semimeasures are a larger class. We provide a simple example to demonstrate this in Section 4. These results are relatively undemanding technically; however, given their fundamental nature, that they have not to our knowledge been published to date, and their relevance to Ray Solomonoff's famous work on universal induction, we present them here.

The following diagram summarises these inclusion relations:

[Figure 1: Inclusion relations among Universally Dominant semimeasures, Universal Mixtures and Solomonoff Priors, with arrows labelled by the corresponding Lemma, Theorem and Corollary.]

2 Notation and Definitions

We represent the sets of finite and infinite binary strings as B* and B∞ respectively. ε denotes the empty string, xb the concatenation of strings x and b, and ℓ(x) the length of a string x. A cylinder set, the set of all infinite binary strings which start with some x ∈ B*, is denoted Γ_x.

A string x is said to be a prefix of a string y if y = xz for some string z; we write x ⊑ y, and x ⊏ y if x is a proper prefix of y (i.e. z ≠ ε). We denote the maximal prefix-free subset of a set of finite strings P by ⌊P⌋. It can be obtained by successively removing elements that have a proper prefix in P. The uniform measure of a set of strings is denoted |P| := ∑_{p∈⌊P⌋} 2^{−ℓ(p)}. This is the area of continuations of elements of P considered as binary fractions.

There have been several definitions of monotone Turing machines in the literature [LV08]; however, we choose that which is now widely accepted [Sol64, ZL70, Hut05, LV08] and has the useful and intuitive property of Lemma 6.

Definition 1.
A monotone Turing machine is a computer with binary (one-way) input and output tapes, a bidirectional binary work tape (with read/write heads as appropriate) and a finite state machine to determine its actions given input and work tape values. The input tape is read-only; the output tape is write-only.

The definitions of a universal Turing machine in the literature are somewhat varied or unclear. Monotone universal Turing machines are relevant here for defining the Solomonoff prior. In the algorithmic information theory literature, most authors are concerned with the explicit construction of a single reference universal machine [Hut05, LV08, Sol64, Tur36, ZL70]. A more general definition is left to a relatively vague statement along the lines of "a Turing machine that can emulate any other Turing machine". The definition below reflects the typical construction used and is often referred to as universal by adjunction [DH10, FSW06].
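Before turning to universal machines, the combinatorial notation above (the maximal prefix-free subset ⌊P⌋ and the uniform measure |P|) can be made concrete with a short sketch. The function names are our own, not part of the paper:

```python
# Maximal prefix-free subset ⌊P⌋ of a set of finite binary strings P,
# and its uniform measure |P| = sum over p in ⌊P⌋ of 2^{-len(p)}.

def prefix_free_subset(P):
    """Keep only strings with no proper prefix in P."""
    return {p for p in P
            if not any(q != p and p.startswith(q) for q in P)}

def uniform_measure(P):
    """|P| := sum of 2^{-len(p)} over the maximal prefix-free subset."""
    return sum(2.0 ** -len(p) for p in prefix_free_subset(P))

P = {"0", "01", "011", "10"}
print(sorted(prefix_free_subset(P)))  # ['0', '10'] -- '01' and '011' extend '0'
print(uniform_measure(P))             # 2^-1 + 2^-2 = 0.75
```

The measure 0.75 is exactly the probability that an infinite sequence of unbiased coin flips begins with some element of P, the interpretation used for λ_T below.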
Definition 2 (Monotone universal Turing machine). A monotone universal Turing machine is a monotone Turing machine U for which there exist:

1. an enumeration {T_i : i ∈ ℕ} of all monotone Turing machines
2. a computable uniquely decodable self-delimiting code I : ℕ → B*

such that the programs for U that produce output coincide with the set {I(i)p : i ∈ ℕ, p ∈ B*} of concatenations of I(i) and p, and

U(I(i)p) = T_i(p)  ∀ i ∈ ℕ, p ∈ B*

One can consider, for a given monotone Turing machine and string x, the probability that the machine produces output beginning with x given unbiased coin flip input. This approach was used by Solomonoff to construct a universal prior [Sol64]. To better understand the properties of such a function, we will need the concepts of enumerability and semimeasures:

Definition 3.
A function or number φ is said to be enumerable or lower semicomputable (these terms are synonymous) if it can be approximated from below (pointwise) by a monotone increasing set {φ_i : i ∈ ℕ} of finitely computable functions/numbers, all calculable by a single Turing machine. We write φ_i ↗ φ. Finitely computable functions/numbers can be computed in finite time by a Turing machine.

Definition 4. A semimeasure is a "defective" probability measure on the σ-algebra generated by the cylinder sets in B∞. We write µ(x) for x ∈ B* as shorthand for µ(Γ_x). A probability measure must satisfy µ(ε) = 1 and µ(x) = ∑_{b∈B} µ(xb). A semimeasure allows a probability "gap": µ(ε) ≤ 1 and µ(x) ≥ ∑_{b∈B} µ(xb). M denotes the set of all enumerable semimeasures.

The following definition explicates the relationship between monotone Turing machines and enumerable semimeasures.

Definition 5 (Solomonoff semimeasure). For each monotone Turing machine T we associate a semimeasure

λ_T(x) := ∑_{p ∈ ⌊p : T(p) = x*⌋} 2^{−ℓ(p)} = |T^{−1}(x*)|

where ⌊P⌋ indicates the maximal prefix-free subset of a set of finite strings P, T(p) = x* indicates that x is a prefix of (or equal to) T(p), and ℓ(p) is the length of p. If there are no such programs, we set λ_T(x) := 0. [See [LV08], Definition 4.5.4.]

Note that this is the probability that T outputs a string starting with x given unbiased coin flip input. To see this, consider the uniform measure given by λ(Γ_p) := 2^{−ℓ(p)}. This is the probability of obtaining p from unbiased coin flips. λ_T(x) is the uniform measure of the set of programs for T that produce output starting with x, i.e. the probability of obtaining one of those programs from unbiased coin flips. Note also that, since T is monotone, this set consists of a union of disjoint cylinder sets {Γ_p : p ∈ ⌊q : T(q) = x*⌋}.
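The enumerability of λ_T can be illustrated by computing its finite-horizon lower approximations: enumerate all programs of length at most n, keep the minimal (hence prefix-free) ones whose output starts with x, and sum 2^{−ℓ(p)}. The toy machine below simply echoes its input, so λ_T(x) = 2^{−ℓ(x)} exactly; the machine and all names are illustrative assumptions, not constructions from the paper:

```python
from itertools import product

def T(p):
    """Toy monotone machine: output equals the input read so far."""
    return p

def lam_lower(x, n):
    """Lower approximation of lambda_T(x) using programs of length <= n."""
    hits = []  # minimal programs found so far (prefix-free by construction)
    for length in range(n + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            # Skip p if a shorter program already accounts for its cylinder.
            if any(p.startswith(q) for q in hits):
                continue
            if T(p).startswith(x):
                hits.append(p)
    return sum(2.0 ** -len(p) for p in hits)

print(lam_lower("01", 5))  # 0.25 = 2^-2
```

As n grows, lam_lower(x, n) increases monotonically towards λ_T(x), which is exactly the sense in which λ_T is lower semicomputable.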
By dovetailing a search for such programs and a lower approximation of the uniform measure λ, we can see that λ_T is enumerable. See Definition 4.5.4 (p. 299) and Lemma 4.5.5 (p. 300) in [LV08].

An important lemma in this discussion establishes the equivalence between the set of all monotone Turing machines and the set M of all enumerable semimeasures. It is equivalent to Theorem 4.5.2 in [LV08] (p. 301) with a small correction: λ_T(ε) = 1 for any T by construction, but µ(ε) may not be 1, so this case must be excluded.

Lemma 6. A semimeasure µ is lower semicomputable if and only if there is a monotone Turing machine T such that µ = λ_T except on Γ_ε ≡ B∞ and µ(ε) is lower semicomputable.

We are now equipped to formally define the three formulations of a universal prior:
Definition 7 (Solomonoff prior). The Solomonoff prior for a given universal monotone Turing machine U is M := λ_U. The class of all Solomonoff priors we denote U_M.

Definition 8 (Universal mixture). A universal mixture is a mixture ξ with non-zero positive weights over an enumeration {ν_i : i ∈ ℕ, ν_i ∈ M} of all enumerable semimeasures M:

ξ = ∑_{i∈ℕ} w_i ν_i,  with ℝ ∋ w_i > 0 and ∑_{i∈ℕ} w_i ≤ 1,

where we require the weight function w(·) to be a lower semicomputable function. The mixture ξ is then itself an enumerable semimeasure, i.e. ξ ∈ M. The class of all universal mixtures we denote U_ξ.

Definition 9 (Universally dominant semimeasure). A universally dominant semimeasure is an enumerable semimeasure δ for which, for every µ ∈ M, there exists a real number c_µ > 0 satisfying:

δ(x) ≥ c_µ µ(x)  ∀ x ∈ B*

The class of all universally dominant semimeasures we denote U_δ.

Dominance implies absolute continuity: every enumerable semimeasure is absolutely continuous with respect to a universally dominant enumerable semimeasure. The converse (absolute continuity implies dominance) is however not true.

3 Equivalence of Solomonoff Priors and Universal Mixtures

We show here that every Solomonoff prior M ∈ U_M can be expressed as a universal mixture (i.e. M ∈ U_ξ) and vice versa. In other words, the class of Solomonoff priors and the class of universal mixtures are identical: U_M = U_ξ.

Previously, it was known [ZL70, Hut05, LV08] that a Solomonoff prior M and a universal mixture ξ are equivalent up to multiplicative constants:

M(x) ≤ c ξ(x)  ∀ x ∈ B*  and  ξ(x) ≤ c′ M(x)  ∀ x ∈ B*

Equality cannot, however, hold at x = ε, as M(ε) is always one for a Solomonoff prior, but ξ(ε) is never one for a universal mixture ξ (as there are µ ∈ M with µ(ε) < 1).

Lemma 10.
For any monotone universal Turing machine U, the associated Solomonoff prior M can be expressed as a universal mixture; i.e. there exist an enumeration {ν_i}_{i∈ℕ} of the set of enumerable semimeasures M and a computable function w(·) : ℕ → ℝ such that

M(x) = ∑_{i∈ℕ} w_i ν_i(x)  ∀ x ∈ B* \ {ε}

with ∑_{i∈ℕ} w_i ≤ 1 and w_i > 0 ∀ i ∈ ℕ. In other words, the class of Solomonoff priors is a subset of the class of universal mixtures: U_M ⊆ U_ξ.

Proof. We note that all programs that produce output from U are uniquely of the form q = I(i)p. This allows us to split the sum in (1) below.

M(x) = ∑_{⌊q : U(q) = x*⌋} 2^{−ℓ(q)} = ∑_{i∈ℕ} ∑_{⌊p : U(I(i)p) = x*⌋} 2^{−ℓ(I(i)p)}   (1)
     = ∑_{i∈ℕ} 2^{−ℓ(I(i))} ∑_{⌊p : T_i(p) = x*⌋} 2^{−ℓ(p)} = ∑_{i∈ℕ} 2^{−ℓ(I(i))} λ_{T_i}(x)

Clearly 2^{−ℓ(I(i))} > 0 for every i. Since I is a self-delimiting code it must be prefix-free, and so satisfies Kraft's inequality:

∑_{i∈ℕ} 2^{−ℓ(I(i))} ≤ 1

By Lemma 6, the λ_{T_i} cover every enumerable semimeasure if ε is excluded from their domain, which shows that ∑_{i∈ℕ} 2^{−ℓ(I(i))} λ_{T_i}(x) is a universal mixture. This completes the proof.

Corollary 11. [ZL70] The Solomonoff prior M for a universal monotone Turing machine U is universally dominant. Thus, the class of Solomonoff priors is a subset of the class of universally dominant lower semicomputable semimeasures: U_M ⊆ U_δ.

Proof. From Lemma 10 we have that for each ν ∈ M there exists j ∈ ℕ with ν = λ_{T_j}, and for all x ∈ B*:

M(x) = ∑_{i∈ℕ} 2^{−ℓ(I(i))} λ_{T_i}(x) ≥ 2^{−ℓ(I(j))} ν(x)

as required.

Lemma 12. Every universal mixture ξ is universally dominant. Thus, the class of universal mixtures is a subset of the class of universally dominant lower semicomputable semimeasures: U_ξ ⊆ U_δ.

Proof. This follows from a similar argument to that in Corollary 11.
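The sum-splitting step in the proof of Lemma 10 can be checked numerically on a tiny two-machine "universal" machine with U(I(i)p) = T_i(p) and the prefix code I(1) = "0", I(2) = "10". The machines, code, and function names below are toy assumptions for illustration only; M(x) computed by direct program enumeration coincides with ∑_i 2^{−ℓ(I(i))} λ_{T_i}(x):

```python
from itertools import product

T = {1: lambda p: p,                                             # echo machine
     2: lambda p: "".join("1" if b == "0" else "0" for b in p)}  # complement
I = {1: "0", 2: "10"}  # prefix-free self-delimiting code, Kraft sum 3/4

def U(q):
    """U(I(i)p) = T_i(p); no output for other inputs."""
    for i, code in I.items():
        if q.startswith(code):
            return T[i](q[len(code):])
    return None

def measure(machine, x, n):
    """Uniform measure of minimal programs (length <= n) with output >= x."""
    hits, total = [], 0.0
    for length in range(n + 1):
        for bits in product("01", repeat=length):
            p = "".join(bits)
            if any(p.startswith(q) for q in hits):
                continue  # cylinder already counted via a shorter program
            out = machine(p)
            if out is not None and out.startswith(x):
                hits.append(p)
                total += 2.0 ** -len(p)
    return total

x = "01"
M = measure(U, x, 8)
mix = sum(2.0 ** -len(I[i]) * measure(T[i], x, 6) for i in T)
print(M, mix)  # both 0.1875
```

Here the minimal U-programs for x = "01" are "001" (via T_1) and "1010" (via T_2), giving 2^{−3} + 2^{−4} = 0.1875 on both sides of equation (1).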
Lemma 13.
For every universal mixture ξ there exists a universal monotone Turing machine and associated Solomonoff prior M such that

ξ(x) = M(x)  ∀ x ∈ B* \ {ε}

In other words, the class of universal mixtures is a subset of the class of Solomonoff priors: U_ξ ⊆ U_M.

Proof. First note that by Lemma 6 we can find (by dovetailing possible repetitions of some indices) parallel enumerations {ν_i}_{i∈ℕ} of M and {T_i : λ_{T_i} = ν_i}_{i∈ℕ} of all monotone Turing machines, and a computable weight function w(·) with

ξ = ∑_{i∈ℕ} w_i ν_i,  ∑_{i∈ℕ} w_i ≤ 1

Since each w_i is lower semicomputable, there are finitely computable functions φ(i, t) ↗ w_i with

w_i = ∑_t |φ(i, t+1) − φ(i, t)|   (2)
    = ∑_j 2^{−k_ij}   (3)

with (i, j) ↦ k_ij computable.   (4)

The K-C theorem [Lev71, Sch73, Cha75, DH10] says that for any computable sequence of pairs {(k_ij ∈ ℕ, τ_ij ∈ B*)}_{i,j∈ℕ} with ∑ 2^{−k_ij} ≤ 1, there exists a prefix Turing machine P and strings {σ_ij ∈ B*} such that

ℓ(σ_ij) = k_ij,  P(σ_ij) = τ_ij   (5)

Choosing distinct τ_ij, the existence of the prefix machine P ensures that {σ_ij} is prefix-free. We now define a monotone Turing machine U. For strings of the form σ_ij p for some i, j:

U(σ_ij p) := T_i(p)   (6)

For strings not of this form, U produces no output. U inherits monotonicity from the T_i, and since {T_i}_{i∈ℕ} enumerates all monotone Turing machines, U is universal.

The Solomonoff prior associated with U is then:

λ_U(x) = |U^{−1}(x*)|   (7)
       = ∑_{i,j} 2^{−ℓ(σ_ij)} |T_i^{−1}(x*)|   (8)
       = ∑_i (∑_j 2^{−k_ij}) λ_{T_i}(x)   (9)
       = ∑_i w_i ν_i(x)   (10)
       = ξ(x)   (11)

The main theorem for this section is now trivial:

Theorem 14.
The classes U_M of Solomonoff priors and U_ξ of universal mixtures are exactly equivalent. In other words, the two constructions define exactly the same set of priors: U_M = U_ξ.

Proof. Follows directly from Lemma 10 and Lemma 13.
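The key preparatory step in the proof of Lemma 13, rewriting each lower-semicomputable weight as a sum of powers of two (equations (2) and (3)), can be sketched as a greedy procedure that peels dyadic terms off an increasing approximation φ(t) ↗ w. This is a hypothetical illustration; the function and variable names are our own, and the finite precision cut-off is an artifact of the sketch:

```python
def dyadic_terms(phi, steps, precision=30):
    """Exponents k with sum of 2^-k <= lim phi, covering each increment."""
    ks, covered = [], 0.0
    for t in range(steps):
        target = phi(t)
        # Greedily add the largest power of two still fitting under target,
        # so the running total never exceeds any approximation of w.
        for k in range(1, precision + 1):
            while covered + 2.0 ** -k <= target:
                ks.append(k)
                covered += 2.0 ** -k
    return ks

def phi(t):
    """Increasing approximation of the weight w = 0.8125."""
    return 0.8125 * (1 - 2.0 ** -(t + 1))

ks = dyadic_terms(phi, steps=40)
print(sum(2.0 ** -k for k in ks))  # approx. 0.8125 (within 2^-30)
```

Because the partial sums never overshoot any φ(t), the resulting exponents satisfy the Kraft-style condition ∑_j 2^{−k_j} ≤ w, which is what allows the K-C theorem to realise them as lengths of a prefix-free code.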
4 Universally Dominant Semimeasures Form a Strictly Larger Class

In this section, we see that a universal mixture must have a "gap" in the semimeasure inequality of at least c · 2^{−K(ℓ(x))} M(x) for some constant c > 0 and every x, and that there are universally dominant enumerable semimeasures that fail this requirement. This shows that not all universally dominant enumerable semimeasures are universal mixtures.

Lemma 15.
For every Solomonoff prior M and associated universal monotone Turing machine U, there exists a real constant c > 0 such that

M(x) − M(x0) − M(x1) ≥ c · 2^{−K(ℓ(x))} M(x)  ∀ x ∈ B*

where the Kolmogorov complexity K(n) of an integer n is the length of the shortest prefix code for n.

Proof. First, note that M(x) − M(x0) − M(x1) measures the set of programs U^{−1}(x) for which U outputs x and no more. Consider the set

P := {q l′ p | p ∈ B*, U(p) ⊒ x}

where l′ is a shortest prefix code for ℓ(x) and q is a program such that U(q l′ p) executes U(p) until ℓ(x) bits are output, then stops.

Now, for each r = q l′ p ∈ P we have U(r) = x, since U(p) ⊒ x and q executes U(p) until ℓ(x) bits are output. Thus P ⊆ U^{−1}(x) and

|P| ≤ |U^{−1}(x)|   (12)

Also P = q l′ U^{−1}(x*) := {q l′ p | p ∈ U^{−1}(x*)}, and so

|P| = 2^{−ℓ(q l′)} |U^{−1}(x*)|   (13)

Combining (12) and (13), and noting that M(x) − M(x0) − M(x1) = |U^{−1}(x)| and M(x) = |U^{−1}(x*)|, we obtain

M(x) − M(x0) − M(x1) = |U^{−1}(x)| ≥ |P| = 2^{−ℓ(q l′)} |U^{−1}(x*)| = 2^{−ℓ(q)} 2^{−K(ℓ(x))} M(x)

Setting c := 2^{−ℓ(q)} proves the result.

Theorem 16.
Not all universally dominant enumerable semimeasures are universal mixtures: U_ξ ⊂ U_δ.

Proof. Take some universally dominant semimeasure δ, then define δ′(ε) := 1, δ′(0) = δ′(1) := 1/2, and δ′(bx) := δ(bx)/2 for b ∈ B, x ∈ B* \ {ε}. δ′ is clearly a universally dominant enumerable semimeasure with δ′(0) + δ′(1) = δ′(ε), and by Lemma 15 it is not a universal mixture: were δ′ a universal mixture, Lemma 13 would give a Solomonoff prior M agreeing with δ′ except possibly at ε and with M(ε) = 1 = δ′(ε), so M(ε) − M(0) − M(1) = 0, contradicting the strictly positive gap of Lemma 15.

5 Conclusion

One of Solomonoff's more famous contributions is the invention of a theoretically ideal universal induction mechanism. The universal prior used in this mechanism can be defined/constructed in several ways. We clarify the relationships between three different definitions of universal priors, namely universal mixtures, Solomonoff priors and universally dominant semimeasures. We show that the class of universal mixtures and the class of Solomonoff priors are exactly the same, while the class of universally dominant lower semicomputable semimeasures is a strictly larger set.

We have identified some aspects of the discrepancy between Solomonoff priors/universal mixtures and universally dominant lower semicomputable semimeasures; however, a clearer understanding and characterisation would be of interest. Since universal dominance is all that is needed to prove convergence for universal induction [Hut05, Sol78], it is interesting to ask whether the extra properties of the smaller class of Solomonoff priors have any positive consequences for universal induction.

Acknowledgements
We would like to acknowledge the contribution of an anonymous reviewer to a more elegant presentation of the proof of Lemma 13. This work was supported by ARC grant DP0988049.
References

[Cha75] G. J. Chaitin. A theory of program size formally identical to information theory. Journal of the ACM, 22(3):329–340, 1975.
[DH10] R. Downey and D. R. Hirschfeldt. Algorithmic Randomness and Complexity. Springer, Berlin, 2010.
[FSW06] S. Figueira, F. Stephan, and G. Wu. Randomness and universal machines. Journal of Complexity, 22(6):738–751, 2006.
[Hut05] M. Hutter. Universal Artificial Intelligence: Sequential Decisions based on Algorithmic Probability. Springer, Berlin, 2005.
[Hut07] M. Hutter. On universal prediction and Bayesian confirmation. Theoretical Computer Science, 384(1):33–48, 2007.
[Lev71] L. A. Levin. Some Theorems on the Algorithmic Approach to Probability Theory and Information Theory. PhD thesis, Moscow University, Moscow, 1971.
[LV08] M. Li and P. M. B. Vitányi. An Introduction to Kolmogorov Complexity and its Applications. Springer, Berlin, 3rd edition, 2008.
[Sch73] C. P. Schnorr. Process complexity and effective random tests. Journal of Computer and System Sciences, 7(4):376–388, 1973.
[Sol64] R. J. Solomonoff. A formal theory of inductive inference: Parts 1 and 2. Information and Control, 7:1–22 and 224–254, 1964.
[Sol78] R. J. Solomonoff. Complexity-based induction systems: Comparisons and convergence theorems. IEEE Transactions on Information Theory, IT-24:422–432, 1978.
[Tur36] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. Proc. London Mathematical Society, 2(42):230–265, 1936.
[ZL70] A. K. Zvonkin and L. A. Levin. The complexity of finite objects and the development of the concepts of information and randomness by means of the theory of algorithms. Russian Mathematical Surveys, 25(6):83–124, 1970.