arXiv [math.ST]

LEARNING THE ERGODIC DECOMPOSITION
NABIL I. AL-NAJJAR AND ERAN SHMAYA
Abstract.
A Bayesian agent learns about the structure of a stationary process from observing past outcomes. We prove that his predictions about the near future become approximately those he would have made if he knew the long run empirical frequencies of the process.

1. Introduction
Date: First draft: February 2013; this version: August 20, 2018.
2000 Mathematics Subject Classification. Primary: 60G10, 91A26. Secondary: 37A25, 62M20, 62F15.
We thank Ehud Kalai, Ehud Lehrer and Rann Smorodinsky for helpful discussions.

Consider a stationary, finite-valued stochastic process with probability law µ. According to the ergodic theorem, an observer of this process can reconstruct the 'true' ergodic component of the process from observing a single typical infinite realization. Decision problems, on the other hand, are often concerned with making predictions based on finite past observations. In such problems, the primary object of interest is the predictive distribution about the outcome of the process at a given day given the finite history of outcomes from previous days.

This paper relates these two perspectives on predictions and decisions. We consider the long-run properties of an observer's predictive distribution over next period's outcome as observations accumulate. We show that the predictive distribution becomes arbitrarily close to the predictive distribution conditioned on knowledge of the true ergodic component, in most periods almost surely. Thus, as data accumulates, an observer's predictive distributions based on finite history become the 'correct' predictions, in the sense of becoming nearly as good as what he would have predicted given knowledge of the objective empirical frequencies of the process. We demonstrate that the various qualifications we impose cannot be dropped.

Our results connect several literatures on learning and predictions in stochastic environments. First, there is the literature on the strong merging of opinions, pioneered by Blackwell and Dubins [2]; Kalai and Lehrer [7] apply this concept to learning in games. More directly relevant to our purpose are the weaker notions of merging introduced by Kalai and Lehrer [9] and Lehrer and Smorodinsky [11], which focus on closeness of near-horizon predictive distributions. While strong merging obtains only under stringent assumptions, weak merging can be more easily satisfied. In our setting, for example, the posteriors may fail to strongly merge with the true parameter, no matter how much data accumulates. This strong notion of merging is unnecessary in contexts where decision makers discount the future or care only about a fixed number of future periods. Weak merging, to which our results apply, is usually sufficient.

Another line of enquiry focuses on representations of the form µ = ∫_Θ µ_θ dλ(θ), where a probability measure µ (the law of the stochastic process) is expressed as a convex combination of distributions {µ_θ}_{θ∈Θ} that may be viewed as especially "simple" or "elementary." Such representations, also called decompositions, are useful in models of learning where the set of parameters Θ may be viewed as the main object of learning. Two seminal theorems are de Finetti's representation of exchangeable distributions and the ergodic decomposition theorem for stationary processes. Exchangeability rules out many interesting patterns of inter-temporal correlation, so it is natural to consider the larger class of stationary distributions. For this class, the canonical decomposition is in terms of the ergodic distributions. This is the finest decomposition possible using parameters that are themselves stationary. Our main theorem states that a Bayesian decision maker's predictions, based on finite histories, become arbitrarily close to those he would have made given knowledge of the true ergodic component.

Our result should also be contrasted with Doob's consistency theorem, which states that Bayesian posteriors weakly converge to the true parameter. When the focus is the quality of decisions, what matters is not the agent's belief about the true parameter but the quality of his predictions. Although the two concepts are related, they are not the same.
The difference is seen in the following example from Jackson, Kalai and Smorodinsky [6, Example 5]: Assume that the outcomes Heads and Tails are generated by tossing a fair coin. If we take the set of all Dirac measures on infinite sequences of Heads-Tails outcomes as "parameters", then the posterior about the parameter converges weakly to a belief that is concentrated on the true realization. On the other hand, the agent's prediction about next period's outcome is constant and never approaches the prediction given the true "parameter." This example highlights that convergence of posterior beliefs to the true parameters may have little relevance to an agent's predictions and behavior.

Every process can be represented in an infinite number of ways, many of which, like the decomposition of the coin toss process above, are not very sensible. Jackson, Kalai and Smorodinsky [6] study the question of what makes a particular decomposition of a stochastic process sensible. One requirement is for the process to be learnable, in the sense that an agent's predictions about near-horizon events become close to what he would have predicted had he known the true parameter. Given the close connection between ergodic distributions and long-run frequencies, the most natural decomposition µ = ∫_Θ µ_θ dλ(θ) of a stationary process is the one in which the θ's index the ergodic distributions. We show that their results do not apply to the class of stationary processes and their canonical ergodic decompositions. We show, however, that the ergodic decomposition is learnable in a weaker, yet meaningful, sense described below.

A third related literature, which traces to Cover [3], is non-Bayesian estimation of stationary processes. See Morvai and Weiss [13] and the references therein. This literature looks for an algorithm that makes near-horizon predictions that are accurate for every stationary process. Our proofs of Theorem 3.1 and Example 3.3 rely on techniques that were developed in this literature.
There is, however, a major difference between that literature and our work: We are interested in a specific algorithm, namely Bayesian updating. Our agent's predictions and behavior are derived from this updating process. We show how to apply the mathematical apparatus developed for non-Bayesian estimation in our Bayesian setup.

2. Formal model
2.1. Preliminaries.
An agent (a decision maker, a player, or a statistician) observes a stochastic process (ζ_0, ζ_1, ζ_2, ...) that takes values in a finite set of outcomes A. Time is indexed by n and the agent starts observing the process at n = 0. Let Ω = A^ℕ be the space of realizations of the process, with generic element denoted ω = (a_0, a_1, ...). Endow Ω with the product topology and the induced Borel structure F. Let ∆(Ω) be the set of probability distributions over Ω. The law of the process is an element µ of ∆(Ω). A standard way to represent uncertainty about the process is in terms of an index set of "parameters:"

Definition 2.1.
Let µ ∈ ∆(Ω). A decomposition of µ is a quadruple (Θ, B, λ, (µ_θ)) where (Θ, B, λ) is a standard probability space of parameters and µ_θ ∈ ∆(Ω) for every θ ∈ Θ, such that the map θ ↦ µ_θ(A) is B-measurable and

(1)  µ(A) = ∫_Θ µ_θ(A) λ(dθ)

for every A ∈ F.

A decomposition captures a certain way in which a Bayesian agent arranges his beliefs: The agent views the process as a two-stage randomization. First a parameter θ is chosen according to λ, and then the outcomes are generated according to µ_θ. Beliefs can be represented in many ways. The two extreme decompositions are: (1) the Trivial Decomposition, with Θ = {θ̄}, B trivial, and µ_θ̄ = µ; and (2) the Dirac Decomposition, with Θ = A^ℕ, B = F, and λ = µ. A "parameter" in this case is just a Dirac measure δ_ω that assigns probability 1 to the realization ω.

We are interested in decompositions that identify "useful" patterns shared by many realizations. These patterns capture our intuition of fundamental properties of a process. The two extreme cases are usually unsatisfactory. In the Dirac decomposition, there are as many parameters as there are realizations; parameters simply copy realizations. In the trivial decomposition, there is a single parameter, which therefore cannot discriminate between different interesting patterns.

Stationary beliefs admit a well-known decomposition with natural properties. Recall that the set of stationary measures over Ω is convex and compact in the weak*-topology. Its extreme points are called ergodic beliefs. We denote the set of ergodic beliefs by E. Every stationary belief µ admits a unique decomposition in which the parameter set is the set of ergodic beliefs: µ = ∫ ν λ(dν) for some belief λ ∈ ∆(E). We call this decomposition the ergodic decomposition.

According to the ergodic theorem, for every stationary belief µ and every block (ā_0, ..., ā_{k−1}) ∈ A^k, the limit frequency

Π(ω; ā_0, ..., ā_{k−1}) = lim_{n→∞} (1/n) #{0 ≤ t < n : a_t = ā_0, ..., a_{t+k−1} = ā_{k−1}}

exists for µ-almost every realization ω = (a_0, a_1, ...). When µ is ergodic this limit equals the probability µ([ā_0, ..., ā_{k−1}]). Thus, for ergodic processes, the probability of every block equals its (objective) empirical frequency.

The ergodic decomposition theorem states that, for µ-almost every ω, the function Π(ω; ·) defined over blocks can be extended to a stationary measure in ∆(Ω) which is also ergodic. Moreover, µ = ∫ Π(ω; ·) µ(dω), so the function ω ↦ Π(ω; ·) recovers the ergodic parameter from the realization of the process. Thus, the parameters µ_θ in the ergodic decomposition represent the empirical distributions of finite sequences of outcomes along the realizations of the stationary process. These parameters capture our intuition of fundamentals of the process.

A special case of the ergodic decomposition is the decomposition of an exchangeable distribution µ via i.i.d. distributions. For future reference, consider the following example:

Example 2.2.
The set of outcomes is A = {0, 1} and the agent's belief is given by

µ(ζ_n = a_0, ..., ζ_{n+k−1} = a_{k−1}) = 1 / ((k + 1)·C(k, d))

for every n, k ∈ ℕ and a_0, ..., a_{k−1} ∈ A, where d = a_0 + ··· + a_{k−1} and C(k, d) is the binomial coefficient. Thus, the agent believes that if he observes the process for k consecutive periods then the number d of good periods (periods with outcome 1) is distributed uniformly in {0, 1, ..., k}, and all configurations with d good outcomes are equally likely.

De Finetti's decomposition is given by (Θ, B, λ), where Θ = [0, 1] is equipped with the standard Borel structure B and Lebesgue's measure λ, and, for θ ∈ Θ, µ_θ ∈ ∆(Ω) is the distribution of i.i.d. coin tosses with probability of success θ:

µ_θ(ζ_n = a_0, ..., ζ_{n+k−1} = a_{k−1}) = θ^d (1 − θ)^{k−d}.

2.2. Learning.
For every µ ∈ ∆(Ω) and sequence (a_0, ..., a_{n−1}) ∈ A^n with positive µ-probability, the n-period predictive distribution is the element µ(·|a_0, ..., a_{n−1}) ∈ ∆(A) representing the agent's prediction about next period's outcome given a prior µ and after observing the first n outcomes of the process. Predictive distributions in this paper will always refer to one-step-ahead predictions. This is for expository simplicity; our analysis covers any finite horizon.

Kalai and Lehrer [9], and Kalai, Lehrer and Smorodinsky [8], introduced the following notions of merging. Note that in our setup, where the set of outcomes is the same in every period, this definition of merging is the same as 'weak star merging' in D'Aristotile, Diaconis and Freedman [4].

Definition 2.3.
Let µ, µ̃ ∈ ∆(Ω). The belief µ̃ merges to µ if

‖µ̃(·|a_0, ..., a_{n−1}) − µ(·|a_0, ..., a_{n−1})‖ −→ 0 as n → ∞

for µ-almost every realization ω = (a_0, a_1, ...) ∈ A^ℕ. The belief µ̃ weakly merges to µ if

(2)  ‖µ̃(·|a_0, ..., a_{n−1}) − µ(·|a_0, ..., a_{n−1})‖ s.c.−→ 0 as n → ∞

for µ-almost every realization ω = (a_0, a_1, ...) ∈ A^ℕ.

Here and later, for every pair p, q ∈ ∆(A) we let ‖p − q‖ = max_{a∈A} |p[a] − q[a]|. These definitions were inspired by Blackwell and Dubins' idea of strong merging, which requires that the predictions of µ̃ be similar to the predictions of µ not just for the next period but for the infinite horizon.

Definition 2.4.
A decomposition (Θ, B, λ, (µ_θ)) of µ ∈ ∆(Ω) is learnable if µ merges with µ_θ for λ-almost every θ. The decomposition is weakly learnable if µ weakly merges with µ_θ for λ-almost every θ.

A bounded sequence of real numbers a_0, a_1, ... is said to strongly Cesaro converge to a real number a, denoted a_n s.c.−→ a, if lim_{n→∞} (1/n) Σ_{k=0}^{n−1} |a_k − a| = 0. Equivalently, a_n s.c.−→ a if there exists a full density set T ⊆ ℕ such that lim_{n→∞, n∈T} a_n = a.

As an example of a learnable decomposition, consider the Bayesian agent of Example 2.2. In this case

µ(1 | a_0, ..., a_{n−1}) = (a_0 + ··· + a_{n−1} + 1) / (n + 2).

For every θ ∈ [0, 1], the strong law of large numbers implies that µ(1 | a_0, ..., a_{n−1}) converges µ_θ-almost surely to θ. Therefore µ merges with µ_θ for every θ, so de Finetti's decomposition is learnable (and, a fortiori, weakly learnable). This is a rare case in which the predictions µ(ζ_n ∈ · | a_0, ..., a_{n−1}) and µ_θ(ζ_n ∈ · | a_0, ..., a_{n−1}) can be calculated explicitly. In general, merging and weak merging are difficult to establish, because the Bayesian prediction about the next period is a complicated expression which potentially depends on the entire observed past.

2.3. Motivation for Weak Merging.
In applications, µ represents the true process generating observations, and µ̃ is a Bayesian agent's belief. To say that µ̃ weakly merges with µ means that his next-period predictions are accurate except at rare times.

To connect this concept with statistical decision problems, suppose that in every period, before the outcome is realized, the agent has to take some decision from a finite set D. The agent's payoff is represented by a payoff function r : A × D → [0, 1]. A strategy is given by f : ∪_{n≥0} A^n → D, with f(a_0, ..., a_{n−1}) denoting the action taken given the past realized outcomes. Let

V_N(f) = (1/N) ∫ Σ_{n=0}^{N−1} r(a_n, f(a_0, ..., a_{n−1})) dµ

be the expected average payoff in the first N periods. Fix ε > 0. A strategy f* is ε-optimal for N periods under µ if V_N(f) ≤ V_N(f*) + ε for every strategy f. Of course, the optimal strategy depends on the agent's belief µ. The following proposition, which is immediate from the definition of weak learning, says that an agent who maximizes according to a belief that weakly merges with the truth will play ε-optimal strategies against the truth if he is sufficiently patient. By 'sufficiently patient' we mean that the horizon N is large. A similar result applies if the agent aggregates periods' payoffs using some discount factor, where by 'sufficiently patient' is meant that the discount factor is close to 1.

Proposition 2.5.
Let µ̃, µ ∈ ∆(Ω) be such that µ̃ weakly merges with µ. For every ε > 0 there exists N_0 such that for every N > N_0, in every decision problem, every optimal strategy for N periods under µ̃ is ε-optimal for N periods under µ.

Kalai, Lehrer and Smorodinsky [8] provide a motivation for weak learning in terms of the properties of calibration tests. The idea of calibration originated with Dawid [5]. A calibration test of a forecast compares the predicted frequency of events to their realized empirical frequencies. Kalai et al. showed that µ̃ weakly merges with µ if and only if forecasts made by µ̃ pass all calibration tests of a certain type when the outcomes are generated according to µ.

Finally, Lehrer and Smorodinsky [12] provide a characterization of weak merging in terms of the relative entropy between µ̃ and µ. No similar characterization is known for merging.

2.4. Merging and the Consistency of Bayesian Estimators.
The idea of learning captured by Definition 2.4 concerns the quality of predictions made about near-horizon events. Another, perhaps more common, way to think about Bayesian inference is in terms of the consistency of the Bayesian estimator. Consistency can be thought of as concerning learning the parameter itself. Recall that the Bayesian estimator of the parameter θ is the agent's conditional belief over θ after observing the outcomes of the process. It is well known that under any 'reasonable' decomposition the Bayesian estimator is consistent, i.e., the estimator weakly converges to the Dirac measure over the true parameter as data accumulates. However, consistency of the estimator does not imply that the agent can use what he has learned to make predictions about future outcomes. For example, consider the Dirac decomposition of the process of fair coin tosses. Suppose the true parameter is ω* for some ω* = (ω*_0, ω*_1, ...). After observing the first n outcomes of the process, the agent's belief about the parameter is uniform over all ω that agree with ω* on the first n coordinates. While this belief indeed converges to δ_{ω*}, the agent does not gain any new insight about the future of the process from learning the parameter. This decomposition is therefore not learnable.

3. Main Theorem
We are now in a position to state our main theorem.
Theorem 3.1.
The ergodic decomposition of every stationary stochastic process is weakly learnable.
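The content of the theorem can be illustrated numerically in the simplest stationary setting, an exchangeable process whose ergodic components are i.i.d. Bernoulli laws. The following sketch is ours, not the paper's: the two component values 0.3 and 0.8, the uniform prior over them, and all names are illustrative assumptions. It draws an ergodic component, runs Bayesian updating, and tracks the gap between the Bayesian one-step prediction and the prediction under the true component; the Cesaro average of this gap is the quantity that weak merging controls.

```python
# Illustration of weak merging for an i.i.d. mixture: the Bayesian one-step
# prediction versus the prediction under the true ergodic component.
import random

def run(n_steps=2000, seed=1):
    random.seed(seed)
    thetas = [0.3, 0.8]                  # assumed ergodic components (Bernoulli)
    true_theta = random.choice(thetas)   # nature draws the ergodic component
    weights = [0.5, 0.5]                 # posterior over the two components
    gaps = []
    for _ in range(n_steps):
        # Bayesian one-step prediction vs. prediction under the true component
        pred = sum(w * t for w, t in zip(weights, thetas))
        gaps.append(abs(pred - true_theta))
        a = 1 if random.random() < true_theta else 0
        # Bayes update on the observed outcome
        weights = [w * (t if a == 1 else 1 - t) for w, t in zip(weights, thetas)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return gaps

gaps = run()
cesaro = sum(gaps) / len(gaps)   # the Cesaro average that weak merging controls
```

At time 0 the gap is |0.55 − θ| = 0.25; once the posterior concentrates on the drawn component the gap is essentially zero, so the Cesaro average is small, as the theorem predicts for this special case.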
To see the implications of our theorem, consider the following hidden Markov process.
Example 3.2.
An agent believes that the state of the economy in every period is a noisy signal of an underlying "hidden" state that changes according to a Markov chain with memory 1. Formally, let A = {B, G} be the set of outcomes, H = {B, G} the set of hidden (unobserved) states, and (ξ_n, ζ_n) an (H × A)-valued stationary Markov process with transition matrix ρ : H × A → ∆(H × A) given by

ρ(h, a)[h′, a′] = (p δ_{h,h′} + (1 − p)(1 − δ_{h,h′})) · (q δ_{h′,a′} + (1 − q)(1 − δ_{h′,a′})),

where 1/2 < p, q < 1. Thus, if the hidden state in period n was h, then in period n + 1 the hidden state h′ remains h with probability p and changes with probability 1 − p. The observed state a′ of period n + 1 will then be h′ with probability q and different from h′ with probability 1 − q. Let µ_{p,q} ∈ ∆(A^ℕ) be the distribution of ζ_0, ζ_1, .... Then µ_{p,q} is a stationary process which is not Markov of any order. If the agent is uncertain about p, q, then his belief µ about the outcome process is again stationary, and can be represented by some prior over the parameter set Θ = (1/2, 1) × (1/2, 1); this representation of µ will be the ergodic decomposition.

[Footnotes: (i) However, we do not know whether their condition can be used to prove our theorem without repeating the whole argument. (ii) The argument traces back to Doob; see, for example, Weizsacker [16] and the references therein. It holds whenever the decomposition has the property that the realization of the process determines the parameter.]

The consistency of the Bayesian estimator for (p, q) implies that the conditional belief over the parameter (p, q) converges almost surely, in the weak topology over ∆(Θ), to the belief concentrated on the true parameter. However, because next-period predictions involve complicated expressions that depend on the entire history of the process, it is not clear whether these predictions merge with the truth. It follows from our theorem that they weakly merge.

Consider now the general case. If the agent knew the fundamental θ, then at period n, after observing the partial history (a_0, ..., a_{n−1}), his predictive probability that the next period outcome is a_n would have been

(3)  µ_θ(a_0, ..., a_{n−1}, a_n) / µ_θ(a_0, ..., a_{n−1}).

Again, consistency of the Bayesian estimator implies that, given uncertainty about the fundamental, the agent's assessment of µ_θ(b) becomes asymptotically accurate for every block b. However, when the agent has to compute the next-period posterior probability (3), he has had only one observation of a block of size n and no observation of the block of size n + 1, so at that stage his assessment of the probabilities that appear in (3) may be completely wrong.
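For the hidden Markov family µ_{p,q} of Example 3.2, the one-step predictive distribution for a known parameter can be computed by forward filtering on the hidden state, and the Bayesian agent who is uncertain about (p, q) can be approximated by mixing over a finite grid of parameters. The sketch below is our own illustration, not the authors' code; the function names and the grid are assumptions.

```python
# Forward filtering for the hidden Markov family mu_{p,q} of Example 3.2.
# The hidden state keeps its value with probability p, and the observation
# matches the hidden state with probability q.

def filter_predict(obs, p, q):
    """Return (probability that the next outcome is 'G', likelihood of obs)."""
    emit = lambda a, h: q if a == h else 1.0 - q
    belief = {'B': 0.5, 'G': 0.5}   # stationary distribution of the hidden chain
    lik = 1.0
    for a in obs:
        # condition the hidden-state belief on the current observation
        joint = {h: belief[h] * emit(a, h) for h in belief}
        z = joint['B'] + joint['G']
        lik *= z
        belief = {h: joint[h] / z for h in joint}
        # advance the hidden chain one step
        belief = {'B': p * belief['B'] + (1 - p) * belief['G'],
                  'G': p * belief['G'] + (1 - p) * belief['B']}
    # belief now describes the next hidden state; predict the next observation
    pred_G = q * belief['G'] + (1 - q) * belief['B']
    return pred_G, lik

def grid_predict(obs, grid):
    """One-step prediction of a Bayesian agent with a uniform prior on grid."""
    results = [filter_predict(obs, p, q) for (p, q) in grid]
    z = sum(lik for _, lik in results)
    return sum(pred * lik for pred, lik in results) / z
```

As a degenerate sanity check, with q = 1 the observations reveal the hidden state exactly, and after a run of G's the one-step prediction reduces to the transition probability p.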
Our theorem says that the agent will still weakly learn to make these predictions correctly.

Theorem 3.1 states that the agent will make predictions about near-horizon events as if he knew the fundamental of the process. Note, however, that it is not possible to ensure that the agent will learn to predict long-run events correctly, no matter how much data accumulates. For example, consider an agent who faces a sequence of i.i.d. coin tosses with parameter θ ∈ [0, 1] representing the probability of Heads. Suppose this agent has a uniform prior over [0, 1]. This agent will eventually learn to predict near-horizon outcomes as if he knew the true parameter θ, but he will continue to assign probability 0 to the event that the long-run frequency is θ. In economic models, discounting implies that only near-horizon events matter.

We end this section with an example showing that in Theorem 3.1 weak learnability cannot be replaced by learnability. The example is a modification of an example given by Ryabko for the forward prediction problem in a non-Bayesian setup [14].

Example 3.3.
Every period there is a probability 1/2 that a war erupts. Let A = {W, B, G} be the set of outcomes. We define µ ∈ ∆(A^ℕ) through its ergodic decomposition. Let Θ = {B, G}^{{1,2,...}} be the set of parameters, with the standard Borel structure B and the uniform distribution λ. Thus, a parameter is a function θ : {1, 2, ...} → {B, G}. We can think about this belief as a hidden Markov model where the unobservable process ξ_0, ξ_1, ... is the time that has elapsed since the last time a war occurred. Thus, ξ_0, ξ_1, ... is the ℕ-valued stationary Markov process with transition probability

ρ(j | k) = 1/2 if j = k + 1, 1/2 if j = 0, and 0 otherwise,

for every j, k ∈ ℕ, and µ_θ is the distribution of a sequence ζ_0, ζ_1, ... of A-valued random variables such that ζ_n = W if ξ_n = 0, and ζ_n = θ(ξ_n) otherwise.

Consider a Bayesian agent who observes the process. After the first time a war erupts, the agent keeps track of the state ξ_n of the process at every period. If there is no uncertainty about the parameter, i.e., if the Bayesian agent knew θ, his prediction about the next outcome when ξ_n = k gives probability 1/2 to W and probability 1/2 to θ(k + 1). On the other hand, if the agent does not know θ but believes that it is randomized according to λ, he can deduce the values θ(k) gradually as he observes the process. However, for every k ∈ {1, 2, 3, ...} there will be a time when the agent observes k consecutive peaceful periods for the first time, and at this point the agent's prediction about the next outcome will be (1/2, 1/4, 1/4). At such times, an agent who predicts according to µ will differ from an agent who predicts according to µ_θ. Therefore the decomposition is not learnable. On the other hand, in agreement with our theorem, these occasions become more infrequent as time goes by, so the decomposition is weakly learnable.

4. Proof of Theorem 3.1
Up to now we assumed that the stochastic process starts at time n = 0. When working with stationary processes it is natural to extend the index set of the process from ℕ to ℤ, i.e., to assume that the process has an infinite past. This is without loss of generality: every stationary stochastic process ζ_0, ζ_1, ... admits an extension ..., ζ_{−1}, ζ_0, ζ_1, ... to the index set ℤ [10, Lemma 10.2]. We therefore assume hereafter, in harmless contrast with our previous notation, that Ω = A^ℤ.

Let D be a σ-algebra of Borel subsets of Ω. The quotient space of (Ω, F, µ) with respect to D is the unique (up to isomorphism of measure spaces) standard probability space (Θ, B, λ) together with a measurable map α : Ω → Θ such that D is generated by α, i.e., for every F-measurable function f from Ω to some standard probability space there exists a (unique up to equality λ-almost surely) B-measurable lifting f̃ defined over Θ such that f = f̃ ∘ α, µ-a.s. The conditional distributions of µ over D are the unique (up to equality λ-almost surely) family (µ_θ) of probability measures over (Ω, F) such that:

(1) For every θ ∈ Θ it holds that

(4)  µ_θ({ω | α(ω) = θ}) = 1.

(2) The map θ ↦ µ_θ(A) is B-measurable and (1) is satisfied for every A ∈ F.

We call (Θ, B, λ, (µ_θ)) the decomposition of µ induced by D. For every belief µ ∈ ∆(Ω), the trivial decomposition of µ is generated by the trivial sigma-algebra {∅, Ω}, and the Dirac decomposition is generated by the sigma-algebra of all Borel subsets of Ω. The ergodic decomposition is induced by the σ-algebra I of all invariant Borel sets of Ω, i.e., all Borel sets S ⊆ Ω such that S = T^{−1}(S), where T : Ω → Ω is the left shift.

We will prove a more general theorem, which may be interesting in its own right. Let T : A^ℤ → A^ℤ be the left shift, so that T(ω)_n = ω_{n+1} for every n ∈ ℤ. A sigma-algebra D of Borel subsets of Ω is shift-invariant if S ∈ D ↔ T(S) ∈ D for every Borel subset S of A^ℤ.

Theorem 4.1.
Let µ be a stationary distribution over Ω and let D be a shift-invariant σ-algebra of Borel subsets of Ω such that D ⊆ F^0_{−∞}, where, for m ≤ n, we write F^n_m for the σ-algebra generated by ζ_m, ..., ζ_{n−1}, and F^n = F^n_0. Then the decomposition of µ induced by D is weakly learnable.

Theorem 3.1 follows immediately from Theorem 4.1, since the sigma-algebra I of invariant sets, which induces the ergodic decomposition, satisfies the assumptions of Theorem 4.1. We will prove Theorem 4.1 using Lemma 4.2.

Lemma 4.2.
Let µ be a stationary distribution over A^ℤ and let D be a shift-invariant σ-algebra of Borel subsets of A^ℤ. Then

(5)  ‖µ(ζ_n = · | F^n ∨ D) − µ(ζ_n = · | F^n_{−∞} ∨ D)‖ s.c.−→ 0 as n → ∞, µ-a.s.

Consider the case in which D = {∅, Ω} is trivial. Then Lemma 4.2 says that a Bayesian agent who observes a stationary process from time n = 0 onwards will make predictions in the long run as if he knew the infinite history of the process.

Proof of Lemma 4.2.
For every n ≥ 1 let f_n : Ω → ∆(A) be a version of the conditional distribution of ζ_0 according to µ given the finite history ζ_{−n}, ..., ζ_{−1} and D:

f_n = µ(ζ_0 = · | F^0_{−n} ∨ D),

and let f_∞ : Ω → ∆(A) be a version of the conditional distribution of ζ_0 according to µ given the infinite history ..., ζ_{−2}, ζ_{−1} and D:

f_∞ = µ(ζ_0 = · | F^0_{−∞} ∨ D).

Let g_n = ‖f_n − f_∞‖. By the martingale convergence theorem, lim_{n→∞} f_n = f_∞ µ-a.s., and therefore

(6)  lim_{n→∞} g_n = 0 µ-a.s.

It follows from the stationarity of µ and the fact that D is shift-invariant that

(7)  ‖µ(ζ_n = · | F^n ∨ D) − µ(ζ_n = · | F^n_{−∞} ∨ D)‖ = ‖f_n ∘ T^n − f_∞ ∘ T^n‖ = g_n ∘ T^n, µ-a.s.

Therefore

(1/N) Σ_{n=0}^{N−1} ‖µ(ζ_n = · | F^n ∨ D) − µ(ζ_n = · | F^n_{−∞} ∨ D)‖ = (1/N) Σ_{n=0}^{N−1} g_n ∘ T^n −→ 0 as N → ∞, µ-a.s.,

where the equality follows from (7) and the limit follows from (6) and Maker's generalization of the ergodic theorem [10, Corollary 10.8], which covers multiple functions simultaneously:

Maker's Ergodic Theorem.
Let µ ∈ ∆(Ω) be such that Tµ = µ and let g_1, g_2, ... : Ω → ℝ be such that sup_n |g_n| ∈ L^1(µ) and g_n → g_∞ µ-a.s. Then

(1/N) Σ_{n=0}^{N−1} g_n ∘ T^n −→ E(g_∞ | I) as N → ∞, µ-a.s. □

Proof of Theorem 4.1.
From
D ⊆ F^0_{−∞} it follows that F^n_{−∞} ∨ D = F^n_{−∞}. Therefore, from Lemma 4.2 we get that

‖µ(ζ_n = · | F^n ∨ D) − µ(ζ_n = · | F^n_{−∞})‖ s.c.−→ 0 as n → ∞, µ-a.s.

By the same lemma (with D trivial),

‖µ(ζ_n = · | F^n) − µ(ζ_n = · | F^n_{−∞})‖ s.c.−→ 0 as n → ∞, µ-a.s.

By the last two limits and the triangle inequality,

(8)  ‖µ(ζ_n = · | F^n) − µ(ζ_n = · | F^n ∨ D)‖ ≤ ‖µ(ζ_n = · | F^n) − µ(ζ_n = · | F^n_{−∞})‖ + ‖µ(ζ_n = · | F^n ∨ D) − µ(ζ_n = · | F^n_{−∞})‖ s.c.−→ 0 as n → ∞, µ-a.s.

Let (Θ, B, λ) be the quotient of (Ω, F, µ) over D and let (µ_θ) be the corresponding conditional distributions. Let S be the set of all realizations ω = (..., a_{−1}, a_0, a_1, ...) such that

‖µ(ζ_n = · | a_0, ..., a_{n−1}) − µ_{α(ω)}(ζ_n = · | a_0, ..., a_{n−1})‖ s.c.−→ 0 as n → ∞.

Then µ(S) = 1 by (8). But µ(S) = ∫ µ_θ(S) λ(dθ). It follows that µ_θ(S) = 1 for λ-almost every θ, as desired. □

5. Ergodicity and mixing
Mixing conditions formalize the intuition that observing a sequence of outcomes of a process does not change one's belief about events in the far future. Standard examples of mixing processes are i.i.d. processes and non-periodic Markov processes. In this section we recall a mixing condition that was called "sufficiency for prediction" in JKS, show that the ergodic decomposition is not necessarily sufficient for prediction, and show that a decomposition finer than the ergodic decomposition is sufficient for prediction and also weakly learnable.

Let −→T = ⋂_{m≥0} F^∞_m be the future tail sigma-algebra, where F^∞_m is the σ-algebra of Ω that is generated by (ζ_m, ζ_{m+1}, ...). A probability distribution (not necessarily stationary) ν ∈ ∆(Ω) is mixing if it is −→T-trivial, i.e., if ν(B) ∈ {0, 1} for every B ∈ −→T.

If we want the components of the decomposition to be mixing, we need a finer decomposition than the ergodic decomposition. This is the decomposition induced by the tail −→T, as shown in the following proposition.
Proposition 5.1.
Let (Θ, B, λ, (µ_θ)) be the decomposition of a belief µ ∈ ∆(Ω) induced by the tail −→T. Then µ_θ is mixing for λ-almost every θ.

Proof. This proposition is JKS' Theorem 1. We repeat the argument here to clarify a gap in their proof. The proposition follows from the fact that the conditional distributions of every probability distribution µ ∈ ∆(Ω) over the tail are almost surely tail-trivial (i.e., mixing). This fact was recently proved by Berti and Rigo [1, Theorem 15]. We note that it is not true for every sigma-algebra D that the conditional distributions of µ over D are almost surely D-trivial. This property is very intuitive (and indeed, easy to prove) when D is generated by a finite partition, or more generally when D is countably generated, but the tail is not countably generated, which is why Berti and Rigo's result is required. □

An equivalent way to write the mixing condition is that for every n and ε there is m such that |ν(B | a_0, ..., a_{n−1}) − ν(B)| < ε for every B ∈ F^∞_m and partial history (a_0, ..., a_{n−1}) ∈ A^n. JKS call such a belief sufficient for prediction; they establish the equivalence with the mixing condition in the proof of their Theorem 1.

The next theorem uses Lemma 4.2 to show that the tail decomposition is also weakly learnable. In particular, Theorem 5.2 implies that the ergodic decomposition does not capture all the learnable properties of a stationary process.
The tail decomposition of a stationary stochastic process is weakly learnable.

Proof.
From Lemma 4.2 it follows that the decomposition induced by the past tail ←−T is weakly learnable, since the past tail is shift-invariant. The theorem now follows from the fact that for every stationary belief µ over a finite set of outcomes it holds that ←−T_µ = −→T_µ, where ←−T_µ and −→T_µ are the completions of the past and future tails under µ; see Weiss [15, Section 7]. Therefore, the decomposition of µ induced by −→T equals the decomposition induced by ←−T, which is weakly learnable. We note that the equality of the past and future tails of a stationary process is not trivial: it relies on the finiteness of the set of outcomes A, and the proof relies on the notion of entropy. □

We conclude with further comments on the relationship with [6]. Their main result characterizes the class of distributions that admit a decomposition which is both learnable and sufficient for prediction. They dub these processes "asymptotically reverse mixing." In particular, they prove that, for every such process µ, the decomposition of µ induced by the future tail is learnable and sufficient for prediction. In our Example 3.3, the tail decomposition equals the ergodic decomposition and, as we have shown, is not learnable. This shows that stationary processes need not be asymptotically reverse mixing. On the other hand, the class of asymptotically reverse mixing processes contains non-stationary processes. For example, the Dirac measure δ_ω is asymptotically reverse mixing for every realization ω ∈ Ω.

6. Extensions
In this section we discuss to what extent the theorems and tools of this paper extend to a larger class of processes; doing so sheds further light on the assumptions made in our work.

6.1. Infinite set of outcomes.
The definitions of merging and weak merging can be extended to the case in which the outcome set A is a compact metric space. (They also extend to the case in which A is a separable metric space, but then there are several possible non-equivalent definitions [4].) Let φ be the Prohorov metric over ∆(A). Say that the belief ˜µ ∈ ∆(A^N) merges to µ ∈ ∆(A^N) if

φ(˜µ(· | a_0, …, a_{n−1}), µ(· | a_0, …, a_{n−1})) → 0 as n → ∞

for µ-almost every realization ω = (a_0, a_1, …) ∈ A^N, and that ˜µ weakly merges to µ if the limit holds in the strong Cesàro sense. Theorem 3.1 extends to the case of an infinite set A of outcomes. However, Theorem 5.2 does not hold in this case: we used finiteness in the proof when we appealed to the equality of the past and future tails of the process. The following example shows the problem when A is infinite:
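Spelling out the strong Cesàro requirement may help: since the Prohorov distances are nonnegative, weak merging amounts to their time averages vanishing, i.e., for µ-almost every realization,

```latex
\lim_{N\to\infty}\ \frac{1}{N}\sum_{n=0}^{N-1}
\varphi\Bigl(\tilde\mu(\,\cdot\mid a_0,\dots,a_{n-1}),\;
\mu(\,\cdot\mid a_0,\dots,a_{n-1})\Bigr)\;=\;0 .
```

This permits the distance to stay large in rare periods, which is exactly the sense in which weak merging is weaker than merging.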
Let A = {0, 1}^N, equipped with the standard Borel structure. Thus an element a ∈ A is given by a = (a[0], a[1], …), where a[k] ∈ {0, 1} for every k ∈ N. Let µ be the belief over A^Z such that {ζ_n[0]}_{n∈Z} are i.i.d. fair coin tosses and ζ_n[k] = ζ_{n−k}[0] for every k ≥ 1. Note that in this case →T = B (the future tail contains the entire history of the process), while ←T is trivial (the past tail is empty). The tail decomposition in this case is the Dirac decomposition. However, this decomposition is not learnable: an agent who predicts according to µ will, at every period n, be completely in the dark about ζ_{n+1}[0].

6.2. Relaxing stationarity.
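The process in Example 6.1 can be simulated directly; the following is a minimal sketch (truncating each outcome to finitely many coordinates; the function name and truncation depth are our own illustrative choices). It exhibits the failure of learnability numerically: conditioning on the observed past leaves the next period's first coordinate a fair coin.

```python
import random

def simulate(num_steps, depth, seed=0):
    """Simulate Example 6.1, truncating each outcome to its first `depth` coordinates.

    zeta[n][k] = zeta[n-k][0]: coordinate k of period n's outcome replays the
    fresh coin from period n-k, so observing zeta_0, ..., zeta_n reveals only
    coins tossed up to period n.
    """
    rng = random.Random(seed)
    coins = [rng.randint(0, 1) for _ in range(num_steps + depth)]
    # coins[depth + m] plays the role of zeta_m[0]; the offset keeps list
    # indices nonnegative during the first few periods.
    zeta = [[coins[depth + n - k] for k in range(depth)] for n in range(num_steps)]
    return zeta

zeta = simulate(num_steps=10000, depth=5)

# Coordinate 0 at period n+1 is a fresh coin, independent of everything
# observed through period n; e.g., conditioning on zeta_n[0] = 1 leaves its
# empirical frequency near 1/2.
nxt = [zeta[n + 1][0] for n in range(len(zeta) - 1) if zeta[n][0] == 1]
print(round(sum(nxt) / len(nxt), 2))  # close to 0.5
```

By contrast, every coordinate k ≥ 1 of today's outcome is perfectly predictable from past observations, which is why the future tail is the full Borel structure while the past tail is empty.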
As we have argued earlier, stationary beliefs are useful for modeling situations in which there is nothing remarkable about the point in time at which the agent started keeping track of the process (so other agents who start observing the process at different times have the same beliefs), and in which the agent is a passive observer who has no impact on the process itself. The first assumption is rather strong, and can be somewhat relaxed. In particular, consider a belief that is the posterior of some stationary prior conditioned on the occurrence of some event. (A similar situation is an agent who observes a finite-state Markov process that starts at a given state rather than at the stationary distribution.) Let us say that a belief ν ∈ ∆(A^N) is conditionally stationary if there exists some stationary belief µ such that ν = µ(· | B) for some Borel subset B of A^N with µ(B) > 0. While such processes are not stationary, they still admit an ergodic decomposition, and they exhibit the same tail behavior as stationary processes. In particular, our theorems extend to such processes. The obvious details are omitted.
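The Markov illustration above can be checked by direct computation; here is a minimal numerical sketch (the two-state transition matrix is our own illustrative choice). Conditioning the stationary chain on its initial state reproduces exactly the chain started deterministically at that state, a conditionally stationary belief in the above sense.

```python
from fractions import Fraction as F

# Illustrative two-state chain (transition matrix chosen for this sketch).
P = [[F(9, 10), F(1, 10)],
     [F(1, 2), F(1, 2)]]

# Stationary distribution pi solves pi P = pi; for this chain pi = (5/6, 1/6).
pi = [F(5, 6), F(1, 6)]
assert all(sum(pi[i] * P[i][j] for i in range(2)) == pi[j] for j in range(2))

# Stationary belief mu on pairs (a0, a1): mu(i, j) = pi_i * P[i][j].
mu = {(i, j): pi[i] * P[i][j] for i in range(2) for j in range(2)}

# Condition on the Borel event B = {a0 = 0}, which has mu(B) = pi_0 > 0.
mu_B = sum(mu[(0, j)] for j in range(2))
nu = {j: mu[(0, j)] / mu_B for j in range(2)}

# nu is the one-step law of the chain started deterministically at state 0.
print(nu[0], nu[1])  # 9/10 1/10
```

The same computation over longer histories shows that ν inherits the tail behavior of the stationary µ, as claimed.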
References

[1] P. Berti and P. Rigo. 0–1 laws for regular conditional distributions. The Annals of Probability, 35:649–662, 2007.
[2] David Blackwell and Lester Dubins. Merging of opinions with increasing information. Annals of Mathematical Statistics, 33:882–886, 1962.
[3] Thomas M. Cover. Open problems in information theory. In , pages 35–36, 1975.
[4] Anthony D'Aristotile, Persi Diaconis, and David Freedman. On merging of probabilities. Sankhyā: The Indian Journal of Statistics, Series A, pages 363–380, 1988.
[5] A. P. Dawid. The well-calibrated Bayesian. Journal of the American Statistical Association, 77(379):605–610, 1982.
[6] Matthew O. Jackson, Ehud Kalai, and Rann Smorodinsky. Bayesian representation of stochastic processes under learning: de Finetti revisited. Econometrica, 67:875–893, 1999.
[7] E. Kalai and E. Lehrer. Rational learning leads to Nash equilibrium. Econometrica, 61:1019–1045, 1993.
[8] E. Kalai, E. Lehrer, and R. Smorodinsky. Calibrated forecasting and merging. Games and Economic Behavior, 29(1):151–159, 1999.
[9] Ehud Kalai and Ehud Lehrer. Weak and strong merging of opinions. Journal of Mathematical Economics, 23:73–86, 1994.
[10] O. Kallenberg. Foundations of Modern Probability. Springer-Verlag, New York, second edition, 2002.
[11] E. Lehrer and R. Smorodinsky. Compatible measures and merging. Mathematics of Operations Research, pages 697–706, 1996.
[12] E. Lehrer and R. Smorodinsky. Relative entropy in sequential decision problems. Journal of Mathematical Economics, 33:425–439, 2000.
[13] Gusztáv Morvai and Benjamin Weiss. Forward estimation for ergodic time series. In Annales de l'Institut Henri Poincaré (B) Probability and Statistics, volume 41, pages 859–870. Elsevier, 2005.
[14] Boris Yakovlevich Ryabko. Prediction of random sequences and universal coding. Problemy Peredachi Informatsii, 24(2):3–14, 1988.
[15] Benjamin Weiss. Single Orbit Dynamics. AMS Bookstore, 2000.
[16] H. V. Weizsäcker. Some reflections on and experiences with splifs. Lecture Notes–Monograph Series, pages 391–399, 1996.
Kellogg School of Management, Northwestern University
E-mail address: [email protected]

School of Mathematics, Tel Aviv University and Kellogg School of Management, Northwestern University
E-mail address: [email protected]