[PDF] PAC-Bayes Bounds on Variational Tempered Posteriors for Markov Models

Abstract

Datasets displaying temporal dependencies abound in science and engineering applications, with Markov models representing a simplified and popular view of the temporal dependence structure. In this paper, we consider Bayesian settings that place prior distributions over the parameters of the transition kernel of a Markov model, and seeks to characterize the resulting, typically intractable, posterior distributions. We present a PAC-Bayesian analysis of variational Bayes (VB) approximations to tempered Bayesian posterior distributions, bounding the model risk of the VB approximations. Tempered posteriors are known to be robust to model misspecification, and their variational approximations do not suffer the usual problems of over confident approximations. Our results tie the risk bounds to the mixing and ergodic properties of the Markov data generating model. We illustrate the PAC-Bayes bounds through a number of example Markov models, and also consider the situation where the Markov model is misspecified.

Full PDF

aa r X i v : . [ m a t h . S T ] J a n PAC-BAYES BOUNDS ON VARIATIONAL TEMPERED POSTERIORS FORMARKOV MODELS

IMON BANERJEE, VINAYAK A. RAO

Department of Statistics, Purdue University, West Lafayette IN 47906

HARSHA HONNAPPA

School of Industrial Engineering, Purdue University, West Lafayette IN 47906

Abstract.

Datasets displaying temporal dependencies abound in science and engineering applica-tions, with Markov models representing a simpliﬁed and popular view of the temporal dependencestructure. In this paper, we consider Bayesian settings that place prior distributions over the param-eters of the transition kernel of a Markov model, and seeks to characterize the resulting, typicallyintractable, posterior distributions. We present a PAC-Bayesian analysis of variational Bayes (VB)approximations to tempered Bayesian posterior distributions, bounding the model risk of the VBapproximations. Tempered posteriors are known to be robust to model misspeciﬁcation, and theirvariational approximations do not suﬀer the usual problems of over conﬁdent approximations. Ourresults tie the risk bounds to the mixing and ergodic properties of the Markov data generatingmodel. We illustrate the PAC-Bayes bounds through a number of example Markov models, andalso consider the situation where the Markov model is misspeciﬁed. Introduction

This paper presents probably approximately correct (PAC)-Bayesian bounds on variationalBayesian (VB) approximations of fractional or tempered posterior distributions for Markov datageneration models. Exact computation of either standard or tempered posterior distributions isa hard problem that has, broadly speaking, spawned two classes of computational methods. Theﬁrst, Markov chain Monte Carlo (MCMC), constructs ergodic Markov chains to approximatelysample from the posterior distribution. MCMC is known to suﬀer from high variance and complexdiagnostics, leading to the development of variational Bayesian (VB) [25] methods as an alternativein recent years. VB methods pose posterior computation as a variational optimization problem,approximating the posterior distribution of interest by the ‘closest’ element of an appropriatelydeﬁned class of ‘simple’ probability measures. Typically, the measure of closeness used by VBmethods is the Kullback-Leibler (KL) divergence. Excellent introductions to this so-called

KL-VB method can be found in [5, 21, 6]. More recently, there has also been interest in alternative diver-gence measures, particularly the α -R´enyi divergence [16, 19, 8], though in this paper, we focus onthe KL-VB setting.Theoretical properties of VB approximations, and in particular asymptotic frequentist consis-tency, have been studied extensively under the assumption of an independent and identically dis-tributed (i.i.d.) data generation model [6, 26, 28]. On the other hand, the common setting wheredata sets display temporal dependencies presents unique challenges. In this paper, we focus on ho-mogeneous Markov chains with parameterized transition kernels, representing a parsimonious class E-mail addresses : { ibanerj,varao,honnappa } @purdue.edu . of data generation models with a wide range of applications. We work in the Bayesian framework,focusing on the posterior distribution over the unknown parameters of the transition kernel. Ourtheory develops PAC bounds that link the ergodic and mixing properties of the data generatingMarkov chain to the Bayes risk associated with approximate posterior distributions.Frequentist consistency of Bayesian methods, in the sense of concentration of the posterior dis-tribution around neighborhoods of the ‘true’ data generating distribution, have been establishedin signiﬁcant generality, in both the i.i.d. [10, 24, 22] and in the non-i.i.d. data generation set-ting [11, 3]. More recent work [1, 27, 3] has studied fractional or tempered posteriors, a classof generalized Bayesian posteriors obtained by combining the likelihood function raised to a frac-tional power with an appropriate prior distribution using Bayes theorem. Tempered posteriorsare known to be robust against model misspeciﬁcation: in the Markov setting we consider, theassociated stationary distribution as well as mixing properties are sensitive to model parameteriza-tion. Further, tempered posteriors are known to be much simpler to analyze theoretically [27, 3].Therefore, following [1, 27, 3] we focus on tempered posterior distributions on the transition kernelparameters, and study the rate of concentration of variational approximations to the temperedposterior. Equivalently, as shown in [27] and discussed in section 1.1, our results also apply toso-called α -variational approximations to standard posterior distributions over kernel parameters.The latter are modiﬁcations of the standard KL-VB algorithm to address the well-known problemof overconﬁdent posterior approximations.While there have been a number of recent papers studying the consistency of approximate vari-ational posteriors [26, 16, 1] in the large sample limit, rates of convergence have received lessattention. Exceptions include [1, 28, 15], where an i.i.d. data generation model is assumed. [1]establishes PAC-Bayes bounds on the convergence of a variational tempered posterior with frac-tional powers in the range [0 , α -mixingMarkov chain. Stationarity implies the existence of an invariant distribution corresponding to theparameterized transition kernel, implying the marginal distribution of the Markov data is invariantas well. The α -mixing condition, on the other hand, ensures that the variance of the likelihood ratioof the Markov data does not grow faster than linear in the sample size. Our main results in thissetting are applicable when the state space of the Markov chain is either continuous or discrete. Theprimary requirement on the class of data generating Markov models is for the log-likelihood ratio ofthe parameterized transition kernel and invariant distribution to satisfy a Lipschitz property. Thiscondition implies a decoupling between the model parameters and the random samples, aﬀordinga straightforward veriﬁcation of the mean and variance bounds. We highlight this main result bydemonstrating that it is satisﬁed by a ﬁnite state Markov chain, a birth-death Markov chain on thepositive integers, and a one-dimensional Gaussian linear model. AC-BAYES FOR MARKOV MODELS 3

In practice, the assumption that the data generating model is stationary is unlikely to be satisﬁed.Typically, the initial distribution is arbitrary, with the state distribution of the Markov sequenceconverging weakly to the stationary distribution. In this setting, we must further assume thatthe class of data generating Markov chains are geometrically ergodic (with an exception for ﬁnitestate Markov chains). We show that this implies the boundedness of the mean and variance ofthe log-likelihood ratio of the data generating Markov chain. Alternatively, in Theorem 4.2 wedirectly impose a drift condition on random variables that bound the log-likelihood ratio. Again,in this more general nonstationary setting, we illustrate the main results by showing that the PAC-Bayes bound is satisﬁed by a ﬁnite state Markov chain, a birth-death Markov chain on the positiveintegers, and a one-dimensional Gaussian linear model.In preparation for our main technical results starting in Section 2 we ﬁrst note relevant notationsand deﬁnitions in the next section.1.1.

Notations and Deﬁnitions.

We broadly adopt the notation in [1]. Let the sequenceof random variables X n = ( X , . . . , X n ) ⊂ R m × ( n +1) represent a data set of n + 1 observationsdrawn from a joint distribution P ( n ) θ , where θ ∈ Θ ⊆ R d is the ‘true’ parameter underlying thedata generation process. We assume the state space S ⊆ R m of the random variables X i is eitherdiscrete-valued or continuous, and write { x , . . . , x n } for a realization of the dataset. We also adoptthe convention that 0 log(0 /

0) = 0.For each θ ∈ Θ, we will write p ( n ) θ as the probability density of P ( n ) θ with respect to somemeasure Q ( n ) , i.e. p ( n ) θ := dP ( n ) θ dQ ( n ) , where Q ( n ) is either Lebesgue measure or the counting measure.All expectations and variances, which we represent as E[ X ] and Var[ X ], are taken with respect tothe true distribution P θ unless stated otherwise.Let π ( θ ) be a prior distribution with support Θ. The fractional posterior is deﬁned as π n,α | X n ( dθ ) := e − αr n ( θ,θ )( X n ) π ( dθ ) R e − αr n ( θ,θ )( X n ) π ( dθ ) , (1)where, for θ, θ ∈ Θ, r n ( θ, θ )( · ) := log (cid:18) p ( n ) θ ( · ) p ( n ) θ ( · ) (cid:19) , is the log-likelihood ratio of the correspondingdensity functions. Setting α = 1 recovers the standard Bayesian posterior.The Kullback-Leibler (KL) divergence between distributions

P, Q is deﬁned as K ( P, Q ) := Z X log (cid:18) p ( x ) q ( x ) (cid:19) p ( x ) dx, where X is an arbitrary sample space, and p, q are the densities corresponding to P, Q (respectively).In particular, the KL divergence between the distributions parameterized by θ and θ is K ( P ( n ) θ , P ( n ) θ ) := Z log p ( n ) θ ( x , . . . , x n ) p ( n ) θ ( x , . . . , x n ) ! p ( n ) θ ( x , . . . , x n ) dx · · · dx n = Z r n ( θ, θ )( x , . . . , x n ) p nθ ( x , . . . , x n ) dx · · · dx n . (2)The α -R´enyi divergence D α ( P ( n ) θ , P ( n ) θ ) is deﬁned as D α ( P ( n ) θ , P ( n ) θ ) := 1 α − Z exp ( − αr n ( θ, θ )( x , . . . , x n )) p ( n ) θ ( x , . . . , x n ) dx · · · dx n . (3)By letting α →

0, the α -R´enyi divergence recovers the KL divergence.Let F be some class of distributions with support in R d and such that any distribution P in F is absolutely continuous with respect to the tempered posterior: P ≪ π n,α | X n . Let ˜ π n,α | X n be the PAC-BAYES FOR MARKOV MODELS variational approximation to the tempered posterior, deﬁned as˜ π n,α | X n := arg min ρ ∈F K ( ρ, π n,α | X n )(4)Many choices of F exist; for instance, F can be the set of Gaussian measures, denoted F Φ id : F Φ id = { Φ( dθ ; µ, Σ) : µ ∈ R d , Σ d × d ∈ P.D. } , (5)where P.D. references the class of positive deﬁnite matrices. Alternately, F can be the family ofmean-ﬁeld or factored distributions where the components θ i of θ are independent of each other.It is easy to see that eq. (4) is equivalent to the following optimization problem:˜ π n,α | X n := arg max ρ ∈F Z r n ( θ, θ )( x , . . . , x n ) ρ ( dθ ) − α − K ( ρ, π ) . (6)Setting α = 1 again recovers the usual variational solution that seeks to approximate the posteriordistribution with the closest element of F (the right-hand side above is called the evidence lowerbound (ELBO)). Other settings of α constitute α -variational inference [27], which seeks to regularizethe ‘overconﬁdent’ approximate posteriors that standard variational methods tend to produce.1.1.1. Markov chains.

We assume the joint density or probability mass function p ( n ) θ ( x , . . . , x n ) corresponds to the‘walk probability’ of a time-homogeneous Markov chain. We call the Markov chain ‘parameterized’if the transition kernel p θ ( ·|· ) is parametrized by some θ ∈ Θ ⊆ R d . Let q (0) ( · ) be the initial density(deﬁned with respect to the Lebesgue measure over R m ) or initial probability mass function. Then,the joint density or probability mass function is p ( n ) θ ( x , . . . , x n ) = q (0) ( x ) Q n − i =0 p θ ( x i +1 | x i ).Our results in the ensuing sections will be established under strong mixing conditions [7] on theMarkov chain. Speciﬁcally, recall the deﬁnition of the α -mixing coeﬃcients of a stationary Markovchain: Deﬁnition 1.1 ( α -mixing coeﬃcient) . Let M ji denote the σ -ﬁeld generated by the Markov chain { X k : i ≤ k ≤ j } parameterized by θ ∈ Θ . Then, the α -mixing coeﬃcient is deﬁned as α k = sup t> sup ( A,B ) ∈M t −∞ ×M ∞ t + k | P θ ( A ∩ B ) − P θ ( A ) P θ ( B ) | . (7)Informally speaking, the α -mixing coeﬃcients { α k } measure the dependence between any twoevents A (in the ‘history’ σ -algebra) and B (in the ‘future’ σ -algebra) with a time lag k . Ourresults in Section 4 also rely on the ergodic properties of the Markov chain, and we assume thatthe Markov chain is V -geometrically ergodic [20, Chapter 15]. First, refer to the deﬁnition of thefunctional norm k · k V , from Deﬁnition 1.2, Deﬁnition 1.2 ( f -norm) . The functional norm in f -metric of the measure v , or the f -norm of v is deﬁned as (8) k v k f = sup g : | g |

Deﬁnition 1.3 ( V -geometric ergodicity) . A stationary Markov chain { X n } parameterized by θ ∈ Θ is V -geometrically ergodic if it is positive Harris and there exists a constant r V > , thatdepends on V , such that for any A ∈ B ( X ) , (10) n X n =1 r nV (cid:13)(cid:13)(cid:13)(cid:13) P θ ( X n ∈ A | X = x ) − Z A q θ ( y ) dy (cid:13)(cid:13)(cid:13)(cid:13) V < ∞ . It is straightforward to see that this is equivalent to (cid:13)(cid:13)(cid:13)(cid:13) P θ ( X n ∈ A | X = x ) − Z q θ ( y ) dy (cid:13)(cid:13)(cid:13)(cid:13) V ≤ Cr − n for an appropriate constant C (which may depend on the state x ). That is, the Markov chainapproaches steady state at a geometrically fast rate. If a Markov chain is V -geometrically ergodicfor V ≡

1, then, it is simply termed as geometrically ergodic . It is straightforward to see (viatheorem A.2 in the appendix) that a geometrically ergodic Markov chain is also α -mixing, with α coeﬃcients satisfying(11) X k ≥ α υk < ∞ ∀ υ > , showing that, under geometric ergodicity, the alpha mixing coeﬃcients raised to any positive power υ are ﬁnitely summable. We note here that the most standard procedure to establish V -geometricergodicity for any Markov chain is through the veriﬁcation of the drift condition, Assumption A.1.If a Markov chain satisﬁes Assumption A.1, then Theorem A.1 proves that it is also V -geometricallyergodic. 2. A Concentration Bound for the α -R´enyi Divergence The object of analysis in what follows is the probability measure ˜ π n,α | X n ( θ ), the variationalapproximation to the tempered posterior. Our main result establishes a bound on the Bayesrisk of this distribution; in particular, given a sequence of loss functions ℓ n ( θ, θ ), we bound R ℓ n ( θ, θ )˜ π n,α | X n ( θ ) dθ . Following recent work in both the i.i.d. and dependent sequence set-ting [3, 1, 27], we will use ℓ n ( θ, θ ) = D α ( P ( n ) θ , P ( n ) θ ), the α -R´enyi divergence between P ( n ) θ and P ( n ) θ as our loss function (recall that for each θ ∈ Θ and n ≥ P ( n ) θ is the distribution correspond-ing to the sequence { X , . . . , X n } ). Unlike more obvious loss functions like Euclidean distance,R´enyi divergence compares θ and θ through their eﬀect on observed sequences, so that issues likeparameter identiﬁability are no longer relevant. Our ﬁrst result generalizes [1, Theorem 2.1] to anon-i.i.d. data setting. Proposition 2.1.

Let F be a subset of all probability distributions on Θ . For any α ∈ (0 , , ǫ ∈ (0 , and n ≥ , the following probabilistic uniform upper bound on the expected α -R´enyidivergence holds: P ( n ) θ " sup ρ ∈F Z D α ( P ( n ) θ , P ( n ) θ ) ρ ( dθ ) ≤ α − α Z r n ( θ, θ ) ρ ( dθ ) + K ( ρ, π ) + log( ǫ )1 − α ≥ − ǫ. (12)The proof of Proposition 2.1 follows easily from [1], and we include it in Appendix C.1 forcompleteness. Mirroring the comments in [1], when ρ = ˜ π n,α this result is precisely [3, Theorem3.5]. This probabilistic bound implies the following PAC-Bayesian concentration bound on themodel risk computed with respect to the fractional variational posterior: Theorem 2.1.

Let F be a subset of all probability distributions parameterized by Θ , and assumethere exist ǫ n > and ρ n ∈ F such that(i) R K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) = R E[ r n ( θ, θ )] ρ n ( dθ ) ≤ nǫ n , PAC-BAYES FOR MARKOV MODELS (ii) R Var ( r n ( θ, θ )) ρ n ( dθ ) ≤ nǫ n , and(iii) K ( ρ n , π ) ≤ nǫ n .Then, for any α ∈ (0 , and ( ǫ, η ) ∈ (0 , × (0 , , P Z D α ( P ( n ) θ , P ( n ) θ )˜ π n,α ( dθ | X ( n ) ) ≤ ( α + 1) nǫ n + α q nǫ n η − log( ǫ )1 − α  ≥ − ǫ − η. (13)The proof of Theorem 2.1 is a straightforward generalization of [1, Theorem 2.4] to the non-i.i.d.setting, and a special case of [27, Theorem 3.1], where the problem setting includes latent variables.We include a proof for completeness. As noted in [1], the suﬃcient conditions follow closely from[11] and we will show that they hold for a variety of Markov chain models.A direct corollary of Theorem 2.1 follows by setting η = nǫ n , ǫ = e − nǫ n and using the fact that e − nǫ n ≥ nǫ n . Note that eq. (13) is vacuous if η + ǫ >

1. Therefore, without loss of generality, werestrict ourselves to the condition nǫ n < Corollary 2.1.

Assume ∃ ǫ n > , ρ n ∈ F such that the following conditions hold:(i) R K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) = R E [ r n ( θ, θ )] ρ n ( dθ ) ≤ nǫ n ,(ii) R Var ( r n ( θ, θ )) ρ n ( dθ ) ≤ nǫ n , and(iii) K ( ρ n , π ) ≤ nǫ n .Then, for any α ∈ (0 , , P (cid:20) Z D α ( P ( n ) θ , P ( n ) θ )˜ π n,α ( dθ | X ( n ) ) ≤ ( α + 1) ǫ n − α (cid:21) ≥ − nǫ n . (14)Observe that the ﬁrst two conditions in Corollary 2.1 ensure that the distribution ρ n concentrateson parameters θ ∈ Θ around the true parameter θ , while the third condition requires that ρ n notdiverge from the prior π rapidly as a function of the sample size n . In general, satisfying the ﬁrstand third conditions is relatively straightforward. The second condition, on the other hand, issigniﬁcantly more complicated in the current setting of dependent data, as the variance of r n ( θ, θ )includes correlations between the observations { X , . . . , X n } . In the next section, we will makeassumptions on the transition kernels (and corresponding invariant densities) that ’decouple’ thetemporal correlations and the model parameters in the setting of strongly mixing and ergodicMarkov chain models, and allow for the veriﬁcation of the conditions in Corollary 2.1.˙The computations critical to the veriﬁcation of the conditions in Corollary 2.1 are the bounds(i) and (ii), and Propositions 2.2 and 2.3 below characterize the expectation and variance of thelog-likelihood ratio r n ( · , · ) in terms of the one-step transition kernels of the Markov chain. First,consider the expectation of r n ( · , · ) in condition (i). Proposition 2.2.

Fix θ , θ ∈ Θ and consider the parameterized Markov transition kernels p θ and p θ , and initial distributions q (0) θ and q (0) θ . Let p ( n ) θ and p ( n ) θ be the corresponding joint probabilitydensities; that is, p ( n ) θ j ( x , . . . , x n ) = q (0) θ j ( x ) Q ni =1 p θ i ( x i | x i − ) for j ∈ { , } . Then, for any n ≥ ,the log-likelihood ratio r n ( θ , θ ) satisﬁes E θ [ r n ( θ , θ )] = n X i =1 E θ (cid:20) log (cid:18) p θ ( X i | X i − ) p θ ( X i | X i − ) (cid:19)(cid:21) + E[ Z ] , (15) AC-BAYES FOR MARKOV MODELS 7 where Z := log (cid:18) q (0) θ ( X ) q (0) θ ( X ) (cid:19) . The expectation in the ﬁrst term is with respect to the joint densityfunction p θ ( y, x ) = p θ ( y | x ) q ( i − θ ( x ) where the marginal density satisﬁes q ( i − θ ( x ) = (R p ( i − θ ( x , . . . , x i − , x ) dx · · · dx i − for i > , andq (0) θ ( x ) for i = 1 . If the Markov chain is also stationary under θ , then eq. (15) simpliﬁes to E θ [ r n ( θ , θ )] = n E θ (cid:20) log (cid:18) p θ ( X | X ) p θ ( X | X ) (cid:19)(cid:21) + E θ [ Z ] . (16)Notice that E θ [ r n ( θ , θ )] is precisely the KL divergence, K ( P ( n ) θ , P ( n ) θ ). Next, the followingproposition uses [13, Lemma 1.3] to upper bound the variance of the log-likelihood ratio. Proposition 2.3.

Fix θ , θ ∈ Θ and consider parameterized Markov transition kernels p θ and p θ , with initial distributions q (0) θ and q (0) θ . Let p ( n ) θ and p ( n ) θ be the corresponding joint probabilitydensities of the sequence ( x , . . . , x n ) , and q ( i ) θ j the marginal density for i ∈ { , . . . , n } and j ∈ { , } .Fix δ > and, for each i ∈ { , . . . , n } , deﬁne C ( i ) θ ,θ := Z (cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) p θ ( x i | x i − ) p θ ( x i | x i − ) (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) δ p θ ( x i | x i − ) q ( i − θ ( x i − ) dx i dx i − . Similarly, deﬁne Z := log (cid:18) q (0) θ ( X ) q (0) θ ( X ) (cid:19) , and D , := E q (0) θ | Z | δ . Suppose the Markov chain corre-sponding to θ is α -mixing with coeﬃcients { α k } . Then, Var ( r n ( θ , θ )) < n X i,j =1 (cid:18) n + 2 n δ/ ( C ( i ) θ ,θ + C ( j ) θ ,θ + 2 q C ( i ) θ ,θ C ( j ) θ ,θ ) (cid:19) (cid:16) α δ/ (2+ δ ) | i − j |− (cid:17) + n X i =1 (cid:18) n + 2 n δ/ ( C ( i ) θ ,θ + D , + q C ( i ) θ ,θ D , ) (cid:19) (cid:16) α δ/ (2+ δ ) i − (cid:17) + Cov( Z , Z ) . (17)Note that this result holds for any parameterized Markov chain. In particular, when the Markovchain is stationary, C ( i ) θ ,θ = C (1) θ ,θ ∀ i and ∀ θ ∈ Θ, and the expression eq. (17) simpliﬁes toVar ( r n ( θ , θ )) < n (cid:18) n + 6 n δ/ C (1) θ ,θ (cid:19) X k ≥ α δ/ (2+ δ ) k  + (cid:18) n + 2 n δ/ ( C (1) θ ,θ + D , + q C (1) θ ,θ D , ) (cid:19) X k ≥ α δ/ (2+ δ ) k  + Cov( Z , Z ) . (18)If the sum P k ≥ α δ/ (2+ δ ) k is inﬁnite, the bound is trivially true. For it to be ﬁnite, of course, thecoeﬃcients α k must decay to zero suﬃciently quickly. For instance, Theorem A.2 shows that if theMarkov chain is geometrically ergodic, then the α -mixing coeﬃcients are geometrically decreasing.We will use this fact when the Markov chain is non-stationary, as in Section 4. In the next section,however, we ﬁrst consider the simpler stationary Markov chain setting where ergodic conditions arenot explicitly imposed. We also note that unless only a ﬁnite number of α k are nonzero, the sum P k ≥ α δ/ (2+ δ ) k is inﬁnite when δ = 0, and our results will typically require δ > PAC-BAYES FOR MARKOV MODELS Stationary Markov Data-Generating Models

In this section we assume that corresponding to each transition kernel p θ , θ ∈ Θ , there existsan invariant distribution q ( ∞ ) θ ≡ q θ that satisﬁes q θ ( x ) = Z p θ ( x | y ) q θ ( dy ) ∀ x ∈ R m , θ ∈ Θ . We will also use q θ to designate the density of the invariant measure (as before, this is withrespect to the Lebesgue or counting measure for continuous or discrete state spaces respectively).A Markov chain is stationary if its initial distribution is the invariant probability distribution, thatis, X ∼ q θ .Observe that the PAC-Bayesian concentration bound in Corollary 2.1 speciﬁcally requires bound-ing the mean and variance of the log-likelihood ratio r n ( θ, θ ). We ensure this by imposing regularityconditions on the log-ratio of the one-step transition kernels and the corresponding invariant den-sities. Speciﬁcally, we assume the following conditions that decouple the model parameters fromthe random samples, allowing us to verify the bounds in Corollary 2.1. Assumption 3.1.

There exist positive functions M (1) k ( · , · ) and M (2) k ( · ) , k ∈ { , , . . . , m } suchthat for any parameters θ , θ ∈ Θ , the log of the ratio of one-step transition kernels and the log ofthe ratio of the invariant distributions satisfy, respectively, | log p θ ( x | x ) − log p θ ( x | x ) | ≤ m X k =1 M (1) k ( x , x ) | f (1) k ( θ , θ ) | ∀ ( x , x ) , and (19) | log q θ ( x ) − log q θ ( x ) | ≤ m X k =1 M (2) k ( x ) | f (2) k ( θ , θ ) | ∀ x. (20) We further assume that for some δ > , the functions f (1) k , f (2) k and M (1) k satisfy the following:(i) there exist constants C ( t ) k and measures ρ n ∈ F such that R | f ( t ) k ( θ, θ ) | δ ρ n ( dθ ) < C ( t ) k n for t ∈ { , } , n ≥ and k ∈ { , , . . . , m } , and(ii) there exists a constant B such that R M (1) k ( x , x ) δ p θ j ( x | x ) q (0) θ j ( x ) dx dx < B, k ∈{ , . . . , m } and j ∈ { , } . The following examples illustrate eq. (19) and eq. (20) for discrete and continuous state Markovchains.

Example 3.1.

Suppose { X , . . . , X n } is generated by the birth-death chain with parameterizedtransition probability mass function, p θ ( j | i ) = ( θ if j = i − , − θ if j = i + 1 . In this example, the parameter θ denotes the probability of birth. We shall see that, m = 3 : M (1)1 ( X , X ) = I [ X = X +1] , M (1)2 ( X , X ) = I [ X = X − , and M (1)3 ( X , X ) = 1 . We also de-ﬁne M (2)1 ( X ) = 1 , and set M (2)2 ( X ) and M (2)3 ( X ) both to X − . Let f (1)1 ( θ, θ ) = log h θ θ i , f (1)2 ( θ, θ ) = log h − θ − θ i , f (1)3 ( θ, θ ) = 0 , f (2)1 ( θ, θ ) = − f (2)3 ( θ, θ ) = log h − θ − θ i , and f (2)2 ( θ, θ ) =log h θ θ i . The derivation of these terms and that they satisfy the conditions of Assumption 3.1 isprovided in the proof of Proposition 3.3. AC-BAYES FOR MARKOV MODELS 9

Example 3.2.

Suppose { X , . . . , X n } is generated by the ‘simple linear’ Markov model X n = θX n − + W n , where { W n } is a sequence of i.i.d. standard Gaussian random variables. Then, m = 2 , with M (1)1 ( X n , X n − ) = | X n X n − | , M (1)2 ( X n , X n − ) = X n , M (2)1 ( x ) = x and M (2)2 ( X ) = 0 . Corre-sponding to these, we have f (1)1 ( θ, θ ) = ( θ − θ ) , f (1)2 ( θ, θ ) = ( θ − θ ) , f (2)1 ( θ , θ ) = ( θ − θ ) and f (2)2 ( θ , θ ) = 0 . The derivation of these quantities and that these satisfy the conditions ofAssumption 3.1 under appropriate choices of ρ n is shown in the proof of Proposition 4.3. Note that assuming the same number m of M (1) k and M (2) k involves no loss of generality, sincethese functions can be set to 0. Both eq. (19) and eq. (20) can be viewed as generalized Lipschitz-smoothness conditions, recovering the usual Lipschitz-smoothness when m = 1 and when f ( t ) k isEuclidean distance. Our generalized conditions are useful for distributions like the Gaussian, whereLipschitz smoothness does not apply. Observe also that by an application of Jensen’s inequality, As-sumption 3.1 (i) above implies that for some constant C > k ∈ { , , . . . , m } , t ∈ { , } , Z | f ( t ) k ( θ, θ ) | ρ n ( dθ ) ≤ Cn / (2+ δ ) < C √ n . (21)Assumption 3.1 (i) is satisﬁed in a variety of scenarios, for example, under mild assumptions onthe partial derivatives of the functions f ( t ) k . To illustrate this, we present the following proposition. Proposition 3.1.

Let f ( θ, θ ) be a function on a bounded domain with bounded partial derivativeswith f ( θ , θ ) = 0 . Let { ρ n ( · ) } be a sequence of probability densities on θ such that E ρ n [ θ ] = θ and Var ρ n [ θ ] = σ n for some σ > . Then, for some C > , (22) Z | f ( θ, θ ) | δ ρ n ( dθ ) < Cn . Proof.

Deﬁne ∂ θ f ( θ, θ ) := ∂f ( θ,θ ) ∂θ as the partial derivative of the function f . By the mean valuetheorem, | f ( θ, θ ) | = | θ − θ || ∂ θ f ( θ ∗ , θ ) | , for some θ ∗ ∈ [min { θ, θ } , max { θ, θ } ]. Since the partialderivatives are bounded, there exists L ∈ R such that ∂ θ f ( θ ∗ , θ ) < L , and R | f ( θ, θ ) | δ ρ n ( dθ ) | θ | < G , then (cid:12)(cid:12)(cid:12) θ − θ G (cid:12)(cid:12)(cid:12) δ < (cid:12)(cid:12)(cid:12) θ − θ G (cid:12)(cid:12)(cid:12) . Therefore, R | θ − θ | δ ρ n ( dθ ) < (2 G ) δ Var (cid:2) θ G (cid:3) < (2 G ) δ σ n . Now choosing (2 G ) δ σ as C completes theproof. (cid:3) If ∂ θ f ( t ) k is continuous and Θ is compact, then ∂ θ f ( t ) k is always bounded. Also observe that ifE h M (1) k ( X , X ) δ i < B , without loss of generality we can use Jensen’s inequality to concludethat, for all 0 < a < δ , E h M (1) k ( X , X ) a i < B a δ < B .We can now state the main theorem of this section. Theorem 3.1.

Let { X , . . . , X n } be generated by a stationary, α -mixing Markov chain parametrizedby θ ∈ Θ . Suppose that Assumption 3.1 holds and that the α -mixing coeﬃcients satisfy P k ≥ α δ/ (2+ δ ) k < + ∞ . Furthermore, assume that K ( ρ n , π ) ≤ √ nC for some constant C > .Then, the conditions of Corollary 2.1 are satisﬁed with ǫ n ∈ O (cid:16) max( √ n , n δ/ n ) (cid:17) . Theorem 3.1 is satisﬁed by a large class of Markov chains, including chains with countable andcontinuous state spaces. In particular, if the Markov chain is geometrically ergodic, then it followsfrom eq. (11) that P k ≥ α δ/ (2+ δ ) k < + ∞ . Observe that in order to achieve O ( √ n ) convergence, we need δ ≤

1. Note also that as δ decreases, satisfying the condition P k ≥ α δ/ (2+ δ ) k requires theMarkov chain to be faster mixing.We now illustrate Theorem 3.1 for a number of Markov chain models. First, consider a birth-death Markov chain on a ﬁnite state space. Proposition 3.2.

Suppose the data-generating process is a birth-death Markov chain, with one-step transition kernel parametrized by the birth probability θ ∈ Θ . Let F be the set of all Betadistributions. We choose the prior to be a Beta distribution with parameters α and β . Then, theconditions of Theorem 3.1 are satisﬁed and ǫ n ∈ O (cid:16) √ n (cid:17) .Proof. The proof of Proposition 3.2 follows from the more general Proposition 4.1, by ﬁxing theinitial distribution to the invariant distribution under θ . (cid:3) The birth death chain on the ﬁnite state space is, of course, geometrically ergodic and the alphamixing coeﬃcients α k decay geometrically. Note that the invariant distribution of this Markov chainis uniform over the state space, and consequently this is a particularly simple example. A morecomplicated and more realistic example is a birth-death Markov chain on the positive integers. Wenote that if the probability of birth in a birth-death Markov chain on positive integers is greater than0 .

5, then the Markov chain is transient, and consequently, not ergodic. As seen in Example 3.1, wedenoted the probability of birth as θ . Hence, our prior should be chosen in such a fashion that thesupport should be within (0 , . Deﬁnition 3.1 (Scaled Beta) . If X is a beta distribution on with parameters α and β , then Y issaid to be a scaled beta distribution with same parameters on the interval ( c, m + c ) if, Y = mx + c ; ( m, c ) ∈ R and in that case, via transformation of variables, the pdf of Y is obtained as, f ( y ) = ( m Beta( α,β ) (cid:0) y − cm (cid:1) α − (cid:0) − y − cm (cid:1) β − if y ∈ ( c, m + c )0 otherwise . It follows that under such circumstances, E[ Y ] = m αα + β and Var[ Y ] = m αβ ( α + β ) ( α + β +1) . Thescaled beta distribution is a popular distribution which ﬁnds use in a variety of practical applica-tions. One example of such scaled beta distribution might be when m = 0 . c = 0. Then itbecomes a beta distribution rescaled to have support on (0 , ). Another example is when m = 2and when c = −

1. Then it becomes a beta distribution rescaled to have support on ( − , Proposition 3.3.

Suppose the data-generating process is a positive recurrent birth-death Markovchain on the positive integers parameterized by the birth probability θ ∈ (0 , ) . Further let F bethe set of all Beta distributions rescaled to have support (0 , ) . We choose the prior to be a scaledBeta distribution on (0 , / with parameters α and β . Then, the conditions of Theorem 3.1 aresatisﬁed with ǫ n ∈ O (cid:16) √ n (cid:17) .Proof. The proof of Proposition 3.3 follows from that of Proposition 4.2 by ﬁxing the initial distri-bution to the invariant distribution under θ . (cid:3) Note that if the transition probability of jumping from state i to state i + 1 (i.e., the ‘birth’probability) is greater than , then the birth-death chain is transient. Therefore, we restrict toonly those cases when the probability of birth is less than . Unlike with the ﬁnite state-space, theinvariant distribution now depends on the parameter θ ∈ Θ, and veriﬁcation of the conditions ofthe proposition is more involved.Both Proposition 3.2 and Proposition 3.3 assume a discrete state space. The next example con-siders a strictly stationary simple linear model (as deﬁned in Example 3.2), which has a continuous,unbounded state space.

AC-BAYES FOR MARKOV MODELS 11

Proposition 3.4.

Suppose the data-generating model is a strictly stationary simple linear modelsatisfying the equation X n = θ X n − + W n , (23) where { W n } are i.i.d. standard Gaussian random variables and | θ | < . Let F be the spaceof all parameterized distributions with support ( − , and suppose that F is the class of all betadistributions rescaled to have the support ( − , . Then, the conditions of Theorem 3.1 are satisﬁedwith ǫ n ∈ O (cid:16) √ n (cid:17) .Proof. The proof that simple linear model satisﬁes Assumption 3.1 is deferred to the proof ofProposition 4.3. The simple linear model with | θ | < α -mixing coeﬃcients as a consequence of [20, eq. (15.49)] and Theorem A.2. Combiningthese two facts, it follows that the conditions of Theorem 3.1 are satisﬁed. (cid:3) Observe that Theorem 2.1 (and Corollary 2.1) are general, and hold for any dependent data-generating process. Therefore, there can be Markov chains that satisfy these, but do not satisfyAssumption 3.1 which entails some loss of generality. However, as our examples demonstrate,common Markov chain models do indeed satisfy the latter assumption.4.

Non-Stationary, Ergodic Markov Data-Generating Models

We call a time-homogeneous Markov chain non-stationary if the initial distribution q (0) is notthe invariant distribution. There are two sets of results in this setting: in Theorem 4.1 and Theo-rem 4.2 we explicitly impose the α -mixing condition, while in Theorem 4.3 we impose a V -geometricergodicity condition (Deﬁnition 1.3). As seen in eq. (11) if the Markov chain is also geometricallyergodic, then ∀ δ > P α δ/ (2+ δ ) k < ∞ . This condition can be relaxed, albeit at the risk ofmore complicated calculations that, nonetheless, mirror those in the geometrically ergodic setting.A common thread through these results is that we must impose some integrability or regularityconditions on the functions M (1) k .First, in Theorem 4.1 we assume that the M (1) k functions in Assumption 3.1 are uniformlybounded and that the α -mixing condition is satisﬁed. This result holds for both discrete andcontinuous state space settings. Theorem 4.1.

Let { X , . . . , X n } be generated by an α -mixing Markov chain parametrized by θ ∈ Θ with transition probabilities satisfying Assumption 3.1 and with known initial distribution q (0) . Let { α k } be the α -mixing coeﬃcients under θ , and assume that P k ≥ α δ/ (2+ δ ) k < + ∞ . Suppose thatthere exists B ∈ R such that sup x,y | M (1) k ( x, y ) | < B for all k ∈ { , , . . . , m } in Assumption 3.1.Furthermore, assume that there exists ρ n ∈ F such that K ( ρ n , π ) ≤ √ nC for some constant C > .If the initial distribution q (0) satisﬁes E q (0) | M (2) k ( X ) | < + ∞ for all k ∈ { , , . . . , m } , then theconditions of Corollary 2.1 are satisﬁed with ǫ n = O (cid:16) max( √ n , n δ/ n ) (cid:17) . The following result in Proposition 4.1 illustrates Theorem 4.1 in the setting of a ﬁnite statebirth-death Markov chain.

Proposition 4.1.

Suppose the data-generating process is a ﬁnite state birth-death Markov chain,with one-step transition kernel parametrized by the birth probability θ . Let F be the set of all Betadistributions. We choose the prior to be a Beta distribution with parameters α and β . Then, theconditions of Theorem 4.1 are satisﬁed with ǫ n = O (cid:16) √ n (cid:17) for any initial distribution q (0) . Theorem 4.1 also applies to data generated by Markov chains with countably inﬁnite state spaces,so long as the class of data-generating Markov chains is strongly ergodic and the initial distribution has ﬁnite second moments. The following example demonstrates this in the setting of a birth-deathMarkov chain on the positive integers, where the initial distribution is assumed to have ﬁnite secondmoments.

Proposition 4.2.

Suppose the data-generating process is a birth-death Markov chain on the non-negative integers, parameterized by the probability of birth θ ∈ (0 , ) . Further let F be the setof all Beta distributions rescaled upon the support (0 , ) . Let q (0) be a probability mass functionon non-negative integers such that P ∞ i =1 i q (0) ( i ) < + ∞ . We choose the prior to be a scaled Betadistribution on (0 , / with parameters α and β . Then, the conditions of Theorem 4.1 are satisﬁedwith ǫ n = O (cid:16) √ n (cid:17) . Since continuous functions on a compact domain are bounded, we have the following (easy)corollary (stated without proof).

Corollary 4.1.

Let { X , . . . , X n } be generated by an α -mixing Markov chain parametrized by θ ∈ Θ on a compact state space, and with initial distribution q (0) . Suppose the α -mixing coeﬃcientssatisfy P k ≥ α δ/ (2+ δ ) k < + ∞ , and that Assumption 3.1 holds with continuous functions M (1) k ( · , · ) , k ∈ { , , . . . , m } . Furthermore, assume that there exists ρ n such that K ( ρ n , π ) ≤ √ nC for someconstant C. Then, Theorem 4.1 is satisﬁed with ǫ n = O (cid:16) max( √ n , n δ/ n ) (cid:17) . In general the M (1) k functions will not be uniformly bounded (consider the case of a Gaussian-Markov simple linear model in Example 3.2), and stronger conditions must be imposed on thedata-generating Markov chain itself. The following assumption imposes a ‘drift’ condition from [12].Speciﬁcally, [12, Theorem 2.3] shows that under the conditions of Assumption 4.1, the momentgenerating function of an aperiodic Markov chain { X n } can be upper bounded by a function of themoment generating function of X . Together with the α -mixing condition, Assumption 4.1 impliesthat this Markov data generating process satisﬁes Corollary 2.1. Assumption 4.1.

Consider a Markov chain { X n } parameterized by θ ∈ Θ . Let M n −∞ de-note the σ -ﬁeld generated by { X −∞ , . . . , X n − , X n } . Denote the stochastic process { M kn } := { M (1) k ( X n , X n − ) } ; recall M (1) k , for each k = 1 , . . . , m , are deﬁned in Assumption 3.1. For each k = 1 , . . . , m , the process { M kn } is assumed to satisfy the following conditions: • The drift condition holds for { M kn } , i.e. E (cid:2) M kn − M kn − |M n − −∞ , M kn − > a (cid:3) ≤ − ǫ for some ǫ, a > . • For some λ > and D > , E h e λ ( M kn − M kn − ) |M n − −∞ i ≤ D . Under this drift condition, the next theorem shows that Corollary 2.1 is satisﬁed.

Theorem 4.2.

Let { X , . . . , X n } be generated by an aperiodic α -mixing Markov chain parametrizedby θ ∈ Θ and initial distribution q (0) . Suppose that Assumption 3.1 and Assumption 4.1 hold, andthat the α -mixing coeﬃcients satisfy P k ≥ α δ/ (2+ δ ) k < + ∞ . Furthermore, assume K ( ρ n , π ) ≤ √ nC for some constant C > . If R e λM (1) k ( y,x ) p θ ( y | x ) q (0)1 ( x ) dx < + ∞ for all k = 1 , . . . , m , then theconditions of Corollary 2.1 are satisﬁed with ǫ n = O (cid:16) max( √ n , n δ/ n ) (cid:17) . Verifying the conditions in Theorem 4.2 can be quite challenging in general. Instead, we suggesta diﬀerent approach that requires V -geometric ergodicity. Unlike the drift condition in Assump-tion 4.1, V -geometric ergodicity additionally requires the existence of a petite set. However, geo-metric ergodicity is a fairly standard property that has already been established from a number ofimportant Markov chains. As noted before, geometric ergodicity implies α -mixing of the Markovchain with geometrically decaying mixing coeﬃcients. As with Theorem 4.2, we assume that the AC-BAYES FOR MARKOV MODELS 13

Markov chain is aperiodic. It is well known that in case of a periodic Markov chain with period d ,the notion of geometric ergodicity only makes sense for the d -chain { X dn } . Theorem 4.3.

Let { X , . . . , X n } be generated by an aperiodic Markov chain parametrized by θ ∈ Θ with known initial distribution q (0) , and assumed to be V -geometrically ergodic forsome V : R m → [1 , ∞ ) . Suppose that Assumption 3.1 holds and R M (1) k ( y, x ) δ p θ ( y | x ) dy . Furthermore, assume that K ( ρ n , π ) ≤ √ nC for some constant C > .If the initial distribution q (0) satisﬁes E q (0) [ V ( X )] < + ∞ , then the conditions of Corollary 2.1 aresatisﬁed with ǫ n = O (cid:16) max( √ n , n δ/ n ) (cid:17) . The following Proposition 4.3 shows, the simple linear model satisﬁes Theorem 4.3 when theparameter θ is suitably restricted. Proposition 4.3.

Consider the simple linear model satisfying the equation X n = θ X n − + W n , (24) where { W n } are i.i.d. standard Gaussian random variables and | θ | < δ − for δ > . Let F bethe space of all scaled Beta distributions on ( − , and suppose the prior π is a uniform distributionon ( − , . Then, the conditions of Theorem 4.3 are satisﬁed with ǫ n ∈ O (cid:16) max( √ n , n δ/ n ) (cid:17) , if theinitial distribution q (0) satisﬁes E q (0) [ X δ ] < + ∞ . Misspecified Models

We shown next how our results can be extended to the misspeciﬁed model setting. Assume thatthe true data generating distribution is parametrized by θ Θ. Let θ ∗ n := arg min θ ∈ Θ K ( P ( n ) θ , P ( n ) θ )represent the closest parametrized distribution in the variational family to the data-generatingdistribution. Further, assume that our usual hypotheses are satisﬁed:(i) R E[ r n ( θ, θ ∗ n )] ρ n ( dθ ) ≤ nǫ n ,(ii) R Var ( r n ( θ, θ ∗ n )) ρ n ( dθ ) ≤ nǫ n .Now, since r n ( θ, θ ) = r n ( θ, θ ∗ n ) + r n ( θ ∗ n , θ ), we have(25) Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ E[ r n ( θ , θ ∗ n )] + nǫ n . Similarly, decomposing the variance it follows thatVar[ r n ( θ, θ )] = Var[ r n ( θ, θ ∗ n )] + Var[ r n ( θ ∗ n , θ )] + 2Cov[ r n ( θ, θ ∗ n ) , r n ( θ ∗ n , θ )] . (26)Using the fact that 2 ab ≤ a + b to the covariance term 2Cov[ r n ( θ, θ ∗ n ) , r n ( θ ∗ n , θ )] =2E [( r n ( θ, θ ∗ n ) − E[ r n ( θ, θ ∗ n )]) ( r n ( θ ∗ n , θ ) − E[ r n ( θ ∗ n , θ )])], we haveVar[ r n ( θ, θ )] ≤ r n ( θ, θ ∗ n )] + 2Var[ r n ( θ ∗ n , θ )] . (27)Integrating both sides with respect to ρ n ( dθ ) we get Z Var[ r n ( θ, θ )] ρ n ( dθ ) ≤ Z Var[ r n ( θ, θ ∗ n )] ρ n ( dθ ) + 2 Z Var[ r n ( θ ∗ n , θ )] ρ n ( dθ ) ≤ nǫ n + 2Var[ r n ( θ ∗ n , θ )] . (28)Consequently, we arrive at the following result: Theorem 5.1.

Let F be a subset of all probability distributions parameterized by Θ . Let θ ∗ n =arg min θ ∈ Θ K ( P ( n ) θ , P ( n ) θ ) and assume there exist ǫ n > and ρ n ∈ F such that(i) R E[ r n ( θ, θ ∗ n )] ρ n ( dθ ) ≤ nǫ n , (ii) R Var ( r n ( θ, θ ∗ n )) ρ n ( dθ ) ≤ nǫ n , and(iii) K ( ρ n , π ) ≤ nǫ n .Then, for any α ∈ (0 , and ( ǫ, η ) ∈ (0 , × (0 , , P Z D α ( P ( n ) θ , P ( n ) θ )˜ π n,α ( dθ | X ( n ) ) ≤ ( α + 1) nǫ n + E[ r n ( θ , θ ∗ n )] + α q nǫ n +2Var[ r n ( θ ∗ n ,θ )] η − log( ǫ )1 − α  (29) ≥ − ǫ − η. The proof of this theorem is straightforward and follows from the proof of Theorem 2.1 byplugging in the upper bounds for KL-divergence from eq. (25), and variance from eq. (28) toEquation (39). A sketch of the proof is presented in the appendix.6.

Conclusion

Concentration of the KL-VB model risk, in terms of the expected α -R´enyi divergence, is wellestablished under the i.i.d. data generating model assumption. Here, we extended this to thesetting of Markov data generating models, linking the concentration rate to the mixing and ergodicproperties of the Markov model. Our results apply to both stationary and non-stationary Markovchains, as well as to the situation with misspeciﬁed models. There remain a number of openquestions. An immediate one is to extend the current analysis to continuous-time Markov chainsand Markov jump processes, possibly using uniformization of the continuous time model. Anotherdirection is to extend this to the setting of non-homogeneous Markov chains, where analogues ofnotions such as stationarity are less straightforward. Further, as noted in the introduction, [3]establish PAC-Bayes bounds under slightly weaker ‘existence of test functions’ conditions, whileour results are established under the stronger conditions used by [1] for the i.i.d. setting. Weakeningthe conditions in our analysis is important, but complicated. A possible path is to build on resultsfrom [4] who provides conditions form the existence of exponentially powerful test functions existfor distinguishing between two Markov chains. It is also known that there exists a likelihoodratio test separating any two ergodic measures [23]. However, leveraging these to establish thePAC-Bayes bounds for the KL-VB posterior is a challenging eﬀort that we leave to future papers.Finally it is of interest to generalize our PAC-bounds to posterior approximations beyond KL-variational inference, such as α -R´enyi posterior approximations [19], and loss-calibrated posteriorapproximations [18, 14]. Appendix A. Definitions Related to Markov Chains

As noted before, ergodicity plays an acute role in establishing our results. We consolidate variousdeﬁnitions used throughout the paper in this appendix. We assume that the parameterized Markovchains possess an invariant probability density or mass function q θ under parameter θ ∈ Θ.We deﬁned V -geometric ergodicity in the previous sections. In this section, we provide a suﬃcientcondition for a Markov chain to be V -geometrically ergodic. First, we recall the deﬁnition of petitesets, Deﬁnition A.1 (Petite Sets) . Let X , . . . , X n be n samples from a Markov chain taking values onthe state space X . Let C be a set. We shall call C to be v q petite if K q ( x, B ) ≥ υ q ( B ) for all x ∈ C and B ∈ B ( X ) , and a non-trivial measure υ q on B ( X )Now, let ∆ V ( x ) := E [ V ( X n ) | X n − = x ] − V ( x ) for V : S → [1 , ∞ ). AC-BAYES FOR MARKOV MODELS 15

Assumption A.1 (Drift condition) . Suppose the chain { X n } is, aperiodic and ψ -irreducible .Let there exists a petite set C , constants b < ∞ , β > , and a non-trivial function V : S → [1 , ∞ ) satisfying ∆ V ( x ) ≤ − βV ( x ) + bI x ∈ C ∀ x ∈ S. (30)If a Markov chain drifts towards a petite set then it is V -geometrically ergodic. Suppose, forsimplicity, that V ( x ) = | X | . Then, the drift condition becomes E [ | X n k X x − ] − | X n − | = − β | X n | + bI X n ∈ C . The left hand side of this equation represents the change in the state of the Markov chainin one time epoch. Thus, the condition in Assumption A.1 essentially states that the Markov chaindrifts towards a petite set C and then, once it reaches that set, moves to any point in the statespace with at least some probability independent of C . Theorem A.1 (Geometrically ergodic theorem) . Suppose that { X n } is satisﬁes Assumption A.1.Then, the set S V = { x : V ( x ) < ∞} is absorbing, i.e. P θ ( X ∈ S V | X = x ) = 1 ∀ x ∈ S V , and full,i.e. ψ ( S cV ) = 0 . Also, ∃ constants r > , R < ∞ such that, for any A ∈ B ( S ) , (31) (cid:13)(cid:13)(cid:13)(cid:13) P θ ( X n ∈ A | X = x ) − Z A q θ ( y ) dy (cid:13)(cid:13)(cid:13)(cid:13) V ≤ Rr − n V ( x ) . Any aperiodic and ψ -irreducible Markov chain satisfying the drift condition is geometricallyergodic. A consequence of Equation (9) is that if, { X n } is V -geometrically ergodic, then for anyother function U, such that | U | < V , it is also U -geometrically ergodic. In essence, a geometricallyergodic Markov chain is asymptotically uncorrelated in a precise sense. Recall ρ -mixing coeﬃcientsdeﬁned as follows. Let A be a sigma ﬁeld and L ( A ) be the set of square integrable, real valued, A measurable functions. Deﬁnition A.2 ( ρ -mixing coeﬃcient) . Let M ji denote the sigma ﬁeld generated by the measures X k , where i ≤ k ≤ j . Then, (32) ρ k = sup t> sup ( f,g ) ∈L ( M t −∞ ) ×L ( M ∞ t + k ) | Corr( f, g ) | , where Corr is the correlation function.

Theorem A.2. If X n is geometrically ergodic, then it is α -mixing. That is, there exists a constant c > such that α k ∈ O ( e − ck ) .Proof. By [17, Theorem 2] it follows that a geometrically ergodic Markov chain is asymptoticallyuncorrelated with ρ -mixing coeﬃcients (see deﬁnition A.2) that satisfy ρ k ∈ O ( e − ck ). Furthermore,it is well known that [7, 17] α k ≤ ρ k , implying α k ∈ O ( e − ck ). (cid:3) Appendix B. Bounding the KL-divergence between Beta distributions

The following results will be utilized in the proofs of Propositions 4.1, 4.2 and 4.3.

Lemma B.1.

Let θ ∈ (0 , . Let, ρ n be a sequence of Beta distributions with parameters α n = nθ and β n = n (1 − θ ) . Let π denote an uniform distribution, U (0 , . Then, K ( ρ n , π ) < C + log( n ) ,for some constant C > .Proof. The KL divergence K ( ρ n , π ) can be written as R log (cid:0) ρ n π (cid:1) ρ n ( dθ ). Since π is uniform, π ( θ ) = 1whenever θ ∈ (0 , ρ n R log ( ρ n ( θ )) ρ n ( dθ ) , which can be written as K ( ρ n , π ) = ( α n − ψ ( α n ) + ( β n − ψ ( β n ) − ( α n + β n − ψ ( α n + β n ) − log Beta( α n , β n ) , (33) where ψ is the digamma function. Using Stirling’s approximation on Beta( α n , β n ) yields,Beta( α n , β n ) = √ π α α n − / n β β n − / n ( α n + β n ) α n + β n − / (1 + o (1)) . Plugging in the values of α n and β n , we get,Beta( α n , β n ) = √ π ( nθ ) ( nθ ) − / ( n (1 − θ )) ( n (1 − θ )) − / ( nθ + n (1 − θ )) nθ + n (1 − θ ) − / (1 + o (1))= √ π n n − θ nθ − / (1 − θ ) n (1 − θ ) − / n n − / (1 + o (1))= r πn θ nθ − / (1 − θ ) n (1 − θ ) − / (1 + o (1)) . Setting C = log( √ π ), we have, − log Beta( α n , β n ) < log(1 + o (1)) + C + 12 log( n ) − ( nθ − /

2) log( θ ) − ( n (1 − θ ) − /

2) log(1 − θ ) . Now, we analyze the term ( α n − ψ ( α n ). From [2] we have that log( x ) − x < ψ ( x ) < log( x ) − x ∀ x >

0. Without loss of generality, assume α n >

1, giving( α n − ψ ( α n ) < ( α n −

1) log( α n ) − α n − α n . If not, then we would have obtained using the lower bound of ψ ( x ),( α n − ψ ( α n ) < ( α n −

1) log( α n ) − α n − α n , and we could have proceeded similarly. By plugging in the value of α n , we get( α n − ψ ( α n ) < ( nθ −

1) log( nθ ) − (cid:18) / − nθ (cid:19) = ( nθ −

1) log( n ) + ( nθ −

1) log( θ ) − (cid:18) / − nθ (cid:19) . Similarly assuming β n >

1, we get the following upper bound for ( β n − ψ ( β n ), giving( β n − ψ ( β n ) < ( n (1 − θ ) −

1) log( n ) + ( n (1 − θ ) −

1) log(1 − θ ) − (cid:18) / − n (1 − θ ) (cid:19) Combining all the terms, we get, for C ( n )2 = C − (1 / − nθ ) − (1 / − n (1 − θ ) )( α n − ψ ( α n ) + ( β n − ψ ( β n ) − ( α n + β n − ψ ( α n + β n ) − log Beta( α n , β n ) < C ( n )2 −

12 (log( θ ) + log(1 − θ )) + ( nθ −

1) log( n ) + ( n (1 − θ ) −

1) log( n )+ 12 log( n ) − ( α n + β n − ψ ( α n + β n ) + log(1 + o (1)) . Now plugging in the values of α n and β n in ( α n + β n − ψ ( α n + β n ), we get,( α n − ψ ( α n ) + ( β n − ψ ( β n ) − ( α n + β n − ψ ( α n + β n ) − log Beta( α n , β n ) < C ( n )2 −

12 (log( θ ) − log(1 − θ )) + ( n −

2) log( n ) + 12 log( n ) − ( n − ψ ( n )+ log(1 + o (1)) . AC-BAYES FOR MARKOV MODELS 17

By using the lower bound of ψ ( x ), we get that, − ψ ( x ) < − log( x ) + x . Plugging this into the aboveequation, we get,( α n − ψ ( α n ) + ( β n − ψ ( β n ) − ( α n + β n − ψ ( α n + β n ) − log Beta( α n , β n ) < C ( n )2 −

12 (log( θ ) − log(1 − θ )) + ( n −

2) log( n ) − ( n −

2) log( n ) + 12 log( n )+ n − n + log(1 + o (1))= C ( n )2 −

12 (log( θ ) − log(1 − θ )) + 12 log( n ) + n − n + log(1 + o (1)) . Since nθ + n (1 − θ ) ) < C (2) n can be upper bounded by C + 1. Using this fact, we get( α n − ψ ( α n ) + ( β n − ψ ( β n ) − ( α n + β n − ψ ( α n + β n ) − log Beta( α n , β n ) < C + 12 log( n ) , for some large constant C . (cid:3) Proposition B.1.

Let θ ∈ (0 , . Let, ρ n be a sequence of Beta distributions with parameters α n = nθ and β n = n (1 − θ ) . Let π denote an Beta distribution, with parameters ( α, β ) . Then, K ( ρ n , π ) < C + log( n ) , for some constant C > .Proof. The KL-divergence between ρ n and π can be written as, K ( ρ n , π ) = Z log (cid:16) ρ n π (cid:17) ρ n ( dθ ) = Z log (cid:16) ρ n U (cid:17) ρ n ( dθ ) + Z log (cid:18) Uπ (cid:19) ρ n ( dθ ) , where, U is an uniform distribution on (0 , Z log (cid:18) Uπ (cid:19) ρ n ( dθ ) = Z log α,β ) θ α − (1 − θ ) β − ! ρ n ( dθ )= C − ( α − Z log( θ ) ρ n ( dθ ) − ( β − Z log(1 − θ ) ρ n ( dθ ) , where C is log(Beta( α, β )). Since, ρ n follows a Beta distribution with parameters α n = nθ and β n = n (1 − θ ), we get that, Z log (cid:18) Uπ (cid:19) ρ n ( dθ ) = C − ( α −

1) [ ψ ( α n ) − ψ ( α n + β n )] − ( β −

1) [ ψ ( β n ) − ψ ( α n + β n )]Since, log( x ) − x < ψ ( x ) < log( x ) − x , looking at the term [ ψ ( α n ) − ψ ( α n + β n )], we get that,[ ψ ( α n ) − ψ ( α n + β n )] = [ ψ ( nθ ) − ψ ( nθ + n (1 − θ ))]= [ ψ ( nθ ) − ψ ( n )] . Using the upper bound on ψ ( nθ ) and the lower bound on ψ ( n ), we get[ ψ ( nθ ) − ψ ( n )] < log( nθ ) − nθ − log( n ) + 1 nθ = log( θ ) + 12 nθ . We can also get a lower bound very similarly.[ ψ ( α n ) − ψ ( α n + β n )] > log( θ ) − nθ . Therefore it follows that(34) log( θ ) − nθ < [ ψ ( α n ) − ψ ( α n + β n )] < log( θ ) + 12 nθ . Similarly, we have(35) log(1 − θ ) − n (1 − θ ) < [ ψ ( β n ) − ψ ( α n + β n )] < log(1 − θ ) + 12 n (1 − θ ) . Consequently, for a large enough constant C > − C < [ ψ ( α n ) − ψ ( α n + β n )] < C , and − C < [ ψ ( β n ) − ψ ( α n + β n )] < C . Without loss of generality assume that min { α − , β − } >

0. Then we get, − ( α − C < ( α −

1) [ ψ ( α n ) − ψ ( α n + β n )] < ( α − C , and − ( α − C < ( α −

1) [ ψ ( β n ) − ψ ( α n + β n )] < ( α − C . If, either of α − β − C − ( α −

1) [ ψ ( α n ) − ψ ( α n + β n )] − ( β −

1) [ ψ ( β n ) − ψ ( α n + β n )] < C + C θ + C (1 − θ ) < C, for some large constant C. Finally, we upper bound R log (cid:0) ρ n U (cid:1) ρ n ( dθ ) by Lemma B.1 thereby com-pleting the proof. (cid:3) Appendix C. Proofs of Main Results

C.1.

Proposition 2.1.

We start by recalling the variational formula of Donsker and Varadhan [9].

Lemma C.1 (Donsker-Varadhan) . For any probability distribution function π on Θ , and for anymeasurable function h : Θ → R , if R e h dπ < ∞ , then log Z e h dπ = sup ρ ∈M + (Θ) (cid:26)Z hdρ − K ( ρ, π ) (cid:27) (36) Proof of Prop. 2.1.

Fix α ∈ (0 , θ ∈ Θ. First, observe that by the deﬁnition of the α -R´enyidivergence we haveE ( n ) θ [exp( − αr n ( θ, θ ))] = exp[ − (1 − α ) D α ( P ( n ) θ , P ( n ) θ )]Multiplying both sides of the equation by exp[(1 − α ) D α ( P ( n ) θ , P ( n ) θ ) and integrating with respectto (w.r.t.) π ( θ ) it follows that Z E ( n ) θ h exp (cid:16) − αr n ( θ, θ ) + (1 − α ) D α ( P ( n ) θ , P ( n ) θ ) (cid:17)i π ( dθ ) = 1 , orE ( n ) θ (cid:20)Z exp (cid:16) − αr n ( θ, θ ) + (1 − α ) D α ( P ( n ) θ , P ( n ) θ ) (cid:17) π ( dθ ) (cid:21) = 1 . Deﬁne h ( θ ) := − αr n ( θ, θ ) + (1 − α ) D α ( P ( n ) θ , P ( n ) θ ). Then, applying Lemma C.1 to the integrandon the left hand side (l.h.s.) above, it follows thatE ( n ) θ " exp sup ρ ∈M + (Θ) (cid:20)Z h ( θ ) ρ ( dθ ) − K ( ρ, π ) (cid:21)! = 1 . AC-BAYES FOR MARKOV MODELS 19

Multiply both sides of this equation by ǫ > ( n ) θ " exp sup ρ ∈M + (Θ) (cid:20)Z h ( θ ) ρ ( dθ ) − K ( ρ, π ) + log( ǫ ) (cid:21)! = ǫ. Now, by Markov’s inequality, we have(37) P ( n ) θ (cid:20) sup ρ ∈M + (Θ) Z ( − αr n ( θ, θ ) + (1 − α ) D α ( P ( n ) θ , P ( n ) θ )) ρ ( dθ ) − K ( ρ, π ) + log( ǫ ) ≥ (cid:21) ≤ ǫ. Thus, it follows via complementation that P ( n ) θ (cid:20) ∀ ρ ∈ F (Θ) Z D α ( P ( n ) θ , P ( n ) θ ) ρ ( dθ ) ≤ α (1 − α ) Z r n ( θ, θ ) ρ ( dθ ) + K ( ρ, π ) − log( ǫ )1 − α (cid:21) ≥ − ǫ, thereby completing the proof. (cid:3) C.2.

Theorem 2.1:

Proof of Theorem 2.1.

Recall the deﬁnition of the fractional posterior and the VB approximation, π n,α | X n = exp − αr n ( θ,θ )( X n ) π ( dθ ) R exp − αr n ( γ,θ )( X n ) π ( dγ ) , ˜ π n,α | X n = arg min ρ ∈F K ( ρ, π n,α | X ( n ) ) . It follows by deﬁnition of the KL divergence that˜ π n,α | X n = arg min ρ ∈F (cid:26) − α Z r n ( θ, θ ) ρ ( dθ ) + K ( ρ, π ) (cid:27) , (38)where π is the prior distribution. Following Proposition 2.1 it follows that for any ǫ > Z D α ( P ( n ) θ , P ( n ) θ )˜ π ( dθ | X n ) ≤ α (1 − α ) Z r n ( θ, θ ) ρ ( dθ ) + K ( ρ, π ) − log( ǫ )1 − α , with probability 1 − ǫ . We ﬁx an η ∈ (0 , P ( n ) θ  α − α Z r n ( θ, θ ) ρ n ( dθ ) ≥ α − α Z E[ r n ( θ, θ )] ρ n ( dθ ) + α − α s Var[ R r n ( θ, θ ) ρ n ( dθ )] η + K ( ρ n , π )1 − α  = P ( n ) θ  α − α Z r n ( θ, θ ) ρ n ( dθ ) − α − α Z E[ r n ( θ, θ )] ρ n ( dθ ) − K ( ρ n , π )1 − α ≥ α − α s Var[ R r n ( θ, θ ) ρ n ( dθ )] η  ≤ Var h α − α R r n ( θ, θ ) ρ n ( dθ ) − α − α R E[ r n ( θ, θ )] ρ n ( dθ ) − K ( ρ n ,π )1 − α i α (1 − α ) Var[ R r n ( θ,θ ) ρ n ( dθ )] η . Note that α − α R E ( r n ( θ, θ )) ρ n ( dθ ) and K ( ρ n ,π )1 − α are constants with respect to the data, implyingVar (cid:20) α − α Z r n ( θ, θ ) ρ n ( dθ ) − α − α Z E[ r n ( θ, θ )] ρ n ( dθ ) − K ( ρ n , π )1 − α (cid:21) = α (1 − α ) Var (cid:20)Z r n ( θ, θ ) ρ n ( dθ ) (cid:21) . Therefore, we have P ( n ) θ (cid:20) α − α Z r n ( θ, θ ) ρ n ( dθ ) ≥ α − α Z E[ r n ( θ, θ )] ρ n ( dθ )+ α − α s Var[ R r n ( θ, θ ) ρ n ( dθ )] η + K ( ρ n , π )1 − α (cid:21) ≤ η. From Proposition 2.1, with probability 1 − ǫ the following holds Z D α ( P ( n ) θ , P ( n ) θ )˜ π n,α | X n ( dθ ) ≤ α R r n ( θ, θ ) ρ n ( dθ ) + K ( ρ n , π ) − log( ǫ )1 − α . Therefore, with probability 1 − η − ǫ the following statement holds Z D α ( P ( n ) θ , P ( n ) θ )˜ π n,α | X n ( dθ ) ≤ α − α Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) + α − α s Var[ R r n ( θ, θ ) ρ n ( dθ )] η + K ( ρ n , π ) − log( ǫ )1 − α . (39)Next, observe thatVar (cid:20)Z r n ( θ, θ ) ρ n ( dθ ) (cid:21) = E ( n ) θ "(cid:12)(cid:12)(cid:12)(cid:12)Z r n ( θ, θ ) ρ n ( dθ ) − E (cid:20)Z r n ( θ, θ ) ρ n ( dθ ) (cid:21)(cid:12)(cid:12)(cid:12)(cid:12) ≤ Z Var[ r n ( θ, θ )] ρ n ( dθ ) , by a straightforward application of Jensen’s inequality to the inner integral on the left hand side.Finally, following the hypotheses (i), (ii) and (iii), we have, Z D α ( P ( n ) θ , P ( n ) θ )˜ π n,α | X n ( dθ ) ≤ α − α Z (cid:18) K ( P ( n ) θ , P ( n ) θ ) + s R Var[ r n ( θ, θ )] ρ n ( dθ ) η (cid:19) ρ n ( dθ )+ 1 α ( K ( ρ n , π ) − log( ǫ )) ≤ α ( ǫ n + q nǫ n η )1 − α + nǫ n − log( ǫ )1 − α , thereby concluding the proof. (cid:3) Lemma 2 :

C.3.

Proposition 2.2.

Proof of Proposition 2.2.

We deﬁne Y i := log (cid:16) p θ ( X i | X i − ) p θ ( X i | X i − ) (cid:17) for i = 1 , . . . , n , and Z =log (cid:18) q (0)1 ( X ) q (0)2 ( X ) (cid:19) . Then, using the Markov property we can see that the Kullback-Leibler divergencebetween the joint distributions P ( n ) θ and P ( n ) θ satisﬁes K (cid:16) P ( n ) θ , P ( n ) θ (cid:17) = P ni =1 E θ [ Y i ] + E θ [ Z ] . Ifthe Markov chain { X i } is stationary under θ , so is { Y i } . Hence Y i d = Y and the above equationreduces to, K (cid:16) P ( n ) θ , P ( n ) θ (cid:17) = n E θ [ Y ] + E θ [ Z ] . (40) (cid:3) AC-BAYES FOR MARKOV MODELS 21

C.4.

Proposition 2.3.

First, recall the following result from [13].

Lemma C.2. [13, Lemma 1.2] Let X −∞ , . . . , X , X , . . . be an alpha mixing Markov chain withalpha mixing coeﬃcients given by α k . Let M ba be the sigma-ﬁeld generated by the subsequence ( X a , X a +1 , . . . , X b ) . Let η t ∈ M t −∞ and τ t ∈ M ∞ t + k be adapted random variables such that | η t | ≤ , | τ t | ≤ . Then, sup t sup η t ,τ t | E[ η t τ t ] − E[ η t ]E[ τ t ] | ≤ α k . (41)This lemma provides an upper bound on the covariance of events η and τ , as shown next. Lemma C.3.

Let η ∈ M t −∞ τ ∈ M ∞ t + k be such that, E | η | δ ≤ C , E | τ | δ ≤ C for some δ > .Then, for a ﬁxed n < + ∞ , we have (42) | E ητ − E η E τ | ≤ (cid:18) n + 2 n δ/ ( C + C ) + 2 n δ/ p C C (cid:19) α δ/ (2+ δ ) k . Proof.

Let

N < + ∞ be a ﬁxed number. We get from the triangle inequality that | E ητ − E η E τ | ≤ | E ητ I [ | η |≤ N, | τ |≤ N ] − E ηI [ | η |≤ N ] E τ I [ | τ |≤ N ] | (43) + | E ητ I [ | η |≥ N, | τ |≤ N ] − E ηI [ | η |≥ N ] E τ I [ | τ |≤ N ] | + | E ητ I [ | η |≤ N, | τ |≥ N ] − E ηI [ | η |≤ N ] E τ I [ | τ |≥ N ] | + | E ητ I [ | η |≥ N, | τ |≥ N ] − E ηI [ | η |≥ N ] E τ I [ | τ |≥ N ] | . Multiplying and dividing the ﬁrst term by N and applying Lemma C.2, we get | E ητ I [ | η |≤ N, | τ |≤ N ] − E ηI [ | η |≤ N ] E τ I [ | τ |≤ N ] | ≤ N α k . For the second term, if | τ | ≤ N , then τ ≤ N and τ ≥ − N . Pluggingthis in the second term we get, | E ητ I [ | η |≥ N, | τ |≤ N ] − E ηI [ | η |≥ N ] E τ I [ | τ |≤ N ] | ≤ (cid:12)(cid:12) N E ηI [ | η |≥ N + N (cid:2) E ηI [ | η |≥ N ] (cid:3)(cid:12)(cid:12) (44) = 2 N | E ηI [ | η |≥ N ] | . (45)Since | η | ≥ N , we have 1 ≤ | η | δ N δ . Following this, | N E ηI [ | η |≥ N ] | ≤ N (cid:12)(cid:12)(cid:12)(cid:12) E (cid:20) | η | δ N δ I [ | η |≥ N ] (cid:21)(cid:12)(cid:12)(cid:12)(cid:12) (46) ≤ N N δ | E η δ | ≤ C N δ . (47)Similarly, we can also write for the third term, | E ητ I [ | η |≤ N, | τ |≥ N ] − E ηI [ | η |≤ N ] E τ I [ | τ |≥ N ] | ≤ C N δ .Finally, for the last term we get that by Cauchy-Schwarz inequality, | E ητ I [ | η |≥ N, | τ |≥ N ] − E ηI [ | η |≥ N ] E τ I [ | τ |≥ N ] | ≤ q Var (cid:2) ηI [ | η |≥ N ] (cid:3) Var (cid:2) τ I [ | τ |≥ N ] (cid:3) (48) < q Var (cid:2) ηI [ | η |≥ N ] (cid:3) Var (cid:2) τ I [ | τ |≥ N ] (cid:3) (49) ≤ q E (cid:2) η I [ | η |≥ N ] (cid:3) E (cid:2) τ I [ | τ |≥ N ] (cid:3) . (50)Since | η | > N , 1 < | η | δ N δ . Similarly, 1 < | τ | δ N δ . Plugging these in the previous equation, we get, q E (cid:2) η I [ | η |≥ N ] (cid:3) E (cid:2) τ I [ | τ |≥ N ] (cid:3) ≤ r N δ E (cid:2) | η | δ I [ | η |≥ N ] (cid:3) E (cid:2) | τ | δ I [ | τ |≥ N ] (cid:3) (51) ≤ N δ p C C . (52) Combining the four upper bounds above, we get, | E ητ − E η E τ | ≤ N α k + 2 N δ ( C + C ) + 2 N δ p C C . (53)Now, in particular, setting N = n − / α − / (2+ δ ) k it follows that | E ητ − E η E τ | ≤ n α δ/ (2+ δ ) k + 2 n δ/ α δ/ (2+ δ ) k ( C + C ) + 2 n δ/ α δ/ (2+ δ ) k p C C (54) = (cid:18) n + 2 n δ/ ( C + C ) + 2 n δ/ p C C (cid:19) α δ/ (2+ δ ) k . (55) (cid:3) Lemma C.4.

Let { X t } be an alpha mixing Markov chain with mixing coeﬃcient α k . Furtherassume that E | X t | δ ≤ C and E | X t + k | δ ≤ C for some δ > . Then, for any t and any n > | Cov( X t , X t + k ) | ≤ (cid:18) n + 2 n δ/ ( C + C ) + 2 n δ/ p C C (cid:19) α δ/ (2+ δ ) k . Proof.

Set η = X t , τ = X t + k in Lemma C.3. (cid:3) Proof of Proposition 2.3.

Let { X t } be a stationary alpha-mixing Markov chain under θ with mix-ing coeﬃcients { α k } . Observe that the log-likelihood can be expressed as r n ( θ , θ ) = n X i =1 log (cid:18) p θ ( X i | X i − ) p θ ( X i | X i − ) (cid:19) + log (cid:18) q (0)1 ( X ) q (0)2 ( X ) (cid:19) ≡ n X i =1 Y i + Z . Therefore, the variance of the log-likelihood ratio is simplyVar θ [ r n ( θ , θ )] = Var θ " n X i =1 Y i + Z = n X i,j =1 Cov θ ( Y i , Y j ) + n X i =1 Cov θ ( Y i , Z ) + Cov θ ( Z , Z ) . Now, using Lemma C.4 we have | Cov θ ( Y i , Y j ) | = | E θ Y i Y j − E θ Y i E θ Y j | < (cid:18) n + 2 n δ/ (E θ | Y i | δ + E θ | Y j | δ + q E θ | Y i | δ E θ | Y j | δ ) (cid:19) α δ/ (2+ δ ) | j − i |− = (cid:18) n + 2 n δ/ ( C ( i ) θ ,θ + C ( j ) θ ,θ + q C ( i ) θ ,θ C ( j ) θ ,θ ) (cid:19) α δ/ (2+ δ ) | j − i |− . Similarly, as above we can also say | Cov θ ( Y i , Z ) | < (cid:18) n + 2 n δ/ ( C ( i ) θ ,θ + D , + q C ( i ) θ ,θ D , ) (cid:19) (cid:16) α δ/ (2+ δ ) i − (cid:17) AC-BAYES FOR MARKOV MODELS 23

Combining, the two upper bounds above, we get the ﬁrst result:Var θ (cid:20) r n ( θ , θ ) (cid:21) < n X i,j =1 (cid:18) n + 2 n δ/ ( C ( i ) θ ,θ + C ( j ) θ ,θ + 2 q C ( i ) θ ,θ C ( j ) θ ,θ ) (cid:19) (cid:16) α δ/ (2+ δ ) | i − j |− (cid:17) + n X i =1 (cid:18) n + 2 n δ/ ( C ( i ) θ ,θ + D , + q C ( i ) θ ,θ D , ) (cid:19) (cid:16) α δ/ (2+ δ ) i − (cid:17) + Var[ Z , Z ] . If { X i } is stationary under θ , so is { Y i } . Therefore, E θ | Y i | δ = E θ | Y | δ = C (1) θ ,θ ∀ i , and n X i,j =1 Cov θ ( Y i , Y j ) ≤ n X i,j =1 (cid:18) n + 6 n δ/ C (1) θ ,θ (cid:19) α δ/ (2+ δ ) | j − i |− ≤ n (cid:18) n + 6 n δ/ C (1) θ ,θ (cid:19) X h ≥ α δ/ (2+ δ ) h −  . (57)Again, using Lemma C.4 on Cov θ ( Y i , Z ), yields(58) n X i =1 Cov θ ( Y i , Z ) ≤ (cid:18) n + 2 n δ/ ( C θ + D , + p C θ D , ) (cid:19) X h ≥ α δ/ (2+ δ ) h  . Finally, using eq. (57) and eq. (58) we haveVar θ [ r n ( θ , θ )] ≤ n (cid:18) n + 6 n δ/ C (1) θ ,θ (cid:19) X h ≥ α δ/ (2+ δ ) h −  + (cid:18) n + 2 n δ/ ( C (1) θ ,θ + D , + q C (1) θ ,θ D , ) (cid:19) X h ≥ α δ/ (2+ δ ) h  + Cov θ ( Z , Z ) . (cid:3) Lemma C.5.

Let { X t } be an alpha-mixing Markov Chain with mixing coeﬃcients { α t } . Then theprocess { Y t } where Y t := log (cid:16) p θ ( X t | X t − ) p θ ( X t | X t − ) (cid:17) is also alpha-mixing with mixing coeﬃcients { ˜ α t } where ˜ α t = α t − .Proof. By Z i denote the paired random measure ( X i , X i − ). Let M ji denote the sigma ﬁeldgenerated by the measures X k , where i ≤ k ≤ j . By G ji denote the sigma ﬁeld generatedby the measures Z k , where i ≤ k ≤ j . Let C ∈ M ji − . Then, C can be expressed as( C i − × C i × · · · × C j ). for C i − ∈ M i − i − , C i ∈ M ii . . . and so on. Now, consider a map. T ji : ( C i − × C i × · · · × C j ) −→ ( C i − × C i × C i × · · · × C j − × C j − × C j ). Note that, T ( C ) ∈ G ji . Itis easy to see that G ji = T ji ( M ji − ) ∪ M ∗ ji − , where T ji ( M ji − ) is obtained by applying the map T ji to each element of M ji − . If we assume this latter set to be the range and M ji − to be the domain,then, by construction, T ji is a bijection. Also, the two classes are made of disjoint sets, i.e. if A ∈ T ji ( M ji − ) and A ∗ ∈ M ∗ ji − , then A ∩ A ∗ = φ . Also, note that M j ∗ i − is made of impossible sets.i.e. P ( A ∗ ) = 0 ∀ A ∗ ∈ M j ∗ i − . Now consider the alpha mixing coeﬃcients for Z i . By deﬁnition, it is given by α zk = sup i sup A ∈G i −∞ ,B ∈G ∞ i + k | P ( A ∩ B ) − P ( A ) P ( B ) | = sup i sup A ∈G i −∞ ,B ∈G ∞ i + k | P (( A o ∪ A ∗ ) ∩ ( B o ∪ B ∗ )) − P (( A o ∪ A ∗ )) P (( B o ∪ B ∗ )) | . Where, A = ( A o ∪ A ∗ ) B = ( B o ∪ B ∗ ) A o ∈ T i −∞ ( M i −∞ ) A ∗ ∈ M ∗ i −∞ B o ∈ T ∞ i + k − ( M ∞ j + k − ) B ∗ ∈ M ∗∞ j + k − . Then, the expression for the alpha mixing coeﬃcient can be reduced into α zk = sup i sup A o ∈ T i −∞ ( M i −∞ ) ,B o ∈ T ∞ i + k − ( M ∞ i + k − ) | P ( A o ∩ B o ) − P ( A o ) P ( B o ) | . Note that, by bijection property of T ji , we can ﬁnd A ′ ∈ M i −∞ and B ′ ∈ M ∞ i + k − such that α zk = sup i sup A ′ ∈M i −∞ ,B ′ ∈M ∞ i + k − | P ( T i −∞ ( A ′ ) ∩ T ∞ i + k − ( B ′ )) − P ( T i −∞ ( A ′ )) P ( T ∞ i + k − ( B ′ )) | . = α k − . Now, log (cid:16) p θ ( X n | X n − ) p θ ( X n | X n − ) (cid:17) is just a function of the paired Markov chain Z i , therefore it has alpha-mixing coeﬃcient α k − . (cid:3) C.5.

Proof of Theorem 3.1 :

Proof. Part 1: Verifying condition (i) of Corollary 2.1.

We substitute the true parameter θ for θ and θ for θ . We also set q (0)1 to be the invariantdistribution of the Markov chain under θ , q , and q (0)2 as the invariant distribution of the Markovchain under θ , q θ . Applying the fact that these Markov chains are stationary to Proposition 2.2,we have K ( P ( n ) θ , P ( n ) θ ) = n E (cid:20) log (cid:18) p θ ( X | X ) p θ ( X | X ) (cid:19)(cid:21) + E[ Z ] , ≤ n m X j =1 E h M (1) j ( X , X ) i | f (1) j ( θ, θ ) | + m X k =1 E[ M (2) k ( X )] | f (2) k ( θ, θ ) | , (59)where the inequality follows from Assumption 3.1. Therefore, it follows that Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ n m X j =1 E h M (1) j ( X , X ) i Z | f (1) j ( θ, θ ) | ρ n ( dθ ) + m X k =1 E[ M (2) k ( X )] | Z f (2) k ( θ, θ ) | ρ n ( dθ ) . By item i in Assumption 3.1, it follows that Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ n m X j =1 E h M (1) j ( X , X ) i C √ n + m X k =1 E[ M (2) k ( X )] C √ n ≤ nǫ (1) n , where ǫ (1) n ∈ O (cid:16) √ n (cid:17) . Part 2: Verifying condition (ii) of Corollary 2.1.

Again, using Proposition 2.3 along with the fact

AC-BAYES FOR MARKOV MODELS 25 that the Markov chain is stationary we haveVar[ r n ( θ, θ )] ≤ n (cid:18) n + 6 n δ/ C (1) θ ,θ (cid:19) X k ≥ α δ/ (2+ δ ) k  + (cid:18) n + 2 n δ/ ( C (1) θ ,θ + D θ ,θ + q C (1) θ ,θ D θ ,θ ) (cid:19) X k ≥ α δ/ (2+ δ ) k  + Var[ Z ] . It then follows that Z Var[ r n ( θ, θ )] ρ n ( dθ ) ≤ n (cid:18) n + 6 n δ/ Z C (1) θ ,θ ρ n ( dθ ) (cid:19) X k ≥ α δ/ (2+ δ ) k −  + Z Var[ Z ] ρ n ( dθ )+ (cid:18) n + 2 n δ/ ( Z C (1) θ ,θ ρ n ( dθ ) + Z D θ ,θ ρ n ( dθ ) + Z q C (1) θ ,θ D θ ,θ ρ n ( dθ )) (cid:19) X k ≥ α δ/ (2+ δ ) k  . First, consider the term R C (1) θ ,θ ρ n ( θ ), and observe that Z C (1) θ ,θ ρ n ( dθ ) = Z E log (cid:12)(cid:12)(cid:12)(cid:12) p θ ( X | X ) p θ ( X | X ) (cid:12)(cid:12)(cid:12)(cid:12) δ ρ n ( dθ ) . By Assumption 3.1, we have Z E log (cid:12)(cid:12)(cid:12)(cid:12) p θ ( X | X ) p θ ( X | X ) (cid:12)(cid:12)(cid:12)(cid:12) δ ρ n ( dθ ) ≤ Z E  m X j =1 M (1) j ( X , X ) | f (1) k ( θ, θ ) |  δ ρ n ( dθ ) . Since the function x x δ is convex, we can apply Jensen’s inequality to obtain,  m X j =1 M (1) j ( X , X ) | f (1) k ( θ, θ ) |  δ ≤ m δ m X k =1 M (1) j ( X , X ) δ | f (1) k ( θ, θ ) | δ . Therefore, it follows that Z E log (cid:12)(cid:12)(cid:12)(cid:12) p θ ( X | X ) p θ ( X | X ) (cid:12)(cid:12)(cid:12)(cid:12) δ ρ n ( dθ ) ≤ m δ m X k =1 E[ M (1) k ( X , X ) δ ] Z | f (1) k ( θ, θ ) | δ ρ n ( dθ ) . By Assumption 3.1, R | f k ( θ, θ ) | δ ρ n ( dθ ) < Cn and E[ M (1) k ( X , X ) δ ] < B , implying that Z C (1) θ ,θ ρ n ( dθ ) ≤ m δ m X k =1 B Cn = m δ BCn .

Since (cid:16)P k ≥ α δ/ (2+ δ ) k (cid:17) < ∞ , it follows that (cid:16) n + 6 n δ/ R C (1) θ ,θ ρ n ( dθ ) (cid:17) (cid:16)P k ≥ α δ/ (2+ δ ) k − (cid:17) ∈ O( n δ/ n ).Similarly, we can show that R D θ ,θ ρ n ( dθ ) ∈ O ( n ), and R Var[ Z ] ρ n ( dθ ) ∈ O ( n ).For the ﬁnal term R q C (1) θ ,θ D θ ,θ ρ n ( dθ ), use the Cauchy-Schwarz inequality to obtain the upperbound (cid:16)R C (1) θ ,θ ρ n ( dθ ) R D θ ,θ ρ n ( dθ ) (cid:17) / which is also of order O ( n ). Combining all of these together we have Z Var[ r n ( θ, θ )] ρ n ( dθ ) ≤ nǫ (2) n , for some ǫ (2) n ∈ O( n δ/ n ).Since K ( ρ n , π ) < √ nC = n C √ n , it follows that K ( ρ n , π ) < nǫ (3) n , where ǫ (3) n = O(1 / √ n ) as before.Finally, by choosing ǫ n = max( ǫ (1) n , ǫ (2) n , ǫ (3) n ), our theorem is proved. (cid:3) C.6.

Proof of Theorem 4.1.

Proof. Verifying condition (i) of Corollary 2.1:

As in the proof of Theorem 3.1 substitute the trueparameter θ for θ and θ for θ in . We also set q (0)1 and q (0)2 to the distribution q (0) . ApplyingProposition 2.2 to the corresponding transition kernels and initial distribution we have, K ( P ( n ) θ , P ( n ) θ ) = n X i =1 E (cid:20) log (cid:18) p θ ( X i | X i − ) p θ ( X i | X i − ) (cid:19)(cid:21) + E (cid:20) log (cid:18) D ( X ) D ( X ) (cid:19)(cid:21) (60) = n X i =1 E (cid:20) log (cid:18) p θ ( X i | X i − ) p θ ( X i | X i − ) (cid:19)(cid:21) . Now, applying Assumption 3.1, we can bound the previous equation as follows, K ( P ( n ) θ , P ( n ) θ ) ≤ n X i =1 E " m X k =1 M (1) k ( X i , X i − ) | f (1) k ( θ, θ ) | = n X i =1 m X k =1 E h M (1) k ( X i , X i − ) i | f (1) k ( θ, θ ) | . (61)Since M (1) k ’s are bounded there exists a constant Q so that, Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ Q Z n X i =1 m X k =1 | f (1) k ( θ, θ ) | ρ n ( dθ )= Qn m X k =1 Z | f (1) k ( θ, θ ) | ρ n ( dθ ) . By eq. (21) in Assumption 3.1, it follows that Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ Qn m X k =1 C √ n = nmQ C √ n = nǫ (1) n , for some ǫ (1) n ∈ O( √ n ). Verifying condition (ii) of Corollary 2.1:

As in the previous part, Z = 0, implying that D θ,θ .Applying Proposition 2.3 and integrating with respect to ρ n , we obtain Z Var [ r n ( θ, θ )] ρ n ( dθ ) ≤ n X i =1 (cid:18) n + 2 n δ/ Z C ( i ) θ ,θ ρ n ( dθ ) (cid:19) (cid:16) α δ/ (2+ δ ) i − (cid:17) + n X i,j =1 (cid:18) n + 2 n δ/ ( Z C ( i ) θ ,θ ρ n ( dθ ) + Z C ( j ) θ ,θ ρ n ( dθ ) + Z q C ( i ) θ ,θ C ( j ) θ ,θ ρ n ( dθ )) (cid:19) (cid:16) α δ/ (2+ δ ) | i − j |− (cid:17) . (62) AC-BAYES FOR MARKOV MODELS 27

First, consider the term R C ( i ) θ ,θ ρ n ( dθ ). Using Assumption 3.1, we can upper bound C ( i ) θ ,θ as, C ( i ) θ ,θ ≤ E " m X k =1 M (1) k ( X i , X i − ) | f (1) k ( θ, θ ) | δ ≤ m X k =1 m δ E (cid:20)(cid:16) M (1) k ( X i , X i − ) | f (1) k ( θ, θ ) | (cid:17) δ (cid:21) (by Jensen’s inequality)= m X k =1 m δ E h M (1) k ( X i , X i − ) δ i | f (1) k ( θ, θ ) | δ . Since M (1) k ’s are upper bounded by Q , it follows that, C ( i ) θ ,θ ≤ P mk =1 m δ Q δ | f (1) k ( θ, θ ) | δ .Hence, from Assumption 3.1, we get, Z C ( i ) θ ,θ ρ n ( dθ ) ≤ m X k =1 m δ Q δ Z | f (1) k ( θ, θ ) | δ ρ n ( dθ ) ≤ ( mQ ) δ Cn .

Using the upper bound above, we can say for an L large enough, R C ( i ) θ ,θ ρ n ( dθ ) ≤ Ln . Next, by theCauchy-Schwarz inequality, we have that R q C ( i ) θ ,θ C ( j ) θ ,θ ρ n ( dθ )) < qR C ( i ) θ ,θ ρ n ( dθ ) R C ( j ) θ ,θ ρ n ( dθ )) ≤ Ln . Thus, we have the following upper bound. Z Var [ r n ( θ, θ )] ρ n ( dθ ) ≤ n X i =1 (cid:18) n + 2 n δ/ Ln (cid:19) (cid:16) α δ/ (2+ δ ) i − (cid:17) + n X i,j =1 (cid:18) n + 2 n δ/ ( Ln + Ln + Ln ) (cid:19) (cid:16) α δ/ (2+ δ ) | i − j |− (cid:17) = (cid:18) n + 2 n δ/ Ln (cid:19) n X i =1 α δ/ (2+ δ ) i − ! + (cid:18) n + 6 n δ/ Ln (cid:19)  n X i,j =1 α δ/ (2+ δ ) | i − j |−  . Since P ni,j =1 α δ/ (2+ δ ) | i − j |− < n P k ≥ α δ/ (2+ δ ) k − < ∞ , we have that for some ǫ (2) n ∈ O( n δ/ n ), Z Var [ r n ( θ, θ )] ρ n ( dθ ) < nǫ (2) n . Since K ( ρ n , π ) ≤ √ nC , following the concluding argument in Theorem 3.1 completes the proof. (cid:3) C.7.

Proposition 4.1.

Proof of Proposition 4.1.

We verify Assumption 3.1 and the proof follows from Theorem 4.1. For i ∈ { , , . . . , K − } , p θ ( j | i ) = ( θ if j = i − , − θ if j = i + 1 . If i = 0 or i = K , then the Markov chain goes back to 1 or K − = 0, the log ratio of the transition probabilities becomes, | log p θ ( X | X ) − log p θ ( X | X ) | = I [ X = X +1] log (cid:18) θ θ (cid:19) + I [ X = X − log (cid:18) − θ − θ (cid:19) . In this case, m = 2. M (1)1 ( X , X ) = I [ X = X +1] and M (1)2 ( X , X ) = I [ X = X − , both of which arebounded. Let f (1)1 ( θ, θ ) := log (cid:16) θ θ (cid:17) suppose f (1)2 ( θ, θ ) := log (cid:16) − θ − θ (cid:17) . The stationary distribution q θ ( i ) = K ∀ i ∈ , , . . . , K . Hence the log of the ratio of theinvariant distribution becomes log q ( x ) − log q θ ( x ) = 0 , (63)and we can set M (2) i ( · ) := 1 and f (2) i ( · , · ) := 0 for i ∈ { , } . Thus, to prove the concentration boundfor this Markov chain it is enough to assume that δ = 1 and show that R [ f (1)1 ( θ, θ )] ρ n ( dθ ) < Cn and R [ f (1)2 ( θ, θ )] ρ n ( dθ ) < Cn for some constant C > { ρ n } is a sequence of beta probability distribution functions, with parameters α n , β n that satisfy the constraint α n α n + β n = θ . Speciﬁcally, we choose α n = nθ and (therefore) β n = n (1 − θ ). Thus, we get the following, Z | f (1)1 ( θ, θ ) | ρ n ( dθ ) = Z (cid:12)(cid:12)(cid:12)(cid:12) log (cid:18) θ θ (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) ρ n ( dθ ) < Z (cid:12)(cid:12)(cid:12)(cid:12) θ θ − (cid:12)(cid:12)(cid:12)(cid:12) ρ n ( dθ )= 1Beta( α n , β n ) Z (cid:12)(cid:12)(cid:12)(cid:12) θ − θθ (cid:12)(cid:12)(cid:12)(cid:12) θ α n − (1 − θ ) β n − dθ. Since θ , θ ∈ (0 , | θ − θ | , giving | θ − θ | < θ − θ ) . We use that fact to arrive at Z | f (1)1 ( θ, θ ) | ρ n ( dθ ) ≤ α n , β n ) Z ( θ − θ ) θ α n − (1 − θ ) β n − dθ = 2Beta( α n − , β n )Beta( α n , β n ) ( α n − β n )( α n + β n − ( α n + β n − . From our choice of α n and β n , α n − ,β n )Beta( α n ,β n ) = O (1), and plugging the values of α n and β n into ( α n − β n )( α n + β n − ( α n + β n − , we get ( α n − β n )( α n + β n − ( α n + β n − = n ( θ − n )(1 − θ )(1 − n ) (1 − n ) , which is upper bounded by C n for some constant C >

0. Hence, Z | f (1)1 ( θ, θ ) | ρ n ( dθ ) < C n . Similarly, we can also show that, Z | f (1)2 ( θ, θ ) | ρ n ( dθ ) < C n . Finally, from Proposition B.1, we get that K ( ρ n , π ) < C + log( n ) for some large constant C .Hence, K ( ρ n , π ) < C √ n for some constant C >

0. Choosing C = max( C , C , C ), we satisfy allthe conditions of Assumption 3.1 and Theorem 4.1. (cid:3) C.8.

Proof of Proposition 4.2.

Proof.

For the purpose of this proof, we choose ρ n ’s with scaled Beta distribution with parameters α n = n (2 θ ) and β n = n (1 − θ ). Since, ρ n is a scaled Beta distribution with the scaling factors m = 0 . c = 0, the pdf of ρ n is given by ρ n ( θ ) = 0 . α n , β n ) (2 θ ) α n (1 − θ ) β n Since this is a scaled distribution, E ρ n [ θ ] = 0 . α n α n + β n = θ and there exists a constant σ > ρ n [ θ ] = σ n . Now, we analyse the transition probabilities. For i ∈ { , , . . . } , the Birth-Death AC-BAYES FOR MARKOV MODELS 29 process has transition probabilities p θ ( j | i ) = ( θ if j = i − , − θ if j = i + 1 . If i = 0, then the Markov chain goes to 1 with probability 1. Hence with the convention log = 0the ratio of the log of the transition probabilities becomes, | log p θ ( X | X ) − log p θ ( X | X ) | = I [ X = X +1] log (cid:20) θ θ (cid:21) + I [ X = X − log (cid:20) − θ − θ (cid:21) . In this case, m = 3. M (1)1 ( X , X ) = I [ X = X +1] and M (1)2 ( X , X ) = I [ X = X − . De-ﬁne M (1)3 ( X , X ) := 1. All these random variables are bounded. Deﬁne f (1)1 ( θ, θ ) :=log h θ θ i , f (1)2 ( θ, θ ) := log h − θ − θ i and f (1)3 ( θ, θ ) := 0. Similarly as in the proof on Proposition 4.1, Z [ f (1)1 ( θ, θ )] ρ n ( dθ ) < C n , and Z [ f (1)2 ( θ, θ )] ρ n ( dθ ) < C n . The stationary distribution is given by q θ ( i ) = ( θ − θ ) i − q θ (1) ∀ i ∈ , , . . . , so that q θ ( i ) = (1 − θ )( θ − θ ) i − Hence the log of the ratio of the invariant distribution becomeslog q ( i ) − log q θ ( i ) = log (cid:20) − θ − θ (cid:21) + ( i −

1) log (cid:20) θ θ (cid:21) − ( i −

1) log (cid:20) − θ − θ (cid:21) (64)We deﬁne M (2)1 ( X ) := 1, and M (2)2 ( X ) = M (2)3 ( X ) := X −

1. We can write E q (0) [ M (2)2 ( X )] = P ∞ i =1 ( i − q (0) ( i ) < P ∞ i =1 i q (0) ( i ). We have chosen q (0) such that P ∞ i =1 i q (0) ( i ) is bounded.Hence, E q (0) [ M (2)2 ( X )] < ∞ . To verify Item i deﬁne, f (2)1 ( θ, θ ) = − f (2)3 ( θ, θ ) := log h − θ − θ i , anddeﬁne f (2)2 ( θ, θ ) := log h θ θ i . Therefore following the proof of Proposition 4.1, Z | f (2)1 ( θ, θ ) | ρ n ( dθ ) = Z | f (2)3 ( θ, θ ) | ρ n ( dθ ) = Z | f (1)2 ( θ, θ ) | ρ n ( dθ ) < C n , and , Z | f (2)2 ( θ, θ ) | ρ n ( dθ ) = Z | f (1)1 ( θ, θ ) | ρ n ( dθ ) < C n . Finally, we take the KL-divergence K ( ρ n , π ). ρ n follows a scaled Beta distribution on (0 , /

2) withparameters α n = n (2 θ ) and β n = n (1 − θ ), while π follows a scaled Beta distribution on (0 , / α and β . Thus, K ( ρ n , π ) = Z ρ n ( θ ) π ( θ ) ρ n ( dθ ) , which, by substituting t = 2 θ , we get, K ( ρ n , π ) = 2 Z ρ n ( t ) π ( t ) ρ n ( dt ) . R ρ n ( t ) π ( t ) ρ n ( dt ) is the KL-divergence between a Beta distribution with parameters α n and β n anda Beta distribution with parameters α and β . An application of proposition B.1 gives us for aconstant C > Z ρ n ( t ) π ( t ) ρ n ( dt ) < C + 12 log( n ) . Hence we can say, K ( ρ n , π ) < (cid:2) C + log( n ) (cid:3) . Thus, we now get that for some constant C > K ( ρ n , π ) < C √ n. Choosing C = max( C , C , C ) we satisfy all of the conditions of Assumption 3.1 and thus byTheorem 4.1, we are complete the proof. (cid:3) C.9.

Proof of Theorem 4.3 :

Proof. Veriﬁcation of condition (i) of Corollary 2.1

As in the proof of Theorem 3.1 substitute thetrue parameter θ for θ and θ for θ . We also set q (0)1 and q (0)2 to the known initial distribution q (0) . Similar to the steps leading to eq. (61), we get K ( P ( n ) θ , P ( n ) θ ) ≤ n X i =1 m X k =1 E h M (1) k ( X i , X i − ) i | f (1) k ( θ, θ ) | . Consider the term E h M (1) k ( X i , X i − ) i . With q ( i − θ the marginal distribution of X i − , we haveE h M (1) k ( X i , X i − ) i = Z M (1) k ( x i , x i − ) p θ ( x i | x i − ) q ( i − θ ( x i − ) dx i dx i − . E h M (1) k ( X i , X i − ) i = Z M (1) k ( x i , x i − ) p θ ( x i | x i − ) p i − θ ( x i − | x ) q (0) θ ( x ) dx dx i dx i − Recall that the marginal density satisﬁes q ( i − θ ( x i − ) = R p i − θ ( x i − | x ) q (0) θ ( x ) d ( x ), where p iθ ( ·| x ) is the i -step transition probability. ThenE h M (1) k ( X i , X i − ) i = Z E h M (1) k ( X i , x i − ) | x i − i p i − θ ( x i − | x ) q (0) θ ( x ) dx dx i − . Since the Markov chain { X n } satisﬁes Assumption A.1, we know by the application of Theorem A.1that { X n } is V -geometrically ergodic. Hence, ∃ τ < R < ∞ such that ∀ | f | < V | Z f ( x i − ) p i − θ ( x i − | x ) dx i − − Z f ( x i − ) q θ ( x i − ) dx i − | < RV ( x ) τ i − , where q θ is the stationary distribution, implying that Z f ( x i − ) p i − θ ( x i − | x ) dx i − < Z f ( x i − ) q θ ( x i − ) dx i − + RV ( x ) τ i − . Using Jensen’s inequality we have (cid:16) E h M (1) k ( X i , X i − ) | X i − i(cid:17) δ ≤ E h M (1) k ( X i , X i − ) δ | X i − i

1, it follows that E h M (1) k ( X i , X i − ) | X i − i < V ( X i − ) / (2+ δ ) ≤ V ( X i − ).Thus, setting f ( x ) = E h M (1) k ( X i , X i − ) | X i − = x i , we obtainE h M (1) k ( X i , X i − ) i < Z h E h M (1) k ( X i , X i − ) | X i − i q θ ( x i ) dx i − + RV ( x ) τ i − i q (0) ( x ) dx = E[ M (1) k ( X , X )] + τ i − Z RV ( x ) q (0) ( x ) dx . AC-BAYES FOR MARKOV MODELS 31

Summing from i = 1 to n , we get n X i =1 E h M (1) k ( x i , x i − ) i < n E[ M (1) k ( X , X )] + n X i =1 τ i − Z RV ( x ) q (0) ( x ) dx = n E[ M (1) k ( X , X )] + 1 − τ n − τ Z RV ( x ) q (0) ( x ) dx . This gives us the following bound on R K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ): Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ m X k =1 (cid:20) n E[ M (1) k ( X , X )] + 1 − τ n − τ Z RV ( x ) D ( x ) dx (cid:21) Z | f (1) k ( θ, θ ) | ρ n ( dθ ) . By Assumption 3.1, R | f (1) k ( θ, θ ) | ρ n ( dθ ) < C √ n . Hence, we can rewrite the previous expression as Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ m X k =1 (cid:20) n E[ M (1) k ( X , X )] + 1 − τ n − τ Z RV ( x ) D ( x ) dx (cid:21) C √ n = nm (cid:20) E[ M (1) k ( X , X )] + 1 − τ n n (1 − τ ) Z RV ( x ) D ( x ) dx (cid:21) C √ n . Since, τ <

1, 0 < − τ n <

1, and we rewrite the previous equation as, Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ nm (cid:20) E[ M (1) k ( X , X )] + 1 n (1 − τ ) Z RV ( x ) D ( x ) dx (cid:21) C √ n . Hence, there exists an ǫ (1) n ∈ O( √ n ) such that R K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ nǫ (1) n . Veriﬁcation of condition (ii) of Corollary 2.1:

1, such thatE h M (1) k ( X i , X i − ) i δ ≤ n E[ M (1) k ( X , X )] δ + τ i − Z RV ( x ) D ( x ) dx , which, by the fact that τ i − < τ , gives us,E h M (1) k ( X i , X i − ) i δ ≤ E[ M (1) k ( X , X )] δ + τ Z RV ( x ) D ( x ) dx . By Assumption 3.1, we know that, R | f (1) k ( θ, θ ) | δ ρ n ( dθ ) < Cn . Hence, for a large constant L , R C ( i ) θ ,θ ρ n ( dθ ) ≤ Ln . We also see that since the chain is geometrically ergodic, by the application ofeq. (11), P k ≥ α δ/ (2+ δ ) k < + ∞ . The rest of the proof follows similarly as in the proof of Theorem 4.1,and we obtain an ǫ (2) n ∈ O( n δ/ n ), such that, Z Var[ r n ( θ, θ )] ρ n ( dθ ) < nǫ (2) n . Since, K ( ρ n , π ) ≤ √ nC , similar arguments as in the proof of Theorem 3.1 holds. The theorem isthus proved. (cid:3) C.10.

Proof of Theorem 4.2.

Proof. Veriﬁcation of condition (i) of Corollary 2.1

As in the proof of Theorem 3.1 substitute thetrue parameter θ for θ and θ for θ . We also set our initial distributions q (0)1 and q (0)2 to the knowninitial distribution q (0) . A method similar to Equation (61), yields K ( P ( n ) θ , P ( n ) θ ) ≤ n X i =1 m X k =1 E h M (1) k ( X i , X i − ) i | f (1) k ( θ, θ ) | . Because M (1) k s satisfy Assumption 4.1, it follows by the application of Theorem 2.3, [12] that ∃ λ > < κ ≤ λ , and for some ζ ∈ (0 ,

1) possibly depending upon λ ,E h e κM (1) k ( X i ,X i − ) (cid:12)(cid:12)(cid:12) X , X ] ≤ ζ i − e κM (1) k ( X ,X ) + 1 − ζ i − ζ D e κa for all i > . We rewrite E h M (1) k ( X i , X i − ) | X , X i as follows:E h M (1) k ( X i , X i − ) | X , X i = E[ κM (1) k ( X i , X i − ) | X , X ] κ ≤ E[ e κM (1) k ( X i ,X i − ) | X , X ] κ . Therefore, P ni =1 E h M (1) k ( X i , X i − ) i can be upper bounded as, n X i =1 E h M (1) k ( X i , X i − ) i = n X i =1 E h M (1) k ( X i , X i − ) | X , X i κ − ≤ n X i =1 (cid:20) ζ i − E e κM (1) k ( X ,X ) + 1 − ζ i − ζ D e κa (cid:21) κ − . Since, ζ ∈ (0 , ζ i <

1. Hence, we can write that, n X i =1 (cid:20) ζ i − E e κM (1) k ( X ,X ) + 1 − ζ i − ζ D e κa (cid:21) ≤ n X i =1 (cid:20) ζ i − E e κM (1) k ( X ,X ) + 11 − ζ D e κa (cid:21) κ − = (cid:20) − ζ n − ζ E e κM (1) k ( X ,X ) + n − ζ D e κa (cid:21) κ − ≤ nL, AC-BAYES FOR MARKOV MODELS 33 for a large constant L . Therefore R K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) can be upper bounded as follows, Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ Z m X k =1 nL | f (1) k ( θ, θ ) | ρ n ( dθ )= m X k =1 nL Z | f (1) k ( θ, θ ) | ρ n ( dθ ) . By Assumption 3.1, R | f (1) k ( θ, θ ) | ρ n ( dθ ) < Cn , hence, Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ nL C √ n . Hence, for some ǫ (1) n ∈ O( √ n ), we have obtained that, R K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ nǫ (1) n . Veriﬁcation of condition (ii) of Corollary 2.1:

Similar to as in the proof of Theorem 4.1, weupper bound R Var [ r n ( θ, θ )] ρ n ( dθ ) by Z Var [ r n ( θ, θ )] ρ n ( dθ ) ≤ n X i,j =1 (cid:18) n + 2 n δ/ ( Z C ( i ) θ ,θ ρ n ( dθ ) + Z C ( j ) θ ,θ ρ n ( dθ ) + Z q C ( i ) θ ,θ C ( j ) θ ,θ ρ n ( dθ )) (cid:19) (cid:16) α δ/ (2+ δ ) | i − j |− (cid:17) (66) + n X i =1 (cid:18) n + 2 n δ/ Z C ( i ) θ ,θ ρ n ( dθ ) (cid:19) (cid:16) α δ/ (2+ δ ) i − (cid:17) , where C θ ,θ is upper bounded as C ( i ) θ ,θ ≤ m X k =1 m δ E h M (1) k ( X i , X i − ) i δ | f (1) k ( θ, θ ) | δ . There exists a constant C δ depending upon δ such that,[ M (1) k ] δ ( X i , X i − ) = κ δ [ M (1) k ] δ ( X i , X i − ) δ κ δ ≤ e κM (1) k ( X i ,X i − ) + C δ κ δ . By expressing E h M (1) k ( X i , X i − ) δ i = E h E h M (1) k ( X i , X i − ) δ | X , X ii and following a methodsimilar to the previous part, we get,E h M (1) k ( X i , X i − ) δ i ≤ h ζ i E e κM (1) k ( X ,X ) + − ζ i − ζ D e κa i + C δ κ δ . The fact that 0 < ζ < < ζ i < ζ . This gives us the following,E h M (1) k ( X i , X i − ) δ i ≤ h ζ E e κM (1) k ( X ,X ) + − ζ D e κa i + C δ κ δ . Since κ < λ , by the application of Jensen’s inequality, we getE h M (1) k ( X i , X i − ) δ i ≤ h ζ E e λM (1) k ( X ,X ) + − ζ D e κa i + C δ κ δ = h ζ R e λM (1) k ( x ,x ) p θ ( x | x ) D ( x ) dx dx + − ζ D e κa i + C δ κ δ . We know that R | f (1) k ( θ, θ ) | δ ρ n ( dθ ) < Cn . Thus, following Assumption 3.1 we can say that, fora large constant L , R C ( i ) θ ,θ ρ n ( dθ ) ≤ Ln . The rest of the proof follows similarly as in the proof ofTheorem 4.1, and we obtain an ǫ (2) n ∈ O( n δ/ n ), such that, Z Var[ r n ( θ, θ )] ρ n ( dθ ) < nǫ (2) n . Since, K ( ρ n , π ) ≤ √ nC , similar arguments as in the proof of Theorem 3.1 holds. The theorem isthus proved. (cid:3) C.11.

Proof of Proposition 4.3.

Proof.

For the purpose of the proof, we choose ρ n ’s with scaled Beta distribution with parameters α n = n θ and β n = n − θ . Since, ρ n is a scaled Beta distribution with the scaling factors m = 2and c = −

1, the pdf of ρ n is given by ρ n ( θ ) = 2Beta( α n , β n ) (cid:18) θ (cid:19) α n (cid:18) − θ (cid:19) β n Since this is a scaled distribution, E ρ n [ θ ] = 2 α n α n + β n − θ and there exists a constant σ > ρ n [ θ ] = σ n . We now analyse the log-ratio of the transition probabilities for the Markov chain,log p θ ( X n | X n − ) − log p θ ( X n | X n − ) = 2 X n X n − ( θ − θ ) + X n − ( θ − θ ) . Observe that in this setting, M (1)1 ( X n , X n − ) = | X n X n − | and M (1)2 ( X n , X n − ) = X n . Next, usingthe fact that E[ | X n | δ | X n − ] = E[ | X n − θ X n − + θ X n − | δ | X n − ] , and by an application of triangle inequality, we obtainE[ | X n | δ | X n − ] ≤ E h ( | X n − θ X n − | + | θ X n − | ) δ | X n − i = E "(cid:18) | X n − θ X n − | + | θ X n − | (cid:19) δ | X n − = E " δ (cid:18) | X n − θ X n − | + | θ X n − | (cid:19) δ | X n − . Now by using Jensen’s inequality we get,E[ | X n | δ | X n − ] ≤ E (cid:20) δ (cid:18) | X n − θ X n − | δ + | θ X n − | δ (cid:19) | X n − (cid:21) = 2 δ E h | X n − θ X n − | δ | X n − i + 2 δ | θ X n − | . AC-BAYES FOR MARKOV MODELS 35

We know if Y ∼ N ( µ, σ ), then E | Y − µ | p = σ p p p +12 ) √ π . Consequently,E[ | X n | δ | X n − ] ≤ δ " δ Γ( δ ) √ π + 2 δ | θ X n − | δ . (67)It follows that,E[ M (1)1 ( X n , X n − ) δ | X n − ] ≤ δ " δ Γ( δ ) √ π | X n − | δ + 2 δ | θ | δ | X n − | δ ≤ δ " δ Γ( δ ) √ π + 2 δ | θ | δ ! ( | X n − | δ + 1) . Since θ <

1, we can say,E[ M (1)1 ( X n , X n − ) δ | X n − ] ≤ δ " δ Γ( δ ) √ π + 2 δ ! ( | X n − | δ + 1) . Deﬁne a constant C δ := (cid:18) δ (cid:20) δ δ √ π (cid:21) + 2 δ (cid:19) . The above term then becomes,E[ M (1)1 ( X n , X n − ) δ | X n − ] ≤ C δ ( | X n − | δ + 1) . Next we analyse the term M (1)2 ( X n , X n − ).E h M (1)2 ( X n , X n − ) δ | X n − i = E[ X δn − | X n − ]= X δn − ≤ C δ ( X δn − + 1) . Then, deﬁning V ( x ) := C δ ( x δ + 1) it follows that,E [ V ( X n ) | X n − ] = E h C δ ( X δn + 1) | X n − i . Using a technique similar to Equation (67) we get,E h C δ ( X δn + 1) | X n − i ≤ " C δ (2 δ " δ Γ( δ ) √ π + 2 δ | θ X n − | δ + 1) . Deﬁne another constant C ′ δ := C δ (cid:18) δ (cid:20) δ δ √ π (cid:21) − δ | θ | δ + 1 (cid:19) . Since δ > δ δ √ π >

1. And since | θ | <

1, so is | θ | δ . Hence,2 δ " δ Γ( δ ) √ π − δ | θ | δ > . Hence, we have shown that,E [ V ( X n ) | X n − ] ≤ (2 δ | θ | δ ) C δ ( X δn − + 1) + C ′ δ . Since | θ | < δ − , 2 δ | θ | δ <

1, and we can express the above equation as,E [ V ( X n ) | X n − ] ≤ V ( X n − ) + C ′ δ . Deﬁne the set C ( m ) := { x : | x | δ + 1 ≤ m } . From Proposition 11.4.2, [20], for a large enough m , C ( m ) forms a petite set. Thus, we have proved that V ( x ) as deﬁned in this example satisﬁesAssumption A.1, and { X n } is V -geometrically ergodic. The f (1) j ’s corresponding to Assumption 3.1are given by f (1)1 ( θ, θ ) = ( θ − θ ) and f (1)2 ( θ, θ ) = ( θ − θ ). Therefore, it follows that, ∂ θ f (1)1 = 1 ,∂ θ f (1)2 = − θ and − < − θ < . Since f (1)1 ( θ , θ ) = f (1)2 ( θ , θ ) = 0, We just showed that they also have bounded partial derivatives.We also know that | θ | <

1. Hence, by Proposition 3.1 f (1) j ’s satisfy the conditions of Assumption 3.1.The invariant distribution for the simple linear model Markov-chain under parameter θ is givenby a gaussian distribution with mean 0 and variance − θ . In other words, q θ ( x ) = 1 √ π e − − θ x . Analyzing the log likelihood yields,log q ( x ) − log q θ ( x ) = − x − θ ) + x − θ )= x θ − θ ) . Let f (2)1 ( θ , θ ) = ( θ − θ ) and f (2)1 ( θ , θ ) = 0. Since f (2)1 ( θ , θ ) = f (1)2 ( θ , θ ), by follow-ing arguments similar as before, can conclude that f (2)1 ( θ , θ ) also satisﬁes the requirements ofAssumption 3.1. Let M (2)1 ( x ) = x and deﬁne M (2)2 ( x ) := 1. Let X ∼ q (0)1 . As long as R x δ q (0)1 ( x ) dx < ∞ , we satisfy all the conditions required for Theorem 4.3. Finally we needto verify the condition that K ( ρ n , π ) < C √ n for some constant C >

0. The KL-divergence R log (cid:16) ρ n ( θ ) π ( θ ) (cid:17) ρ n ( dθ ) becomes, K ( ρ n , π ) = Z − log α n , β n ) (cid:18) θ (cid:19) α n (cid:18) − θ (cid:19) β n ! α n , β n ) (cid:18) θ (cid:19) α n (cid:18) − θ (cid:19) β n dθ. Substituting, y = θ , we get, K ( ρ n , π ) = Z log (cid:18) α n , β n ) ( y ) α n (1 − y ) β n (cid:19) α n , β n ) ( y ) α n (1 − y ) β n dy = Z log(2) 1Beta( α n , β n ) ( y ) α n (1 − y ) β n dy + Z log (cid:18) α n , β n ) ( y ) α n (1 − y ) β n (cid:19) α n , β n ) ( y ) α n (1 − y ) β n . The ﬁrst term integrates upto 2. The second term is the KL-divergence between a Uniform andBeta distribution with parameters α n = n θ and β n = n (1 − θ ) and support [0 , AC-BAYES FOR MARKOV MODELS 37

Lemma B.1 it follows that K ( ρ n , π ) is upper bounded by, K ( ρ n , π ) < C + 12 log( n ) < C √ n, for some large constant C . This completes the proof. (cid:3) C.12.

Proof of Theorem 5.1.

Proof.

As in the proof of Theorem 2.1, following eq. (39), we note that, Z D α ( P ( n ) θ , P ( n ) θ )˜ π n,α | X n ( dθ ) ≤ α − α Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) + α − α s Var[ R r n ( θ, θ ) ρ n ( dθ )] η + K ( ρ n , π ) − log( ǫ )1 − α . (68)Following from eq. (25) and eq. (28), we get that, Z K ( P ( n ) θ , P ( n ) θ ) ρ n ( dθ ) ≤ E[ r n ( θ , θ ∗ n )] + nǫ n , and Z Var[ r n ( θ, θ )] ρ n ( dθ ) ≤ nǫ n + 2Var[ r n ( θ ∗ n , θ )] . Plugging these into eq. (68), we are done. (cid:3)

References [1] Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their varia-tional approximations. Annals of Statistics, 48(3):1475–1497, 2020.[2] Horst Alzer. On some inequalities for the gamma and psi functions. Mathematics ofComputation, 66(217):373–389, 1997.[3] Anirban Bhattacharya, Debdeep Pati, and Yun Yang. Bayesian fractional posteriors. TheAnnals of Statistics, 47(1):39–66, 2019.[4] Lucien Birg´e. Robust testing for independent non identically distributed variables and Markovchains. In Specifying Statistical Models, pages 134–162. Springer, 1983.[5] Christopher M Bishop. Pattern recognition and machine learning. Springer, 2006.[6] David M Blei, Alp Kucukelbir, and Jon D McAuliﬀe. Variational inference: A review forstatisticians. Journal of the American statistical Association, 112(518):859–877, 2017.[7] Richard C. Bradley. Basic properties of strong mixing conditions. a survey and some openquestions. Probab. Surveys, 2:107–144, 2005.[8] Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variationalinference via χ upper bound minimization. In Advances in Neural Information ProcessingSystems, pages 2732–2741, 2017.[9] Monroe D Donsker and SR Srinivasa Varadhan. Asymptotic evaluation of certain Markovprocess expectations for large time, i. Communications on Pure and Applied Mathematics,28(1):1–47, 1975.[10] Subhashis Ghosal, Jayanta K Ghosh, and Aad W Van Der Vaart. Convergence rates of posteriordistributions. Annals of Statistics, 28(2):500–531, 2000.[11] Subhashis Ghosal and Aad W Van Der Vaart. Entropies and rates of convergence for maximumlikelihood and Bayes estimation for mixtures of normal densities. Annals of Statistics, pages1233–1263, 2001.[12] Bruce Hajek. Hitting-time and occupation-time bounds implied by drift analysis with appli-cations. Advances in Applied Probability, pages 502–525, 1982. [13] Ildar A Ibragimov. Some limit theorems for stationary processes. Theory of Probability & ItsApplications, 7(4):349–382, 1962.[14] Prateek Jaiswal, Harsha Honnappa, and Vinayak A. Rao. Asymptotic consistency of loss-calibrated variational Bayes. Stat, 9(1):e258. e258 sta4.258.[15] Prateek Jaiswal, Harsha Honnappa, and Vinayak A Rao. Risk-sensitive variational Bayes:Formulations and bounds. arXiv preprint arXiv:1903.05220 , 2019.[16] Prateek Jaiswal, Vinayak Rao, and Harsha Honnappa. Asymptotic consistency of α -R´enyi-approximate posteriors. Journal of Machine Learning Research, 21(156):1–42, 2020.[17] Galin L Jones. On the Markov chain central limit theorem. Probability Surveys, 1:299–320,2004.[18] Simon Lacoste-Julien, Ferenc Husz´ar, and Zoubin Ghahramani. Approximate inference for theloss-calibrated Bayesian. In International Conference on Artiﬁcial Intelligence and Statistics,pages 416–424, 2011.[19] Yingzhen Li and Richard E Turner. R´enyi divergence variational inference. Advances in NeuralInformation Processing Systems, 29:1073–1081, 2016.[20] Sean P Meyn and Richard L Tweedie. Markov chains and stochastic stability. Springer Science& Business Media, 2012.[21] John T Ormerod and Matt P Wand. Explaining variational approximations. The AmericanStatistician, 64(2):140–153, 2010.[22] Judith Rousseau. On the frequentist properties of bayesian nonparametric methods. AnnualReview of Statistics and Its Application , 3:211–231, 2016.[23] Daniil Ryabko. Testing statistical hypotheses about ergodic processes. In 2008 IEEERegion 8 International Conference on Computational Technologies in Electrical and ElectronicsEngineering, pages 257–260. IEEE, 2008.[24] Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior distributions. TheAnnals of Statistics, 29(3):687–714, 2001.[25] Martin J Wainwright and Michael I Jordan. Introduction to variational methods for graphicalmodels. Foundations and Trends in Machine Learning, 1:1–103, 2008.[26] Yixin Wang and David M Blei. Frequentist consistency of variational Bayes. Journal of theAmerican Statistical Association, 114(527):1147–1161, 2019.[27] Yun Yang, Debdeep Pati, and Anirban Bhattacharya. αα