Gibbs sampler and coordinate ascent variational inference: a set-theoretical review
Se Yoon Lee Department of Statistics, Texas A&M University, College Station, Texas, USA [email protected]
Abstract.
A central task in Bayesian machine learning is the approximation of the posterior distribution. The Gibbs sampler and coordinate ascent variational inference are widely used approximation techniques that rely on stochastic and deterministic approximations, respectively. This article clarifies that the two schemes can be explained more generally from a set-theoretical point of view. The alternative views are consequences of a duality formula for variational inference.
Keywords:
Gibbs sampler, coordinate ascent variational inference, duality formula
A statistical model contains a sample space of observations $y$ endowed with an appropriate $\sigma$-field of sets over which is given a family of probability measures. For almost all problems, it is sufficient to suppose that these probability measures can be described through their density functions, $p(y \mid \theta)$, indexed by a parameter $\theta$ belonging to the parameter space $\Theta$. In many problems, one of the essential goals is to make inference about the parameter $\theta$, and this article is particularly concerned with Bayesian inference.

Bayesian approaches start by expressing the uncertainty associated with the parameter $\theta$ through a density $\pi(\theta)$ supported on the parameter space $\Theta$, called a prior. The collection $\{p(y \mid \theta), \pi(\theta)\}$ is referred to as a Bayesian model. Given finite evidence $m(y) = \int p(y \mid \theta)\,\pi(\theta)\,d\theta$ for all $y$, Bayes' theorem formalizes an inversion process to learn the parameter $\theta$ given the observations $y$ through its posterior distribution [17]:
\[
\pi(\theta \mid y) = \frac{p(y \mid \theta)\,\pi(\theta)}{m(y)}. \tag{1}
\]
A central task in the application of Bayesian models is the evaluation of this density $\pi(\theta \mid y)$ (1), or indeed the computation of expectations with respect to this density. However, for many complex Bayesian models [6, 13], it is difficult to evaluate the posterior distribution. In such situations, we need to resort to approximation techniques, and these fall broadly into two classes, according to whether they rely on stochastic [8, 20, 21, 22] or deterministic [7, 19, 25, 29] approximations. See [1, 31] for reviews of these techniques.
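To make (1) concrete, the following minimal sketch evaluates a posterior by direct application of Bayes' theorem on a grid. The Bernoulli likelihood, Beta prior, and data below are illustrative assumptions (not a model from this article); the grid quadrature stands in for the evidence integral $m(y)$.

```python
import numpy as np
from scipy.stats import beta

# Toy Bayesian model (assumed for illustration): y_i ~ Bernoulli(theta), theta ~ Beta(2, 2).
y = np.array([1, 0, 1, 1, 0, 1, 1, 1])                        # observations
a0, b0 = 2.0, 2.0                                             # prior hyperparameters

theta = np.linspace(1e-4, 1 - 1e-4, 2000)                     # grid over the parameter space (0, 1)
dtheta = theta[1] - theta[0]
prior = beta.pdf(theta, a0, b0)                               # pi(theta)
lik = theta ** y.sum() * (1 - theta) ** (len(y) - y.sum())    # p(y | theta)

unnorm = lik * prior                                          # numerator of (1)
m_y = (unnorm * dtheta).sum()                                 # evidence m(y), approximated by quadrature
posterior = unnorm / m_y                                      # pi(theta | y)

# Sanity check against the closed-form conjugate posterior Beta(a0 + sum(y), b0 + n - sum(y)).
exact = beta.pdf(theta, a0 + y.sum(), b0 + len(y) - y.sum())
print("max abs error on the grid:", np.abs(posterior - exact).max())
```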
Among many techniques, the Gibbs sampler [8] and the coordinate ascent variational inference (CAVI) algorithm [7] are possibly the most popular techniques for approximating the target density $\pi(\theta \mid y)$ (1). In practice, they are flexibly combined with more sophisticated samplers or optimizers. For instance, the Gibbs sampler is often combined with the Metropolis-Hastings algorithm [3, 10] endowed with a suitable proposal density, and the CAVI algorithm is combined with (stochastic) gradient descent [28] endowed with a reasonable mean-field assumption.

Essentially, the utility of the two schemes may be ascribed to their exploitation of the conditional independences [12] formulated between latent variables (that is, components of the vector $\theta$) and observations $y$. Indeed, conditional independence is the key statistical property which enables us to decompose the original problem of approximating the joint density $\pi(\theta \mid y)$, possibly supported on a high-dimensional parameter space $\Theta$, into a collection of small problems of low dimensionality. A key feature of the resulting algorithm based on such a conditional independence is that a single cycle comprises multiple steps where at each step only a small fraction of the components of the parameter $\theta$ is updated, while the other components are fixed at their most recently updated values.

The aim of this article is to understand the Gibbs sampler and the CAVI algorithm from a set-theoretical point of view, and to clarify some common structure between the two schemes. Here, we say "set-theoretical understanding" in the sense that we will treat some fundamental densities participating in the two schemes as elements of some sets of densities. These sets are byproducts that naturally arise from Bayesian learning theory provided that the Gibbs sampler or the CAVI algorithm is employed to approximate the target $\pi(\theta \mid y)$ (1). A duality formula for variational inference is the fundamental formula which makes it possible to bridge set theory and Bayesian learning theory.

The present section states a duality formula for variational inference [18]. We first introduce some ingredients for the argument. Let $\Theta$ be a set endowed with an appropriate $\sigma$-field $\mathcal{F}$, and let $P$ and $Q$ be two probability measures, which form two probability spaces, $(\Theta, \mathcal{F}, P)$ and $(\Theta, \mathcal{F}, Q)$. We use the notation $Q \ll P$ to indicate that $Q$ is absolutely continuous with respect to $P$ (that is, $Q(A) = 0$ holds for any measurable set $A \in \mathcal{F}$ with $P(A) = 0$). The notation $\mathbb{E}_P[\cdot]$ denotes integration with respect to the probability measure $P$. Given any real-valued random variable $g$ defined on the probability space $(\Theta, \mathcal{F}, P)$, the notation $g \in L^1(P)$ represents that the random variable $g$ is integrable with respect to the measure $P$, that is, $\mathbb{E}_P[|g|] = \int |g|\,dP < \infty$. The notation $\mathrm{KL}(Q \,\|\, P)$ represents the Kullback-Leibler divergence from $P$ to $Q$, $\mathrm{KL}(Q \,\|\, P) = \int \log(dQ/dP)\,dQ$ [16].

Theorem 1 (Duality formula).
Consider two probability spaces $(\Theta, \mathcal{F}, P)$ and $(\Theta, \mathcal{F}, Q)$ with $Q \ll P$. Assume that there is a common dominating probability measure $\lambda$ such that $P \ll \lambda$ and $Q \ll \lambda$. Let $h$ denote any real-valued random variable on $(\Theta, \mathcal{F}, P)$ that satisfies $\exp h \in L^1(P)$. Then the following equality holds:
\[
\log \mathbb{E}_P[\exp h] = \sup_{Q \ll P} \bigl\{ \mathbb{E}_Q[h] - \mathrm{KL}(Q \,\|\, P) \bigr\}.
\]
Further, the supremum on the right-hand side is attained when
\[
\frac{q(\theta)}{p(\theta)} = \frac{\exp h(\theta)}{\mathbb{E}_P[\exp h]},
\]
where $p(\theta) = dP/d\lambda$ and $q(\theta) = dQ/d\lambda$ denote the Radon-Nikodym derivatives of the probability measures $P$ and $Q$ with respect to $\lambda$, respectively.

In practice, a common dominating measure $\lambda$ for $P$ and $Q$ is usually either Lebesgue or counting measure. In this paper, we particularly focus on the former case, where the duality formula in Theorem 1 can be expressed as
\[
\log \mathbb{E}_{p(\theta)}[\exp h(\theta)] = \sup_{q \ll p} \bigl\{ \mathbb{E}_{q(\theta)}[h(\theta)] - \mathrm{KL}(q \,\|\, p) \bigr\}, \tag{2}
\]
where $p(\theta) = dP/d\lambda$ and $q(\theta) = dQ/d\lambda$ are probability density functions (pdf) corresponding to the probability measures $P$ and $Q$, respectively, and $h(\theta)$ is any measurable function such that the expectation $\mathbb{E}_{p(\theta)}[\exp h(\theta)]$ is finite. Expectations in the equality (2) are taken with respect to the densities in the subscripts. For instance, the expectation $\mathbb{E}_{p(\theta)}[\exp h(\theta)]$ represents the integral $\int \exp h(\theta)\, p(\theta)\, d\theta$, and the Kullback-Leibler divergence is expressed in its pdf version, $\mathrm{KL}(q(\theta) \,\|\, p(\theta)) = \int q(\theta) \log(q(\theta)/p(\theta))\, d\theta$. In (2), we use the notation $q \ll p$ to indicate that the probability measures corresponding to the pdfs satisfy $Q \ll P$. (The case where $\lambda$ is a counting measure may be derived similarly, and we omit the results.)
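As a quick numerical illustration of (2), the following sketch takes $p$ to be a standard normal density and $h(\theta) = a\theta$, for which both sides of (2) are available in closed form. The specific choices of $p$, $h$, and the Gaussian variational family are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = 0.7                       # h(theta) = a * theta, an illustrative choice
# p(theta): standard normal, so log E_p[exp h] = a**2 / 2 in closed form.
lhs = a ** 2 / 2

def objective(m, s):
    """E_q[h] - KL(q || p) for q = Normal(m, s^2) and p = Normal(0, 1)."""
    kl = np.log(1.0 / s) + (s ** 2 + m ** 2) / 2.0 - 0.5
    return a * m - kl

# The supremum over all q << p is attained at q proportional to p * exp(h),
# which here is Normal(a, 1); other members of the family fall strictly below.
print("log E_p[exp h]            :", lhs)
print("objective at optimal q    :", objective(a, 1.0))
print("objective at q = N(0, 1)  :", objective(0.0, 1.0))
print("objective at q = N(2, .5) :", objective(2.0, 0.5))

# Monte Carlo check of the left-hand side.
theta = rng.standard_normal(1_000_000)
print("Monte Carlo log E_p[exp h]:", np.log(np.mean(np.exp(a * theta))))
```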
Consider a Bayesian model $\{p(y \mid \theta), \pi(\theta)\}$ where $p(y \mid \theta)$ is a data-generating process and $\pi(\theta)$ is a prior density, as explained in the Introduction. For the purpose of illustration, we additionally assume that the parameter space $\Theta$ is decomposed as
\[
\Theta = \prod_{i=1}^{K} \Theta_i = \Theta_1 \times \cdots \times \Theta_i \times \cdots \times \Theta_K, \tag{3}
\]
for some integer $K > 1$, where each of the component parameter spaces $\Theta_i$ ($i = 1, \cdots, K$) is allowed to be multidimensional. The notation $A \times B$ denotes the Cartesian product of two sets $A$ and $B$. Under the decomposition (3), elements of the set $\Theta$ can be expressed as $\theta = (\theta_1, \cdots, \theta_i, \cdots, \theta_K) \in \Theta$, where $\theta_i \in \Theta_i$ ($i = 1, \cdots, K$). For each $i$ ($i = 1, \cdots, K$), define a set which complements the $i$-th component parameter space $\Theta_i$:
\[
\Theta_{-i} = \prod_{l=1,\, l \neq i}^{K} \Theta_l = \Theta_1 \times \cdots \times \Theta_{i-1} \times \Theta_{i+1} \times \cdots \times \Theta_K. \tag{4}
\]
We denote an element of the set $\Theta_{-i}$ (4) by $\theta_{-i} = (\theta_1, \cdots, \theta_{i-1}, \theta_{i+1}, \cdots, \theta_K) \in \Theta_{-i}$.

It is important to emphasize that how to decompose the set $\Theta$ (that is, how to determine the integer $K$ or the dimensions of the component parameter spaces $\Theta_i$ in (3)) is at the discretion of the model builder. For instance, when a Bayesian model retains a certain hierarchical structure, he or she may impose a decomposition on the set $\Theta$ based on the conditional independence induced by the hierarchical structure among the latent variables $\theta_i$'s and the observations $y$.

Fig. 1. Venn diagram that overlappingly describes two set-inclusion relationships: (1) $\mathcal{Q}^{\mathrm{MF}}_{\theta|y} \subset \mathcal{Q}_{\theta|y} \subset \mathcal{Q}_{\theta}$, and (2) $\mathcal{Q}^{m}_{\theta_i|y} \subset \mathcal{Q}_{\theta_i|y} \subset \mathcal{Q}_{\theta_i}$ for each component index $i$ ($i = 1, \cdots, K$). The symbol $\bullet$ indicates elements of the sets.

Now, we define fundamental sets of densities, itemized as (i)–(v). They play crucial roles in Bayesian estimation of the parameter $\theta$ provided that the Gibbs sampler or the CAVI algorithm is carried out to approximate the target density $\pi(\theta \mid y)$ (1):

(i) The set $\mathcal{Q}_{\theta}$ is the collection of all densities supported on the parameter space $\Theta$. The set $\mathcal{Q}_{\theta|y}$ is the collection of all such densities conditioned on the observations $y$. By definition, the subset inclusion $\mathcal{Q}_{\theta|y} \subset \mathcal{Q}_{\theta}$ holds. A key element of the set $\mathcal{Q}_{\theta}$ is a prior density $\pi(\theta)$, and that of the set $\mathcal{Q}_{\theta|y}$ is the (target) posterior density $\pi(\theta \mid y)$ (1). As the prior density $\pi(\theta)$ is not conditioned on the observations $y$, it belongs to the set $\mathcal{Q}_{\theta} - \mathcal{Q}_{\theta|y} = \mathcal{Q}_{\theta} \cap (\mathcal{Q}_{\theta|y})^{c}$;

(ii) For each $i$ ($i = 1, \cdots, K$), the set $\mathcal{Q}_{\theta_i}$ is the collection of all densities supported on the $i$-th component parameter space $\Theta_i$, and the set $\mathcal{Q}_{\theta_i|y}$ denotes the collection of only the posterior densities supported on $\Theta_i$. This implies that the subset inclusion $\mathcal{Q}_{\theta_i|y} \subset \mathcal{Q}_{\theta_i}$ holds for each $i$. For each $i$, the full conditional posterior density $\pi(\theta_i \mid \theta_{-i}, y) = \pi(\theta_i, \theta_{-i}, y)/\pi(\theta_{-i}, y) = \pi(\theta, y)/\pi(\theta_{-i}, y)$ and the marginal posterior density $\pi(\theta_i \mid y)$ are typical elements of the set $\mathcal{Q}_{\theta_i|y}$;

(iii) For each $i$ ($i = 1, \cdots, K$), the set $\mathcal{Q}^{m}_{\theta_i}$ is the collection of all 'marginal' densities supported on the $i$-th component parameter space $\Theta_i$, and the set $\mathcal{Q}^{m}_{\theta_i|y}$ is the collection of only the 'marginal' posterior densities supported on $\Theta_i$. (The superscript '$m$' represents 'marginal'.) For each $i$, the marginal posterior density $\pi(\theta_i \mid y)$ belongs to $\mathcal{Q}^{m}_{\theta_i|y}$. However, the full conditional posterior density $\pi(\theta_i \mid \theta_{-i}, y) \in \mathcal{Q}_{\theta_i|y}$ may not belong to the set $\mathcal{Q}^{m}_{\theta_i|y}$ because it is conditioned on $\theta_{-i}$;

(iv) The Cartesian product of the sets $\{\mathcal{Q}^{m}_{\theta_i}\}_{i=1}^{K}$ (defined in item (iii)) defines a set
\[
\mathcal{Q}^{\mathrm{MF}}_{\theta} := \prod_{i=1}^{K} \mathcal{Q}^{m}_{\theta_i} = \mathcal{Q}^{m}_{\theta_1} \times \cdots \times \mathcal{Q}^{m}_{\theta_i} \times \cdots \times \mathcal{Q}^{m}_{\theta_K} \tag{5}
\]
\[
= \Bigl\{ q(\theta) \,\Big|\, q(\theta) = \prod_{i=1}^{K} q(\theta_i) = q(\theta_1) \cdots q(\theta_i) \cdots q(\theta_K), \; q(\theta_i) \in \mathcal{Q}^{m}_{\theta_i} \Bigr\}.
\]
The set $\mathcal{Q}^{\mathrm{MF}}_{\theta}$ (5) is referred to as the mean-field variational family [15], whose roots can be found in the statistical physics literature [2, 9, 24]. (The superscript '$\mathrm{MF}$' represents 'mean-field'.) Note that elements of the set $\mathcal{Q}^{\mathrm{MF}}_{\theta}$ are expressed as product-form distributions supported on the parameter space $\Theta$ (3).
Due to the definition of the marginal set $\mathcal{Q}^{m}_{\theta_i}$ ($i = 1, \cdots, K$) in item (iii), where the elements of the set can be any marginal density supported on the $i$-th component parameter space $\Theta_i$, elements of the set $\mathcal{Q}^{\mathrm{MF}}_{\theta}$ (5) enjoy a flexibility that is a nice feature of non-parametric densities, the only constraint on this flexibility being the (marginal) independence among the $\theta_i$'s induced by the mean-field theory (5) [23]. Likewise, we define a set $\mathcal{Q}^{\mathrm{MF}}_{\theta|y}$ via the Cartesian product of the sets $\mathcal{Q}^{m}_{\theta_i|y}$ ($i = 1, \cdots, K$),
\[
\mathcal{Q}^{\mathrm{MF}}_{\theta|y} := \prod_{i=1}^{K} \mathcal{Q}^{m}_{\theta_i|y} = \mathcal{Q}^{m}_{\theta_1|y} \times \cdots \times \mathcal{Q}^{m}_{\theta_i|y} \times \cdots \times \mathcal{Q}^{m}_{\theta_K|y};
\]

(v) For each $i$ ($i = 1, \cdots, K$), the Cartesian product of the sets $\{\mathcal{Q}^{m}_{\theta_l}\}_{l=1}^{K} - \{\mathcal{Q}^{m}_{\theta_i}\}$ (defined in (iii)) defines a set
\[
\mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}} := \prod_{l=1,\, l \neq i}^{K} \mathcal{Q}^{m}_{\theta_l} = \mathcal{Q}^{m}_{\theta_1} \times \cdots \times \mathcal{Q}^{m}_{\theta_{i-1}} \times \mathcal{Q}^{m}_{\theta_{i+1}} \times \cdots \times \mathcal{Q}^{m}_{\theta_K} \tag{6}
\]
\[
= \bigl\{ q(\theta_{-i}) \,\big|\, q(\theta_{-i}) = q(\theta_1) \cdots q(\theta_{i-1}) \cdot q(\theta_{i+1}) \cdots q(\theta_K), \; q(\theta_l) \in \mathcal{Q}^{m}_{\theta_l} \bigr\}.
\]
Elements of the set $\mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}}$ are expressed as product-form distributions supported on the $i$-th complementary parameter space $\Theta_{-i}$ (4). Similarly, we define
\[
\mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}|y} := \prod_{l=1,\, l \neq i}^{K} \mathcal{Q}^{m}_{\theta_l|y} = \mathcal{Q}^{m}_{\theta_1|y} \times \cdots \times \mathcal{Q}^{m}_{\theta_{i-1}|y} \times \mathcal{Q}^{m}_{\theta_{i+1}|y} \times \cdots \times \mathcal{Q}^{m}_{\theta_K|y}. \tag{7}
\]

Figure 1 shows a Venn diagram which depicts the set-inclusion relationships among the fundamental sets defined in items (i)–(v), along with their key elements. As seen from the panel, by notational definition, two chains of subset inclusions hold: (1) for densities supported on the entire parameter space $\Theta$, $\mathcal{Q}^{\mathrm{MF}}_{\theta|y} \subset \mathcal{Q}_{\theta|y} \subset \mathcal{Q}_{\theta}$; and (2) for densities supported on the $i$-th component parameter space $\Theta_i$ ($i = 1, \cdots, K$), $\mathcal{Q}^{m}_{\theta_i|y} \subset \mathcal{Q}_{\theta_i|y} \subset \mathcal{Q}_{\theta_i}$. Because the Venn diagram overlays these two inclusion relationships on a single panel for visualization purposes, it should not be interpreted as asserting that the subset inclusions $\mathcal{Q}^{m}_{\theta_i|y} \subset \mathcal{Q}^{\mathrm{MF}}_{\theta|y}$, $\mathcal{Q}_{\theta_i|y} \subset \mathcal{Q}_{\theta|y}$, and $\mathcal{Q}_{\theta_i} \subset \mathcal{Q}_{\theta}$ hold for each $i$ ($i = 1, \cdots, K$). Rather, it should be interpreted as saying that each of the sets $\mathcal{Q}^{m}_{\theta_i|y}$, $\mathcal{Q}_{\theta_i|y}$, and $\mathcal{Q}_{\theta_i}$ participates in each of the sets $\mathcal{Q}^{\mathrm{MF}}_{\theta|y}$, $\mathcal{Q}_{\theta|y}$, and $\mathcal{Q}_{\theta}$, respectively, as a factor of a Cartesian product.

Consider again a Bayesian model $\{p(y \mid \theta), \pi(\theta)\}$ as illustrated in the Introduction. The Gibbs sampler [8] is a Markov chain Monte Carlo (MCMC) sampling scheme to approximate the target density $\pi(\theta \mid y) \in \mathcal{Q}_{\theta|y}$ (1). A single cycle of the Gibbs sampler is executed by iteratively realizing a sample from each of the full conditional posteriors
\[
\pi(\theta_i \mid \theta_{-i}, y) \in \mathcal{Q}_{\theta_i|y}, \quad (i = 1, \cdots, K), \tag{8}
\]
while fixing the other full conditional posteriors. In each of the $K$ steps in the cycle, the latent variables conditioned on in the density (8) (that is, $\theta_{-i}$) are set to their most recently realized samples throughout the iterations.
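The following minimal sketch implements one such Gibbs cycle for an assumed toy conjugate model (a normal likelihood with unknown mean and precision under a normal-gamma prior), so that both full conditionals in (8) are available in closed form. The model, hyperparameters, and data are illustrative assumptions, not an experiment from this article.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data and model (assumed): y_i ~ N(mu, 1/tau),
# mu | tau ~ N(mu0, 1/(lam0 * tau)), tau ~ Gamma(a0, rate=b0).
y = rng.normal(loc=2.0, scale=1.5, size=50)
n, ybar = y.size, y.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 2.0, 2.0

n_iter = 5000
mu, tau = 0.0, 1.0                       # initial values
samples = np.empty((n_iter, 2))

for t in range(n_iter):
    # Step 1: draw from the full conditional pi(mu | tau, y), which is normal here.
    mu_n = (lam0 * mu0 + n * ybar) / (lam0 + n)
    lam_n = (lam0 + n) * tau
    mu = rng.normal(mu_n, 1.0 / np.sqrt(lam_n))

    # Step 2: draw from the full conditional pi(tau | mu, y), which is a gamma density.
    a_n = a0 + (n + 1) / 2.0
    b_n = b0 + 0.5 * (np.sum((y - mu) ** 2) + lam0 * (mu - mu0) ** 2)
    tau = rng.gamma(shape=a_n, scale=1.0 / b_n)

    samples[t] = mu, tau

burn = 1000
print("posterior mean of mu  ~", samples[burn:, 0].mean())
print("posterior mean of tau ~", samples[burn:, 1].mean())
```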
Variational inference is a functional optimization method to approximate the target density $\pi(\theta \mid y) \in \mathcal{Q}_{\theta|y}$ (1). Mean-field variational inference (MFVI) is a special kind of variational inference, principled on mean-field theory [9]. MFVI operates by minimizing the Kullback-Leibler divergence over a mean-field variational family $\mathcal{Q}^{\mathrm{MF}}_{\theta}$ (5) as follows:
\[
q^{*}(\theta) = \operatorname*{argmin}_{q(\theta) \in \mathcal{Q}^{\mathrm{MF}}_{\theta}} \mathrm{KL}(q(\theta) \,\|\, \pi(\theta \mid y)) \tag{9}
\]
\[
= q^{*}(\theta_1) \cdots q^{*}(\theta_i) \cdots q^{*}(\theta_K) \in \mathcal{Q}^{\mathrm{MF}}_{\theta|y}. \tag{10}
\]
The superscripts $*$ on each of the densities in (9) and (10) are marked to emphasize that the corresponding density has been optimized through an appropriate algorithm. The optimized full joint variational density $q^{*}(\theta)$ (9) is referred to as the variational Bayes (VB) posterior [30], and each of the optimized marginal variational densities $q^{*}(\theta_i)$ ($i = 1, \cdots, K$) (10) is referred to as a variational factor [7].

The CAVI algorithm [5, 7] is an algorithm that induces the functional minimization (9). A single cycle of the CAVI is carried out by iteratively updating each of the variational factors
\[
q^{*}(\theta_i) = \frac{\exp \mathbb{E}_{q(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]}{\int \exp \mathbb{E}_{q(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]\, d\theta_i} \in \mathcal{Q}^{m}_{\theta_i|y}, \quad (i = 1, \cdots, K), \tag{11}
\]
while fixing the other variational factors. In each of the $K$ steps within the cycle, the expectation $\mathbb{E}_{q(\theta_{-i})}[\cdot]$ in (11) is taken with respect to the most recently updated variational density $q(\theta_{-i}) \in \mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}|y}$ (7) throughout the iterations. For a derivation of (11), refer to [23].

We convey two key messages. First, the full conditional posterior $\pi(\theta_i \mid \theta_{-i}, y) \in \mathcal{Q}_{\theta_i|y}$ (8) plays a central role in the updating procedures not only of the Gibbs sampler but also of the CAVI algorithm [23]. Second, although the Gibbs sampler eventually leads to the exact target density $\pi(\theta \mid y) \in \mathcal{Q}_{\theta|y}$ (1) when the number of iterations is large, this property is not guaranteed for the CAVI. The latter is because there exists a distributional gap (represented via the Kullback-Leibler divergence) between the target $\pi(\theta \mid y)$ (1) and the VB posterior $q^{*}(\theta)$ (9) regardless of the number of iterations. Set-theoretically, this is obvious because the two elements $q^{*}(\theta)$ and $\pi(\theta \mid y)$ (can) belong to different sets [7] (refer to Figure 1).
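The CAVI update (11) can likewise be sketched for the same assumed toy model used in the Gibbs example above (normal likelihood with unknown mean and precision under a normal-gamma prior), for which each coordinate update has a closed form. The factor families, hyperparameters, and data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative model (assumed): y_i ~ N(mu, 1/tau),
# mu | tau ~ N(mu0, 1/(lam0 * tau)), tau ~ Gamma(a0, rate=b0).
# Mean-field family: q(mu, tau) = q(mu) q(tau), with q(mu) normal and q(tau) gamma.
y = rng.normal(loc=2.0, scale=1.5, size=50)
n, ybar = y.size, y.mean()
mu0, lam0, a0, b0 = 0.0, 1.0, 2.0, 2.0

E_tau = 1.0                              # initial guess for E_q[tau]
for _ in range(100):                     # CAVI cycles
    # Update q(mu) = N(mu_n, 1/lam_n): expectation of log pi(mu | tau, y) under q(tau).
    mu_n = (lam0 * mu0 + n * ybar) / (lam0 + n)
    lam_n = (lam0 + n) * E_tau

    # Update q(tau) = Gamma(a_n, rate=b_n): expectation of log pi(tau | mu, y) under q(mu).
    E_mu, V_mu = mu_n, 1.0 / lam_n
    a_n = a0 + (n + 1) / 2.0
    b_n = b0 + 0.5 * (np.sum((y - E_mu) ** 2) + n * V_mu
                      + lam0 * ((E_mu - mu0) ** 2 + V_mu))
    E_tau = a_n / b_n

print("variational factor q(mu) : Normal(mean=%.3f, var=%.4f)" % (mu_n, 1.0 / lam_n))
print("variational factor q(tau): Gamma(shape=%.2f, rate=%.3f)" % (a_n, b_n))
```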
The duality formula (2) provides an alternative view of the Gibbs sampler from the perspective of functional optimization:

Corollary 1.
Consider a Bayesian model $\{p(y \mid \theta), \pi(\theta)\}$ with the entire parameter space $\Theta$ decomposed as (3). Assume that the Gibbs sampler is used to approximate the target density $\pi(\theta \mid y)$ (1). Define a functional $F_i : \mathcal{Q}_{\theta_i} \to \mathbb{R}$ induced by the duality formula for each $i$ ($i = 1, \cdots, K$) as follows:
\[
F_i\{q(\theta_i)\} = \mathbb{E}_{q(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)] - \mathrm{KL}(q(\theta_i) \,\|\, \pi(\theta_i \mid y)). \tag{12}
\]
Then the following hold for each $i$ ($i = 1, \cdots, K$):

(a) the functional $F_i$ is concave over $\mathcal{Q}_{\theta_i}$;

(b) for all densities $q(\theta_i) \in \mathcal{Q}_{\theta_i|y}$, $F_i\{q(\theta_i)\} \leq \log \pi(\theta_{-i} \mid y)$;

(c) the functional $F_i$ attains its maximum value (that is, $\log \pi(\theta_{-i} \mid y)$) at the full conditional posterior $q(\theta_i) = \pi(\theta_i \mid \theta_{-i}, y)$ (8).

Corollary 1 states that for each $i$ ($i = 1, \cdots, K$) the full conditional posterior $\pi(\theta_i \mid \theta_{-i}, y) \in \mathcal{Q}_{\theta_i|y} \subset \mathcal{Q}_{\theta_i}$ (8) is a global maximum of the functional $F_i : \mathcal{Q}_{\theta_i} \to \mathbb{R}$ (12). See panel (a) of Figure 2 for a pictorial description.
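Corollary 1 can also be checked numerically on a small discrete grid, using the counting-measure analogue of (2) mentioned earlier. The two-dimensional "posterior" below is an arbitrary illustrative construction, not a model from this article.

```python
import numpy as np

rng = np.random.default_rng(2)

# An arbitrary discrete "posterior" pi(theta_1, theta_2 | y) on a 60 x 60 grid (assumed).
g1, g2 = np.meshgrid(np.linspace(-3, 3, 60), np.linspace(-3, 3, 60), indexing="ij")
joint = np.exp(-0.5 * (g1 ** 2 + g2 ** 2 - 1.6 * g1 * g2) / (1 - 0.8 ** 2))
joint /= joint.sum()

marg1 = joint.sum(axis=1)                  # pi(theta_1 | y)
marg2 = joint.sum(axis=0)                  # pi(theta_2 | y)
cond2_given_1 = joint / marg1[:, None]     # pi(theta_2 | theta_1, y)

def F1(q1, j):
    """Functional (12) for i = 1, with theta_{-1} fixed at grid point j."""
    return np.sum(q1 * np.log(cond2_given_1[:, j])) - np.sum(q1 * np.log(q1 / marg1))

j = 40                                     # a fixed value of theta_{-1} = theta_2
full_cond = joint[:, j] / marg2[j]         # pi(theta_1 | theta_2, y), the claimed maximizer
other = rng.dirichlet(np.ones(60))         # an arbitrary competing density on the grid

print("log pi(theta_2 | y) at j    :", np.log(marg2[j]))
print("F_1 at the full conditional :", F1(full_cond, j))   # attains the bound, as in (c)
print("F_1 at an arbitrary density :", F1(other, j))       # falls below the bound, as in (b)
```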
Fig. 2. Pictorial illustrations for (a) a Gibbs sampler and (b) a CAVI algorithm. For each $i$ ($i = 1, \cdots, K$), panel (a) shows that the full conditional posterior $\pi(\theta_i \mid \theta_{-i}, y)$ (8) is a global maximum of the functional $F_i$ (12); and panel (b) shows that the optimized variational factor $q^{*}(\theta_i)$ (11) can be squashed by a constant $R_{-i}\{q^{*}(\theta_{-i})\}$ so that the byproduct $R_{-i}\{q^{*}(\theta_{-i})\} \cdot q^{*}(\theta_i)$ is kept below the marginal target density $\pi(\theta_i \mid y)$ on the $i$-th component parameter space $\Theta_i$.

Under the MFVI assumption, for each $i$ ($i = 1, \cdots, K$) we can regard an optimized $i$-th variational factor $q^{*}(\theta_i)$ (11) as a surrogate for the marginal target density $\pi(\theta_i \mid y)$. Note that the two densities belong to the same set $\mathcal{Q}^{m}_{\theta_i|y}$ (defined in item (iii)); refer to the Venn diagram in Figure 1. This suggests that the 'intrinsic' approximation quality of the MFVI can be explained by the Kullback-Leibler divergence $\mathrm{KL}(q^{*}(\theta_i) \,\|\, \pi(\theta_i \mid y))$, or a lower bound on it, for each $i$: lower values indicate a better approximation quality under the mean-field theory (5).

In practice, although it is possible to sample from the marginal posterior density $\pi(\theta_i \mid y)$ ($i = 1, \cdots, K$) through various MCMC techniques [14], it is difficult to obtain an analytic expression for the density $\pi(\theta_i \mid y)$, and hence for the divergence $\mathrm{KL}(q^{*}(\theta_i) \,\|\, \pi(\theta_i \mid y))$. It is also nontrivial to acquire a lower bound for $\mathrm{KL}(q^{*}(\theta_i) \,\|\, \pi(\theta_i \mid y))$ through information inequalities (for example, Pinsker's inequality [18]), as such inequalities again require a closed-form expression for the density $\pi(\theta_i \mid y)$ for each $i$ ($i = 1, \cdots, K$).

The duality formula (2) provides an alternative view of the MFVI and an algorithm-based lower bound for $\mathrm{KL}(q^{*}(\theta_i) \,\|\, \pi(\theta_i \mid y))$ for each $i$ ($i = 1, \cdots, K$), provided the CAVI algorithm (11) is employed:

Corollary 2.
Consider a Bayesian model $\{p(y \mid \theta), \pi(\theta)\}$ with the entire parameter space $\Theta$ decomposed as (3). Assume the CAVI algorithm is used to approximate the target density $\pi(\theta \mid y)$ (1) through a variational density $q(\theta)$ that belongs to the mean-field family (5). Define a functional $R_{-i} : \mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}} \to (0, \infty)$ for each $i$ ($i = 1, \cdots, K$):
\[
R_{-i}\{q(\theta_{-i})\} = \frac{\int \exp \mathbb{E}_{q(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]\, d\theta_i}{\exp \mathrm{KL}(q(\theta_{-i}) \,\|\, \pi(\theta_{-i} \mid y))}.
\]
Let $q^{*}(\theta_{-i}) \in \mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}|y}$ represent an optimized variational density for $\theta_{-i}$, that is,
\[
q^{*}(\theta_{-i}) = q^{*}(\theta_1) \cdots q^{*}(\theta_{i-1}) \cdot q^{*}(\theta_{i+1}) \cdots q^{*}(\theta_K),
\]
where each variational factor on the right-hand side has been optimized through the formula (11). Then the following hold for each $i$ ($i = 1, \cdots, K$):

(a) the variational factor $q^{*}(\theta_i)$ is squashed by the constant $R_{-i}\{q^{*}(\theta_{-i})\} \in (0, 1]$:
\[
R_{-i}\{q^{*}(\theta_{-i})\} \cdot q^{*}(\theta_i) \leq \pi(\theta_i \mid y) \quad \text{for all } \theta_i \in \Theta_i; \tag{13}
\]

(b) the Kullback-Leibler divergence between $q^{*}(\theta_i)$ and $\pi(\theta_i \mid y)$ is lower bounded by
\[
\mathrm{KL}(q^{*}(\theta_i) \,\|\, \pi(\theta_i \mid y)) \geq \max\Bigl\{ 0, \, \log\Bigl( \int \exp \mathbb{E}_{q^{*}(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)]\, d(\theta_{-i}) \Bigr) \Bigr\}. \tag{14}
\]

Corollary 2 (a) states that for each $i$ ($i = 1, \cdots, K$) there is a constant which uniformly presses the surrogate $q^{*}(\theta_i)$ from above on the $i$-th component parameter space $\Theta_i$ so that the inequality (13) holds; this distributional inequality is depicted in panel (b) of Figure 2. Corollary 2 (b) suggests that the denominator in the CAVI formula (11) plays an important role by participating in a lower bound on the distance $\mathrm{KL}(q^{*}(\theta_i) \,\|\, \pi(\theta_i \mid y))$. Note that the lower bound is algorithm-based, and it may be approximated via a Monte Carlo algorithm.
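As a small sanity check of the squashing inequality (13), the following sketch runs the CAVI update (11) on a discretized two-dimensional "posterior" (the counting-measure analogue), computes $R_{-1}\{q^{*}(\theta_{-1})\}$ directly, and verifies (13) on the grid; it also prints $\mathrm{KL}(q^{*}(\theta_1) \,\|\, \pi(\theta_1 \mid y))$ next to the bound (14). The target density and grid are illustrative assumptions.

```python
import numpy as np

# An arbitrary correlated discrete "posterior" pi(theta_1, theta_2 | y) on a grid (assumed).
g1, g2 = np.meshgrid(np.linspace(-3, 3, 80), np.linspace(-3, 3, 80), indexing="ij")
joint = np.exp(-0.5 * (g1 ** 2 + g2 ** 2 - 1.6 * g1 * g2) / (1 - 0.8 ** 2))
joint /= joint.sum()

marg1, marg2 = joint.sum(axis=1), joint.sum(axis=0)   # marginal targets pi(theta_i | y)
cond1 = joint / marg2[None, :]                         # pi(theta_1 | theta_2, y)
cond2 = joint / marg1[:, None]                         # pi(theta_2 | theta_1, y)

# CAVI on the grid: q(theta_i) proportional to exp E_{q(theta_{-i})}[log pi(theta_i | theta_{-i}, y)].
q1 = np.full(80, 1 / 80)
q2 = np.full(80, 1 / 80)
for _ in range(50):
    q2 = np.exp(q1 @ np.log(cond2)); q2 /= q2.sum()
    q1 = np.exp(np.log(cond1) @ q2); q1 /= q1.sum()    # final q1 is computed from the final q2 via (11)

# R_{-1}{q2*}: numerator of (11) summed over theta_1, divided by exp KL(q2* || pi(theta_2 | y)).
numer = np.exp(np.log(cond1) @ q2).sum()
kl_2 = np.sum(q2 * np.log(q2 / marg2))
R = numer / np.exp(kl_2)

print("R_{-1}{q2*} in (0, 1]            :", R)
print("inequality (13) holds on the grid:", bool(np.all(R * q1 <= marg1 + 1e-12)))
print("KL(q1* || pi(theta_1|y)) vs bound (14):",
      np.sum(q1 * np.log(q1 / marg1)), max(0.0, np.log(np.exp(q1 @ np.log(cond2)).sum())))
```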
This paper revisited the Gibbs sampler and the CAVI algorithm in order to clarify the two algorithms with set theory, thereby providing an intuitive understanding of them. We explained the two algorithms by treating some key ingredients participating in the algorithms as elements of fundamental sets that naturally arise from a duality formula for variational inference. Among the novel findings, one key discovery was that the full conditional posterior distribution can be viewed as a global maximum of a functional associated with the duality formula. Additionally, we found that the formula links the denominator of the variational factor, which is often disregarded in the literature, with the approximation quality of the MFVI induced by the mean-field theory.

Proof – Theorem 1
We prove the theorem by using measure theory [27]. (See page 99 of [18] for an alternative proof which uses properties of entropy.) Due to the dominating assumptions $P \ll \lambda$ and $Q \ll \lambda$ and the Radon-Nikodym theorem (Theorem 32.1 of [4]), there exist Radon-Nikodym derivatives (also called generalized probability densities [16]) $p(\theta) = dP/d\lambda$ and $q(\theta) = dQ/d\lambda$, unique up to sets of measure (probability) zero in $\lambda$, corresponding to the measures $P$ and $Q$, respectively. On the other hand, due to the dominating assumption $Q \ll P$, there exists a Radon-Nikodym derivative $dQ/dP$; hence, the Kullback-Leibler divergence $\mathrm{KL}(Q \,\|\, P) = \int \log(dQ/dP)\, dQ$ is well-defined and finite. Using conventional measure-theoretic notation (for example, see page 4 of [16]), we can also write $dP(\theta) = p(\theta)\, d\lambda(\theta)$ and $dQ(\theta) = q(\theta)\, d\lambda(\theta)$, and $\int g\, dP = \int g(\theta)\, dP(\theta)$ for any $g \in L^1(P)$ [26].

Now, it is straightforward to prove the equality of the duality formula:
\begin{align*}
\mathbb{E}_Q[h] - \mathrm{KL}(Q \,\|\, P)
&= \int h\, dQ - \int \log\Bigl(\frac{dQ}{dP}\Bigr)\, dQ
 = \int h(\theta)\, dQ(\theta) - \int \log\Bigl(\frac{dQ(\theta)}{dP(\theta)}\Bigr)\, dQ(\theta) \\
&= \int h(\theta)\, q(\theta)\, d\lambda(\theta) - \int \log\Bigl(\frac{q(\theta)}{p(\theta)}\Bigr)\, q(\theta)\, d\lambda(\theta)
 = \int \log\Bigl(\frac{e^{h(\theta)} p(\theta)}{q(\theta)}\Bigr)\, q(\theta)\, d\lambda(\theta) \\
&\leq \log\Bigl( \int \Bigl(\frac{e^{h(\theta)} p(\theta)}{q(\theta)}\Bigr)\, q(\theta)\, d\lambda(\theta) \Bigr) \tag{15} \\
&= \log\Bigl( \int e^{h(\theta)}\, p(\theta)\, d\lambda(\theta) \Bigr)
 = \log\Bigl( \int e^{h(\theta)}\, dP(\theta) \Bigr)
 = \log\Bigl( \int e^{h}\, dP \Bigr)
 = \log \mathbb{E}_P[\exp h].
\end{align*}
Note that Jensen's inequality is used to derive the inequality in (15). This inequality becomes an equality when $e^{h(\theta)} p(\theta)/q(\theta)$ is constant with respect to $\theta$, which finalizes the proof.

Proof – Corollary 1

(a) Let $p(\theta_i)$ and $q(\theta_i)$ be elements of the set $\mathcal{Q}_{\theta_i}$. For any $0 \leq a \leq 1$, start with
\[
F_i\{a\, p(\theta_i) + (1-a)\, q(\theta_i)\} = \int \log \pi(\theta_{-i} \mid \theta_i, y)\, \{a\, p(\theta_i) + (1-a)\, q(\theta_i)\}\, d\theta_i - \mathrm{KL}(a\, p(\theta_i) + (1-a)\, q(\theta_i) \,\|\, \pi(\theta_i \mid y)). \tag{16}
\]
The integral term on the right-hand side can be written as
\[
a\, \mathbb{E}_{p(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)] + (1-a)\, \mathbb{E}_{q(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)], \tag{17}
\]
where the expectations $\mathbb{E}_{p(\theta_i)}[\cdot]$ and $\mathbb{E}_{q(\theta_i)}[\cdot]$ are taken with respect to the densities $p(\theta_i)$ and $q(\theta_i)$, respectively. The (negative of the) second term on the right-hand side of (16) satisfies the inequality
\[
\mathrm{KL}(a\, p(\theta_i) + (1-a)\, q(\theta_i) \,\|\, \pi(\theta_i \mid y)) \leq a\, \mathrm{KL}(p(\theta_i) \,\|\, \pi(\theta_i \mid y)) + (1-a)\, \mathrm{KL}(q(\theta_i) \,\|\, \pi(\theta_i \mid y)). \tag{18}
\]
(The inequality (18) holds in general due to the joint convexity of the $f$-divergence; see Lemma 4.1 of [11].)

Now, use the expression (17) and the inequality (18) to finish the proof:
\[
F_i\{a\, p(\theta_i) + (1-a)\, q(\theta_i)\} \geq a\, \bigl\{ \mathbb{E}_{p(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)] - \mathrm{KL}(p(\theta_i) \,\|\, \pi(\theta_i \mid y)) \bigr\} + (1-a)\, \bigl\{ \mathbb{E}_{q(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)] - \mathrm{KL}(q(\theta_i) \,\|\, \pi(\theta_i \mid y)) \bigr\} = a\, F_i\{p(\theta_i)\} + (1-a)\, F_i\{q(\theta_i)\}.
\]

(b) and (c) For each $i = 1, \cdots, K$, use the duality formula (2) by replacing the $q(\theta)$, $p(\theta)$, and $h(\theta)$ in the formula with $q(\theta_i) \in \mathcal{Q}_{\theta_i}$, $\pi(\theta_i \mid y) \in \mathcal{Q}_{\theta_i}$, and $\log \pi(\theta_{-i} \mid \theta_i, y)$, respectively. (Recall that in the formula (2), $q$ and $p$ need to be densities, whereas $h$ is a measurable function.)
This leads to
\[
\log \mathbb{E}_{\pi(\theta_i | y)}[\pi(\theta_{-i} \mid \theta_i, y)] = \sup_{q(\theta_i) \ll \pi(\theta_i | y)} \bigl\{ \mathbb{E}_{q(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)] - \mathrm{KL}(q(\theta_i) \,\|\, \pi(\theta_i \mid y)) \bigr\}. \tag{19}
\]
Theorem 1 also tells us that the supremum on the right-hand side of (19) is attained when
\[
q(\theta_i) = \frac{\pi(\theta_i \mid y) \cdot \pi(\theta_{-i} \mid \theta_i, y)}{\mathbb{E}_{\pi(\theta_i | y)}[\pi(\theta_{-i} \mid \theta_i, y)]} = \pi(\theta_i \mid \theta_{-i}, y) \in \mathcal{Q}_{\theta_i|y}.
\]
On the other hand, it is straightforward to derive that the left-hand side of (19), $\log \mathbb{E}_{\pi(\theta_i | y)}[\pi(\theta_{-i} \mid \theta_i, y)]$, simplifies to $\log \pi(\theta_{-i} \mid y)$. Finalize the proof by using the above facts: it holds that
\[
\mathbb{E}_{q(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)] - \mathrm{KL}(q(\theta_i) \,\|\, \pi(\theta_i \mid y)) \leq \log \pi(\theta_{-i} \mid y)
\]
for all densities $q(\theta_i)$ supported on the $i$-th component parameter space $\Theta_i$ which satisfy the dominating condition $q(\theta_i) \ll \pi(\theta_i \mid y)$, where the equality holds if $q(\theta_i) = \pi(\theta_i \mid \theta_{-i}, y) \in \mathcal{Q}_{\theta_i|y}$. (We used the definitions of the notations $q \ll p$ and $\mathcal{Q}_{\theta_i|y}$ to conclude.)

Proof – Corollary 2

(a) To start with, for each $i$ ($i = 1, \cdots, K$), define a functional $F_{-i} : \mathcal{Q}_{\theta_{-i}} \to \mathbb{R}$ that complements the functional $F_i$ (12):
\[
F_{-i}\{q(\theta_{-i})\} = \mathbb{E}_{q(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)] - \mathrm{KL}(q(\theta_{-i}) \,\|\, \pi(\theta_{-i} \mid y)). \tag{20}
\]
For each $i = 1, \cdots, K$, use the duality formula (2) by replacing the $q(\theta)$, $p(\theta)$, and $h(\theta)$ in the formula with $q(\theta_{-i}) \in \mathcal{Q}_{\theta_{-i}}$, $\pi(\theta_{-i} \mid y) \in \mathcal{Q}_{\theta_{-i}}$, and $\log \pi(\theta_i \mid \theta_{-i}, y)$, respectively, which leads to
\[
\log \pi(\theta_i \mid y) = \sup_{q(\theta_{-i}) \ll \pi(\theta_{-i} | y)} F_{-i}\{q(\theta_{-i})\}. \tag{21}
\]
Now, take $\exp(\cdot)$ of both sides of (21), and then interchange the $\exp(\cdot)$ and $\sup(\cdot)$ (which is valid since $\exp$ is increasing) to obtain
\begin{align*}
\pi(\theta_i \mid y) &= \exp\Bigl[ \sup_{q(\theta_{-i}) \ll \pi(\theta_{-i} | y)} F_{-i}\{q(\theta_{-i})\} \Bigr] = \sup_{q(\theta_{-i}) \ll \pi(\theta_{-i} | y)} \bigl[ \exp F_{-i}\{q(\theta_{-i})\} \bigr] \\
&= \sup_{q(\theta_{-i}) \ll \pi(\theta_{-i} | y)} \biggl[ \frac{\exp \mathbb{E}_{q(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]}{\exp \mathrm{KL}(q(\theta_{-i}) \,\|\, \pi(\theta_{-i} \mid y))} \biggr] \\
&\geq \sup_{q(\theta_{-i}) \ll \pi(\theta_{-i} | y),\; q(\theta_{-i}) \in \mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}}} \biggl[ \frac{\exp \mathbb{E}_{q(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]}{\exp \mathrm{KL}(q(\theta_{-i}) \,\|\, \pi(\theta_{-i} \mid y))} \biggr]. \tag{22}
\end{align*}
The last inequality holds because of a general property of the supremum: $\sup_A(\cdot) \geq \sup_B(\cdot)$ if $B \subset A$. On the other hand, a CAVI-optimized variational density for $\theta_{-i}$, denoted $q^{*}(\theta_{-i})$, can be represented by
\[
q^{*}(\theta_{-i}) = q^{*}(\theta_1) \cdots q^{*}(\theta_{i-1}) \cdot q^{*}(\theta_{i+1}) \cdots q^{*}(\theta_K) \in \mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}|y}, \tag{23}
\]
where each of the components on the right-hand side has been optimized through the CAVI optimization formula (11).
Clearly, the density $q^{*}(\theta_{-i})$ (23) belongs to the set
\[
B := \bigl\{ q : \Theta_{-i} \to [0, \infty) \,\big|\, q \text{ is a density supported on } \Theta_{-i},\; q(\theta_{-i}) \ll \pi(\theta_{-i} \mid y),\; q(\theta_{-i}) \in \mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}} \bigr\},
\]
which is the set considered in the $\sup(\cdot)$ of (22). Now, use the definition of the supremum and the simple identity $a \times (1/a) = 1$ to derive the following inequality:
\begin{align*}
\pi(\theta_i \mid y) &\geq \frac{\exp \mathbb{E}_{q^{*}(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]}{\exp \mathrm{KL}(q^{*}(\theta_{-i}) \,\|\, \pi(\theta_{-i} \mid y))} \\
&= \frac{\int \exp \mathbb{E}_{q^{*}(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]\, d\theta_i}{\exp \mathrm{KL}(q^{*}(\theta_{-i}) \,\|\, \pi(\theta_{-i} \mid y))} \times \frac{\exp \mathbb{E}_{q^{*}(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]}{\int \exp \mathbb{E}_{q^{*}(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]\, d\theta_i} \\
&= R_{-i}\{q^{*}(\theta_{-i})\} \times q^{*}(\theta_i) \quad \text{on } \Theta_i, \tag{24}
\end{align*}
where
\[
R_{-i}\{q(\theta_{-i})\} = \frac{\int \exp \mathbb{E}_{q(\theta_{-i})}[\log \pi(\theta_i \mid \theta_{-i}, y)]\, d\theta_i}{\exp \mathrm{KL}(q(\theta_{-i}) \,\|\, \pi(\theta_{-i} \mid y))} : \mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}} \to (0, \infty)
\]
and $q^{*}(\theta_i) \in \mathcal{Q}^{m}_{\theta_i|y}$ (11). Finally, because $\pi(\theta_i \mid y)$ and $q^{*}(\theta_i)$ are densities, by taking $\int \cdot\, d\theta_i$ on both sides of (24), we further obtain $0 < R_{-i}\{q^{*}(\theta_{-i})\} \leq 1$.

(b) For each $i$ ($i = 1, \cdots, K$), use the same logic as in the proof of (a) to obtain the inequality
\[
\pi(\theta_{-i} \mid y) \geq \frac{\exp \mathbb{E}_{q^{*}(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)]}{\exp \mathrm{KL}(q^{*}(\theta_i) \,\|\, \pi(\theta_i \mid y))} = R_i\{q^{*}(\theta_i)\} \cdot q^{*}(\theta_{-i}) \quad \text{on } \Theta_{-i}, \tag{25}
\]
where
\[
R_i\{q(\theta_i)\} = \frac{\int \exp \mathbb{E}_{q(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)]\, d(\theta_{-i})}{\exp \mathrm{KL}(q(\theta_i) \,\|\, \pi(\theta_i \mid y))} : \mathcal{Q}^{m}_{\theta_i} \to (0, \infty)
\]
and $q^{*}(\theta_{-i}) \in \mathcal{Q}^{\mathrm{MF}}_{\theta_{-i}|y}$ is given by (23). Because $\pi(\theta_{-i} \mid y)$ and $q^{*}(\theta_{-i})$ are densities, by taking $\int \cdot\, d(\theta_{-i})$ on both sides of (25), we have $0 < R_i\{q^{*}(\theta_i)\} \leq 1$. Finally, as the Kullback-Leibler divergence is non-negative, we conclude the proof:
\[
\mathrm{KL}(q^{*}(\theta_i) \,\|\, \pi(\theta_i \mid y)) \geq \max\Bigl\{ 0, \, \log\Bigl( \int \exp \mathbb{E}_{q^{*}(\theta_i)}[\log \pi(\theta_{-i} \mid \theta_i, y)]\, d(\theta_{-i}) \Bigr) \Bigr\}.
\]

Bibliography

[1] Christophe Andrieu, Nando De Freitas, Arnaud Doucet, and Michael I. Jordan. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.
[2] Rodney J. Baxter.
Exactly Solved Models in Statistical Mechanics. Elsevier, 2016.
[3] Isabel Beichl and Francis Sullivan. The Metropolis algorithm. Computing in Science & Engineering, 2(1):65–69, 2000.
[4] Patrick Billingsley. Probability and Measure. John Wiley & Sons, 2008.
[5] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
[6] David M. Blei, Michael I. Jordan, et al. Variational inference for Dirichlet process mixtures. Bayesian Analysis, 1(1):121–143, 2006.
[7] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
[8] George Casella and Edward I. George. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.
[9] David Chandler. Introduction to Modern Statistical Mechanics. Oxford University Press, Oxford, UK, 1987.
[10] Siddhartha Chib and Edward Greenberg. Understanding the Metropolis-Hastings algorithm. The American Statistician, 49(4):327–335, 1995.
[11] Imre Csiszár, Paul C. Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
[12] A. Philip Dawid. Conditional independence in statistical theory. Journal of the Royal Statistical Society: Series B (Methodological), 41(1):1–15, 1979.
[13] Yarin Gal. Uncertainty in deep learning. University of Cambridge, 1:3, 2016.
[14] Andrew Gelman, John B. Carlin, Hal S. Stern, David B. Dunson, Aki Vehtari, and Donald B. Rubin. Bayesian Data Analysis. CRC Press, 2013.
[15] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
[16] Solomon Kullback. Information Theory and Statistics. Courier Corporation, 1997.
[17] Pierre Simon Laplace. Memoir on the probability of the causes of events. Statistical Science, 1(3):364–378, 1986.
[18] Pascal Massart. Concentration Inequalities and Model Selection, volume 6. Springer, 2007.
[19] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. arXiv preprint arXiv:1301.2294, 2013.
[20] Iain Murray, Ryan Prescott Adams, and David J. C. MacKay. Elliptical slice sampling. 2010.
[21] Radford M. Neal. Slice sampling. Annals of Statistics, pages 705–741, 2003.
[22] Radford M. Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.
[23] John T. Ormerod and Matt P. Wand. Explaining variational approximations. The American Statistician, 64(2):140–153, 2010.
[24] Giorgio Parisi. Statistical Field Theory. Addison-Wesley, 1988.
[25] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822, 2014.
[26] Sidney I. Resnick. A Probability Path. Springer, 2003.
[27] Halsey Lawrence Royden and Patrick Fitzpatrick. Real Analysis, volume 32. Macmillan, New York, 1988.
[28] Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747, 2016.
[29] Chong Wang and David M. Blei. Variational inference in nonconjugate models. Journal of Machine Learning Research, 14(Apr):1005–1031, 2013.
[30] Yixin Wang and David M. Blei. Frequentist consistency of variational Bayes. Journal of the American Statistical Association, 114(527):1147–1161, 2019.
[31] Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference.