α-Variational Inference with Statistical Guarantees

Yun Yang*, Debdeep Pati†, and Anirban Bhattacharya‡
Department of Statistics, Florida State University; Department of Statistics, Texas A&M University
* [email protected]  † [email protected]  ‡ [email protected]
Abstract
We propose a family of variational approximations to Bayesian posterior distributions, called α-VB, with provable statistical guarantees. The standard variational approximation is a special case of α-VB with α = 1. When α ∈ (0, 1), we show that point estimates constructed from the α-VB procedure converge at an optimal rate to the true parameter in a wide range of problems. We illustrate our general theory with a number of examples, including the mean-field variational approximation to low- and high-dimensional Bayesian linear regression with spike and slab priors, mixture of Gaussian models, latent Dirichlet allocation, and (mixture of) Gaussian variational approximation in regular parametric models.

Keywords: Bayes risk; Evidence lower bound; Latent variable models; Rényi divergence; Variational inference.
1 Introduction

Variational inference [25, 38] is a widely-used tool for approximating complicated probability densities, especially those arising as posterior distributions from complex hierarchical Bayesian models. It provides an alternative strategy to Markov chain Monte Carlo (MCMC,
20, 15]) sampling by turning the sampling/inference problem into an optimization problem, where the closest member, relative to the Kullback–Leibler (KL) divergence, in a family of approximate densities is picked out as a proxy to the target density. Variational inference has found success in a variety of contexts, especially in models involving latent variables, such as hidden Markov models [30], graphical models [3, 38], mixture models [23, 14, 35], and topic models [9, 11], among others. See the recent review paper [10] by Blei et al. for a comprehensive introduction to variational inference.

The popularity of variational methods can be largely attributed to their computational advantages over MCMC. It has been empirically observed in many applications that variational inference operates orders of magnitude faster than MCMC for achieving the same approximation accuracy. Moreover, compared to MCMC, variational inference tends to be easier to scale to big data due to its inherent optimization nature, and can take advantage of modern optimization techniques such as stochastic optimization [27, 26] and distributed optimization [1]. However, unlike MCMC, which is guaranteed to produce (almost) exact samples from the target density for ergodic chains [33], variational inference does not enjoy such a general theoretical guarantee.

Several threads of research have been devoted to characterizing statistical properties of the variational proxy to the true posterior distribution; refer to Section 5.2 of [10] for a relatively comprehensive survey of the theoretical literature on variational inference. However, almost all these studies are conducted in a case-by-case manner, by either explicitly analyzing the fixed point equation of the variational optimization problem, or directly analyzing the iterative algorithm for solving the optimization problem. In addition, these analyses require certain structural assumptions on the priors, such as conjugacy, and are not applicable to broader classes of priors.

This article introduces a novel class of variational approximations and studies their large sample convergence properties in a unified framework. The new variational approximation, termed α-VB, introduces a fixed temperature parameter α inside the usual VB objective function which controls the relative trade-off between model-fit and prior regularization. The usual VB approximation is retained as a special case corresponding to α = 1. The α-VB objective function is partly motivated by fractional posteriors [39, 4]; specific connections are drawn in §2. The α-VB procedure also inherits all the computational tractability and scalability from the α = 1 case, and implementation-wise only requires simple modifications to existing variational algorithms.

For α ∈ (0, 1), we develop a novel variational inequality linking the Bayes risk under the variational solution to the α-VB objective function, implying that maximizing the evidence lower bound has the effect of minimizing the Bayes risk within the variational density family. A crucial upshot of this analysis is that point estimates constructed from the variational posterior concentrate at the true parameter at the same rate as those constructed from the actual posterior for a variety of problems. There is now a well-developed literature on the frequentist concentration properties of posterior distributions in nonparametric problems; refer to [34] for a detailed review, and the present paper takes a step towards developing similar general-purpose theoretical guarantees for variational solutions.
We apply our theory to a number of examples where VB is commonly used, including the mean-field variational approximation to high-dimensional Bayesian linear regression with spike and slab priors, mixtures of Gaussian models, latent Dirichlet allocation, and Gaussian-mixture variational approximation to regular parametric models. The α < 1 case is related to the recent work of Alquier and Ridgeway [2] on variational approximations to tempered posteriors; the α-VB objective function considered here incorporates a much broader class of models involving latent variables, and the corresponding variational inequality recovers the risk bound of [2] when no latent variables are present. The variational inequalities for the α < 1 case cannot be transferred to the α = 1 case under a simple limiting operation, and the latter requires a separate treatment under stronger assumptions. In particular, we make use of additional testability assumptions on the likelihood function detailed in §3.2.

We briefly introduce notation that will be used throughout the paper. Let
\[
h(p \,\|\, q) = \Big( \int \big( p^{1/2} - q^{1/2} \big)^2 \, d\mu \Big)^{1/2} \quad \text{and} \quad D(p \,\|\, q) = \int p \, \log(p/q) \, d\mu
\]
denote the Hellinger distance and Kullback–Leibler divergence, respectively, between two probability density functions p and q relative to a common dominating measure µ. We define an additional discrepancy measure V(p ‖ q) = ∫ p log²(p/q) dµ, which will be referred to as the V-divergence. For a set A, we use the notation I_A to denote its indicator function. For any vector µ and positive semidefinite matrix Σ, we use N(µ, Σ) to denote the normal distribution with mean µ and covariance matrix Σ, and use N(θ; µ, Σ) to denote its pdf at θ. For any α ∈ (0, 1), let
\[
D_\alpha(p \,\|\, q) = \frac{1}{\alpha - 1} \, \log \int p^\alpha \, q^{1-\alpha} \, d\mu \tag{1.1}
\]
denote the Rényi divergence of order α. Jensen's inequality implies that D_α(p ‖ q) ≥ 0 for all α ∈ (0, 1), with equality if and only if p = q. The Hellinger distance can be related to the α-divergence with α = 1/2 via
\[
D_{1/2}(p \,\|\, q) = -2 \log \big\{ 1 - (1/2) \, h^2(p \,\|\, q) \big\} \ge h^2(p \,\|\, q),
\]
using the inequality log(1 + t) ≤ t for t > −1.
More details and properties of the α-divergence can be found in [37].

Suppose we have observations Y^n = (Y_1, ..., Y_n) ∈ 𝒴^n, with n denoting the sample size. Let P_θ^{(n)} be the distribution of Y^n given parameter θ ∈ Θ that admits a density p_θ^{(n)} relative to the Lebesgue measure. We will also interchangeably use P(Y^n | θ) and p(Y^n | θ) to denote P_θ^{(n)} and its density function (likelihood function) p_θ^{(n)}. Assume additionally that the likelihood p(Y^n | θ) can be represented as
\[
p(Y^n \mid \theta) = \sum_{s^n} p(Y^n \mid S^n = s^n, \theta) \, p(S^n = s^n \mid \theta),
\]
where S^n denotes a collection of latent or unobserved variables; the superscript n signifies the possible dependence of the number of latent variables on n, for example, when there are observation-specific latent variables. In certain situations, the latent variables may be introduced for purely computational reasons to simplify an otherwise intractable likelihood, such as the latent cluster indicators in a mixture model. Alternatively, a complex probabilistic model p(Y^n | θ) may itself be defined in a hierarchical fashion by first specifying the distribution of the data given latent variables and parameters, and then specifying the latent variable distribution given parameters; examples include latent Dirichlet allocation and many other prominent Bayesian hierarchical models. For ease of presentation, we have assumed discrete latent variables in the above display and continue to do so subsequently, although our development seamlessly extends to continuous latent variables by replacing sums with integrals; further details are provided in a supplemental document.

Let P_θ denote a prior distribution on θ with density function p_θ, and denote W^n = (θ, S^n) ∈ 𝒲^n. In a Bayesian framework, all inference is based on the augmented posterior density p(W^n | Y^n) given by
\[
p(W^n \mid Y^n) = p(\theta, S^n \mid Y^n) \propto p(Y^n \mid \theta, S^n) \, p(S^n \mid \theta) \, p_\theta(\theta). \tag{1.2}
\]
In many cases, p(W^n | Y^n) can be inconvenient for conducting direct analysis due to its intractable normalizing constant, and expensive to sample from due to the slow mixing of standard MCMC algorithms. Variational inference aims to bypass these difficulties by turning the inference problem into an optimization problem, which can be solved by using iterative algorithms such as coordinate descent [7] and alternating minimization.

Let Γ denote a pre-specified family of density functions over 𝒲^n that can either be parameterized by some "variational parameters", or be required to satisfy some structural constraints (see below for examples of Γ). The goal of variational inference is to approximate the conditional density p(W^n | Y^n) by finding the closest member of this family in KL divergence, that is, computing the minimizer
\[
\widehat{q}_{W^n} := \operatorname*{argmin}_{q_{W^n} \in \Gamma} D\big[ q_{W^n}(\cdot) \,\big\|\, p(\cdot \mid Y^n) \big]
= \operatorname*{argmin}_{q_{W^n} \in \Gamma} \bigg\{ - \underbrace{\int_{\mathcal{W}^n} q_{W^n}(w^n) \, \log \frac{p(Y^n \mid w^n) \, p_{W^n}(w^n)}{q_{W^n}(w^n)} \, dw^n}_{L(q_{W^n})} \bigg\}, \tag{1.3}
\]
where the last step follows by using Bayes' rule and the fact that the marginal density p(Y^n) does not depend on W^n and q_{W^n}.
The function L(q_{W^n}) inside the argmin operator above (without the negative sign) is called the evidence lower bound (ELBO, [10]), since it provides a lower bound to the log evidence log p(Y^n):
\[
\log p(Y^n) = L(q_{W^n}) + D\big[ q_{W^n}(\cdot) \,\big\|\, p(\cdot \mid Y^n) \big] \ge L(q_{W^n}), \tag{1.4}
\]
where the equality holds if and only if q_{W^n} = p(· | Y^n). The decomposition (1.4) provides an alternative interpretation of variational inference to the original derivation from Jensen's inequality [25]—minimizing the KL divergence over the variational family Γ is equivalent to maximizing the ELBO over Γ. When Γ is composed of all densities over 𝒲^n, the variational approximation q̂_{W^n} exactly recovers p(W^n | Y^n). In general, the variational family Γ is chosen to balance computational tractability against approximation accuracy. Some common examples of Γ are provided below.
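The identity (1.4) is easy to verify numerically on a discrete toy problem. The following minimal sketch (the joint table and the variational density below are made up solely for illustration) checks that the ELBO and the KL term add up to the log evidence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized values p(w, y) for one fixed observed y, over a discrete w in {0,...,4}.
joint = rng.random(5) + 0.1
evidence = joint.sum()                 # p(y) = sum_w p(w, y)
posterior = joint / evidence           # p(w | y)

q = rng.dirichlet(np.ones(5))          # an arbitrary variational density over w

elbo = np.sum(q * (np.log(joint) - np.log(q)))      # L(q) = E_q[log p(w, y) - log q(w)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))    # D(q || p(. | y))

# Identity (1.4): log p(y) = L(q) + D(q || p(. | y)), so L(q) <= log p(y),
# with equality exactly when q equals the posterior.
assert np.isclose(np.log(evidence), elbo + kl)
print(np.log(evidence), elbo, kl)
```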
Example: (Exponential variational family) When there is no latent variable and W^n = θ ∈ Θ corresponds to the parameter in the model, a popular choice of the variational family is an exponential family of distributions. Among the exponential variational families, the Gaussian variational family, q_θ(θ; µ, Σ) ≡ N(θ; µ, Σ) for θ ∈ R^d, is the most widely used owing to the Bernstein–von Mises theorem (Section 10.2 of [36]), which states that for regular parametric models, the posterior distribution converges to a Gaussian limit relative to the total variation metric as the sample size tends to infinity. There are also some recent developments that replace the single Gaussian with a Gaussian mixture as the variational family to improve the finite-sample approximation [47], which is useful when the posterior distribution is skewed or far from Gaussian at the given sample size.
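As a toy illustration of fitting a member of the Gaussian variational family, the following sketch maximizes a reparameterized Monte Carlo estimate of the ELBO over (µ, σ) for a made-up one-dimensional unnormalized log posterior; the target `log_joint` and all numbers are hypothetical, not part of the paper's development.

```python
import numpy as np
from scipy.optimize import minimize

def log_joint(theta):
    # Made-up unnormalized log p(Y^n, theta): a mildly skewed, unimodal target.
    return -0.5 * theta**2 + np.log1p(0.5 * np.tanh(theta))

eps = np.random.default_rng(1).standard_normal(2000)   # fixed base samples

def neg_elbo(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps                 # reparameterized draws from q = N(mu, sigma^2)
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_sigma
    return -(log_joint(theta).mean() + entropy)   # ELBO = E_q[log p(Y^n, theta)] + H(q)

res = minimize(neg_elbo, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
print("variational mean and sd:", res.x[0], np.exp(res.x[1]))
```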
Example: (Mean-field variational family) Suppose that W^n can be decomposed into m components (or blocks) as W^n = (W_1, W_2, ..., W_m) for some m > 1,
where each component W_j ∈ 𝒲_j can be multidimensional. The mean-field variational family Γ^MF is composed of all density functions over 𝒲^n = ∏_{j=1}^m 𝒲_j that factorize as
\[
q_{W^n}(w^n) = \prod_{j=1}^{m} q_{W_j}(w_j), \qquad w^n = (w_1, \ldots, w_m) \in \mathcal{W}^n,
\]
where each variational factor q_{W_j} is a density function over 𝒲_j for j = 1, ..., m. A natural mean-field decomposition is to let q_{W^n}(w^n) = q_θ(θ) q_{S^n}(s^n), with q_{S^n} often further decomposed as q_{S^n}(s^n) = ∏_{i=1}^n q_{S_i}(s_i).

Note that we have not specified the parametric form of the individual variational factors, which are determined by properties of the model—in some cases, the optimal q_{W_j} is in the same parametric family as the conditional distribution of W_j given the parameter. The corresponding mean-field variational approximation q̂_{W^n}, which is necessarily of the form ∏_{j=1}^m q̂_{W_j}(w_j), can be computed via the coordinate ascent variational inference (CAVI) algorithm [7, 10], which iteratively optimizes each variational factor keeping the others fixed at their present values, and resembles the EM algorithm in the presence of latent variables; a small sketch is given below.

The mean-field variational family can be further constrained by restricting each factor q_{W_j} to belong to a parametric family, such as the exponential family in the previous example. In particular, it is common practice to restrict the variational density q_θ of the parameter to a structured family (for example, the mean-field family if θ is multi-dimensional), which will be denoted by Γ_θ in the sequel.
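The following sketch runs CAVI on a hypothetical conjugate toy model (a normal with unknown mean µ and precision τ, not one of this paper's examples), where both optimal factors have closed form and the algorithm simply alternates the two updates; all priors and numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(1.5, 0.7, size=200)        # data from N(mu, tau^{-1})
n, ybar = len(y), y.mean()
lam0, a0, b0 = 1.0, 1.0, 1.0              # priors: mu ~ N(0, (lam0*tau)^{-1}), tau ~ Gamma(a0, b0)

a = a0 + (n + 1) / 2.0                    # shape of q(tau) is fixed by its update
b, m, s2 = 1.0, 0.0, 1.0
for _ in range(50):
    # update q(mu) = N(m, s2) given q(tau) = Gamma(a, b)
    e_tau = a / b                         # E_q[tau]
    m = n * ybar / (lam0 + n)             # prior mean is 0
    s2 = 1.0 / ((lam0 + n) * e_tau)
    # update q(tau) given q(mu), using E_q[mu] = m and E_q[mu^2] = m^2 + s2
    e_mu2 = m**2 + s2
    b = b0 + 0.5 * (np.sum(y**2) - 2 * n * ybar * m + n * e_mu2 + lam0 * e_mu2)
print("q(mu) mean:", m, "  E_q[tau]:", a / b)
```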
The rest of the paper is organized as follows. In §2, we introduce the α-VB objective function and relate it to the usual VB. In §3, we develop variational risk bounds for the α-VB solution. In §4, we apply the theory to concrete examples. We conclude with a discussion in §5. All proofs and some additional discussions are provided in a separate supplemental document. The supplemental document also contains a detailed simulation study.

2 The α-VB procedure

Before introducing the proposed family of objective functions, we first represent the KL term D[q_{W^n}(·) ‖ p(· | Y^n)] in a more convenient form which provides intuition into how VB works in the presence of latent variables and aids our subsequent theoretical development.

2.1 Reformulating the variational objective

To aid our subsequent development, we introduce some additional notation and make some simplifying assumptions. First, we decompose θ = (µ, π), with p(Y^n | S^n = s^n, θ) = p(Y^n | S^n = s^n, µ) and π_{s^n} := p(S^n = s^n | θ). In other words, µ is the parameter characterizing the conditional distribution P(Y^n | S^n, µ) of the observation Y^n given the latent variable S^n, and π = (π_{s^n}) characterizes the distribution P(S^n | π) of the latent variables. We shall also assume the mean-field decomposition
\[
q_{W^n}(w^n) = q_\theta(\theta) \, q_{S^n}(s^n) \tag{2.1}
\]
throughout, and let Γ = Γ_θ × Γ_{S^n} denote the class of such product variational distributions. When necessary subsequently, we shall further assume q_{S^n}(s^n) = ∏_{i=1}^n q_{S_i}(s_i) and q_θ(θ) = q_µ(µ) q_π(π), which however is not immediately necessary for this subsection.

The KL divergence D[q_{W^n}(·) ‖ p(· | Y^n)] in (1.3) involves both parameters and latent variables. Separating out the KL divergence for the parameter part leads to the equivalent representation
\[
D\big[ q_{W^n}(\cdot) \,\big\|\, p(\cdot \mid Y^n) \big] = D( q_\theta \,\|\, p_\theta ) - \int_\Theta \underbrace{\bigg[ \sum_{s^n} q_{S^n}(s^n) \, \log \frac{p(Y^n \mid \mu, s^n) \, \pi_{s^n}}{q_{S^n}(s^n)} \bigg]}_{\widehat{\ell}_n(\theta)} \, q_\theta(d\theta) + \log p(Y^n). \tag{2.2}
\]
Observe that, using the concavity of x ↦ log x and Jensen's inequality,
\[
\log p(Y^n \mid \theta) = \log \bigg[ \sum_{s^n} q_{S^n}(s^n) \, \frac{p(Y^n \mid \mu, s^n) \, \pi_{s^n}}{q_{S^n}(s^n)} \bigg] \ge \sum_{s^n} q_{S^n}(s^n) \, \log \frac{p(Y^n \mid \mu, s^n) \, \pi_{s^n}}{q_{S^n}(s^n)}.
\]
The quantity ℓ̂_n(θ) in (2.2) can therefore be recognized as an approximation (from below) to the log likelihood ℓ_n(θ) := log p(Y^n | θ) in terms of the latent variables. Define an average Jensen gap ∆_J due to the variational approximation to the log-likelihood,
\[
\Delta_J(q_\theta, q_{S^n}) = \int_\Theta \big[ \ell_n(\theta) - \widehat{\ell}_n(\theta) \big] \, q_\theta(d\theta) \ge 0.
\]
With this, write the KL divergence D[q_{W^n}(·) ‖ p(· | Y^n)] as
\[
D\big[ q_{W^n}(\cdot) \,\big\|\, p(\cdot \mid Y^n) \big] = - \int_\Theta \ell_n(\theta) \, q_\theta(d\theta) + \Delta_J(q_\theta, q_{S^n}) + D( q_\theta \,\|\, p_\theta ) + \log p(Y^n), \tag{2.3}
\]
which splits, apart from the constant log p(Y^n) that does not involve the variational distributions, as a sum of three terms: an integrated (with respect to the variational distribution) negative log-likelihood, the KL divergence between the variational distribution q_θ and the prior p_θ for θ, and the Jensen gap ∆_J due to the latent variables. In particular, the role of the latent variable variational distribution q_{S^n} is conveniently confined to ∆_J.

Another view of the above is an equivalent formulation of the ELBO decomposition (1.4),
\[
\log p(Y^n) = L(q_{W^n}) + \Delta_J(q_\theta, q_{S^n}) + D\big[ q_\theta(\theta) \,\big\|\, p(\theta \mid Y^n) \big], \tag{2.4}
\]
which readily follows since
\[
D\big[ q_\theta(\theta) \,\big\|\, p(\theta \mid Y^n) \big] = - \int_\Theta \ell_n(\theta) \, q_\theta(d\theta) + D( q_\theta \,\|\, p_\theta ) + \log p(Y^n).
\]
Thus, in latent variable models, maximizing the ELBO L(q_{W^n}) is equivalent to minimizing the sum of the Jensen gap ∆_J and the KL divergence between the variational density and the posterior density of the parameters. When there is no likelihood approximation with latent variables, ∆_J = 0.

2.2 The α-VB objective function

Here and in the rest of the paper, we adopt the frequentist perspective by assuming that there is a true data generating model P_{θ*}^{(n)} that generates the data Y^n, and θ* will be referred to as the true parameter, or simply the truth. Let ℓ_n(θ, θ*) = ℓ_n(θ) − ℓ_n(θ*) be the log-likelihood ratio. Define
\[
\Psi_n(q_\theta, q_{S^n}) = - \int_\Theta \ell_n(\theta, \theta^*) \, q_\theta(d\theta) + \Delta_J(q_\theta, q_{S^n}) + D( q_\theta \,\|\, p_\theta ), \tag{2.5}
\]
and observe that Ψ_n differs from the KL divergence D[q_{W^n}(·) ‖ p(· | Y^n)] in (2.3) only by ℓ_n(θ*) and the constant log p(Y^n), neither of which involves the variational densities. Hence, minimizing D[q_{W^n}(·) ‖ p(· | Y^n)] is equivalent to minimizing Ψ_n(q_θ, q_{S^n}). We note here that the introduction of the ℓ_n(θ*) term is to develop theoretical intuition, and the actual minimization does not require knowledge of θ*.

The objective function Ψ_n in (2.5) elucidates the trade-off between model-fit and fidelity to the prior underlying a variational approximation, which is akin to the classical bias-variance trade-off for shrinkage or penalized estimators. The model-fit term consists of two constituents: the first term is an averaged (with respect to the variational distribution) log-likelihood ratio, which tends to get small as the variational distribution q_θ places more mass near the true parameter θ*, while the second term is the Jensen gap ∆_J due to the variational approximation with the latent variables. On the other hand, the regularization or penalty term D(q_θ ‖ p_θ) prevents over-fitting to the data by constricting the KL divergence between the variational solution and the prior.

In this article, we study a wider class of variational objective functions Ψ_{n,α} indexed by a scalar parameter α ∈ (0, 1] which encompasses the usual VB,
\[
\Psi_{n,\alpha}(q_\theta, q_{S^n}) = \underbrace{ - \int_\Theta \ell_n(\theta, \theta^*) \, q_\theta(d\theta) + \Delta_J(q_\theta, q_{S^n}) }_{\text{model fit}} + \underbrace{ \alpha^{-1} \, D( q_\theta \,\|\, p_\theta ) }_{\text{regularization}}, \tag{2.6}
\]
and define the α-VB solution as
\[
(\widehat{q}_{\theta,\alpha}, \widehat{q}_{S^n,\alpha}) = \operatorname*{argmin}_{(q_\theta, q_{S^n}) \in \Gamma} \Psi_{n,\alpha}(q_\theta, q_{S^n}). \tag{2.7}
\]
Observe that the α-VB criterion Ψ_{n,α} differs from Ψ_n only in the regularization term, where the inverse temperature parameter α controls the amount of regularization, with smaller α implying a stronger penalty. When α = 1, Ψ_{n,α} reduces to the usual variational objective function Ψ_n in (2.5), and we shall denote the solution of (2.7) by q̂_θ and q̂_{S^n} as before. As we shall see in the sequel, the introduction of the temperature parameter α substantially simplifies the theoretical analysis and allows one to certify (near-)minimax optimality of the α-VB solution for α < 1, while the usual VB (α = 1) requires more intricate testing arguments.

The α-VB solution can also be interpreted as the minimizer of a certain divergence function between the product variational distribution q_θ(θ) × q_{S^n}(s^n) and the joint α-fractional posterior distribution [4] of (θ, S^n),
\[
P_\alpha(\theta \in B, s^n \mid Y^n) = \frac{ \int_B \big[ p(Y^n \mid \mu, s^n) \, \pi_{s^n} \big]^\alpha \, p_\theta(\theta) \, d\theta }{ \int_\Theta \sum_{s^n} \big[ p(Y^n \mid \mu, s^n) \, \pi_{s^n} \big]^\alpha \, p_\theta(\theta) \, d\theta }, \tag{2.8}
\]
which is obtained by raising the joint likelihood of (θ, s^n) to the fractional power α, and combining with the prior p_θ using Bayes' rule. We shall use p_α(· | Y^n) to denote the fractional posterior density. The fractional posterior is a specific example of a Gibbs posterior [24] and shares a nice coherence property with the usual posterior when viewed as a mechanism for updating beliefs [8].

Proposition 2.1 (Connection with fractional posteriors). The α-VB solution (q̂_{θ,α}, q̂_{S^n,α}) satisfies
\[
(\widehat{q}_{\theta,\alpha}, \widehat{q}_{S^n,\alpha}) = \operatorname*{argmin}_{(q_\theta, q_{S^n}) \in \Gamma} \Big[ D\big[ q_{W^n}(\cdot) \,\big\|\, p_\alpha(\cdot \mid Y^n) \big] + (1 - \alpha) \, H(q_{S^n}) \Big],
\]
where H(q_{S^n}) = − ∑_{s^n} q_{S^n}(s^n) log q_{S^n}(s^n) is the entropy of q_{S^n}, and p_α(· | Y^n) is the joint α-fractional posterior density of w^n = (θ, s^n).

The entropy term H(q_{S^n}) encourages the latent-variable variational density q_{S^n} to concentrate towards the uniform distribution, in addition to minimizing the KL divergence between q_{W^n}(·) and p_α(· | Y^n). In particular, if there are no latent variables, the entropy term disappears and the objective function reduces to a KL divergence between q_θ and p_α(θ | Y^n).

We conclude this section by remarking that the additive decomposition of the model-fit term in (2.6) provides a peek into why mean-field approximations work for latent variable models, since the roles of the variational density q_{S^n} for the latent variables and q_θ for the model parameters are de-coupled. Roughly speaking, a good choice of q_{S^n} should aim to make the Jensen gap ∆_J small, while the choice of q_θ should balance the integrated log-likelihood ratio and the penalty term. This point is crucial for the theoretical analysis.
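Implementation-wise, the modification required by (2.6) amounts to tempering the likelihood: multiplying Ψ_{n,α} through by α shows that α-VB maximizes the usual ELBO with log p(Y^n | θ) replaced by α log p(Y^n | θ), consistent with Proposition 2.1. The following sketch checks this on a hypothetical conjugate normal-mean model (no latent variables), where the α-fractional posterior is available in closed form; the model and all numbers are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
y = rng.normal(2.0, 1.0, size=50)        # Y_i ~ N(theta, 1), prior theta ~ N(0, 1)
n, alpha = len(y), 0.5
eps = rng.standard_normal(4000)          # fixed base samples for q = N(mu, sigma^2)

def neg_alpha_elbo(params):
    mu, log_sigma = params
    sigma = np.exp(log_sigma)
    theta = mu + sigma * eps
    loglik = -0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1)
    logprior = -0.5 * theta**2
    entropy = 0.5 * np.log(2 * np.pi * np.e) + log_sigma
    # tempered ELBO: E_q[alpha * log p(Y^n | theta) + log p(theta)] + H(q)
    return -(alpha * loglik.mean() + logprior.mean() + entropy)

res = minimize(neg_alpha_elbo, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
# exact alpha-fractional posterior: N(alpha*n*ybar/(alpha*n + 1), 1/(alpha*n + 1))
print("alpha-VB mean:", res.x[0], " exact:", alpha * n * y.mean() / (alpha * n + 1))
print("alpha-VB var: ", np.exp(2 * res.x[1]), " exact:", 1.0 / (alpha * n + 1))
```

Since the Gaussian family here contains the exact α-fractional posterior, the fitted variational mean and variance should match the closed-form values up to Monte Carlo error.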
3 Variational risk bounds for α-VB

In this section, we investigate concentration properties of the α-VB posterior under a frequentist framework assuming the existence of a true data generating parameter θ*. We first focus on the α < 1 case, and then treat the α = 1 case. The main take-away message from our theoretical results below is that under fairly general conditions, the α-VB procedure concentrates at the true parameter at the same rate as the actual posterior, and as a result, point estimates obtained from α-VB can provide rate-optimal frequentist estimators. These results thus complement the empirical success of VB in a wide variety of models.

We present our results in the form of Bayes risk bounds for the variational distribution. Specifically, for a suitable loss function r(θ, θ*), we aim to obtain a high-probability (under the data generating distribution P_{θ*}^{(n)}) upper bound on the variational risk
\[
\int r(\theta, \theta^*) \, \widehat{q}_{\theta,\alpha}(d\theta). \tag{3.1}
\]
In particular, if r(·, ·) is convex in its first argument, then the above risk bound immediately translates into a risk bound for the α-VB point estimate θ̂_{VB,α} = ∫ θ q̂_{θ,α}(dθ) using Jensen's inequality:
\[
r(\widehat{\theta}_{VB,\alpha}, \theta^*) \le \int r(\theta, \theta^*) \, \widehat{q}_{\theta,\alpha}(d\theta).
\]
Specifically, our goal will be to establish general conditions under which θ̂_{VB,α} concentrates around θ* at the minimax rate for the particular problem.

3.1 The α < 1 case

We use the shorthand
\[
\frac{1}{n} D_\alpha^{(n)}(\theta, \theta^*) := \frac{1}{n} D_\alpha\big[ p_\theta^{(n)} \,\big\|\, p_{\theta^*}^{(n)} \big]
\]
to denote the averaged α-divergence between P_θ^{(n)} and P_{θ*}^{(n)}. We adopt the theoretical framework of [4] and use this divergence as our loss function r(θ, θ*) for measuring the closeness between any θ ∈ Θ and the truth θ*. Note that in the case of i.i.d. observations, this averaged divergence n^{-1} D_α^{(n)}(θ, θ*) simplifies to D_α[p_θ ‖ p_{θ*}], which is stronger than the squared Hellinger distance h²[p_θ ‖ p_{θ*}] between p_θ and p_{θ*} for any fixed α ∈ [1/2, 1).

Theorem 3.1 (Variational risk bound). Recall the α-VB objective function Ψ_{n,α}(q_θ, q_{S^n}) from (2.6). For any ζ ∈ (0, 1), it holds with P_{θ*}^{(n)} probability at least (1 − ζ) that for any probability measure q_θ ∈ Γ_θ with q_θ ≪ p_θ and any probability measure q_{S^n} ∈ Γ_{S^n} on S^n,
\[
\int \frac{1}{n} D_\alpha^{(n)}(\theta, \theta^*) \, \widehat{q}_{\theta,\alpha}(\theta) \, d\theta \le \frac{\alpha}{n(1-\alpha)} \, \Psi_{n,\alpha}(q_\theta, q_{S^n}) + \frac{1}{n(1-\alpha)} \, \log(1/\zeta).
\]

Here and elsewhere, the probability statement is uniform over all (q_θ, q_{S^n}) ∈ Γ. Theorem 3.1 links the variational Bayes risk for the α-divergence to the objective function Ψ_{n,α} in (2.6). As a consequence, minimizing Ψ_{n,α} in (2.6) has the same effect as minimizing the variational Bayes risk. To apply Theorem 3.1 to various problems, we now discuss strategies to further analyze and simplify Ψ_{n,α} under appropriate structural constraints on Γ_θ and Γ_{S^n}. To that end, we make some simplifying assumptions.

First, we assume a further mean-field decomposition q_{S^n}(s^n) = ∏_{i=1}^n q_{S_i}(s_i) for the latent variables S^n, where each factor q_{S_i} is restriction-free. Second, the inconsistency of the mean-field approximation for state-space models proved in [40] indicates that this mean-field approximation for the latent variables may not generally work for non-independent observations with non-independent latent variables. For this reason, we assume that the observation latent variable pairs (S_i, Y_i) are mutually independent across i = 1, 2, ..., n. In fact, we assume that (S_i, Y_i) are i.i.d. copies of (S, Y) whose density function is given by p(S, Y | µ, π) = p(Y | S, µ) p(S | π).
Following earlier notation, let π_S := p(S | π) denote the probability mass function of the i.i.d. discrete latent variables {S_i}, with the parameter π = (π_1, π_2, ..., π_K) residing in the K-dim simplex S_K = {π ∈ [0, 1]^K : ∑_k π_k = 1}. Finally, we assume the variational family Γ_θ of the parameter decomposes into Γ_µ ⊗ S_K, where Γ_µ denotes the variational family for the parameter µ.

Let p(Y | θ) = ∑_{s=1}^K π_s p(Y | θ, S = s) denote the marginal probability density function of the i.i.d. observations {Y_i}. The i.i.d. assumption implies a simplified structure for various quantities encountered before, e.g. π_{S^n} = ∏_{i=1}^n π_{s_i}, p(Y^n | µ, S^n) = ∏_{i=1}^n p(Y_i | µ, S_i) and p(Y^n | θ) = ∏_{i=1}^n p(Y_i | θ). Moreover, under these assumptions, n^{-1} D_α^{(n)}(θ, θ*) = D_α[p(· | θ) ‖ p(· | θ*)].

As discussed in the previous subsection, the decoupling of the roles of q_θ and q_{S^n} in the model fit term aids bounding Ψ_{n,α}. Specifically, we first choose a q̃_{S^n} which controls the Jensen gap ∆_J, and then make a choice of q_θ which controls Ψ_{n,α}(q_θ, q̃_{S^n}). The choice of q_θ requires a delicate balance between placing enough mass near θ* and controlling the KL divergence from the prior.

For a fixed q_θ, if we choose q_{S^n} to be the full conditional distribution of S^n given θ, i.e.,
\[
q_{S^n}(s^n \mid \theta) = \prod_{i=1}^{n} q_{S_i}(s_i \mid \theta) = \prod_{i=1}^{n} \frac{\pi_{s_i} \, p(Y_i \mid \mu, s_i)}{p(Y_i \mid \theta)}, \qquad s^n \in \{1, 2, \ldots, K\}^n,
\]
then the normalizing constant of q_{S_i}(· | θ) is ∑_{s_i} π_{s_i} p(Y_i | µ, s_i) = p(Y_i | θ), and as a result the Jensen gap ∆_J = 0. The mean-field approximation precludes us from choosing q_{S^n} dependent on θ, and hence the Jensen gap cannot be made exactly zero in general. However, this naturally suggests replacing θ by θ* in the above display and choosing q̃_{S_i} ∝ π*_{s_i} p(Y_i | µ*, s_i). This leads us to the following corollary.

Corollary 3.2 (i.i.d. observations). It holds with P_{θ*}^{(n)} probability at least (1 − ζ) that for any probability measure q_θ ∈ Γ_θ with q_θ ≪ p_θ,
\[
\int \Big\{ D_\alpha\big[ p(\cdot \mid \theta) \,\big\|\, p(\cdot \mid \theta^*) \big] \Big\} \, \widehat{q}_{\theta,\alpha}(\theta) \, d\theta
\le \frac{\alpha}{n(1-\alpha)} \, \Psi_{n,\alpha}(q_\theta, \widetilde{q}_{S^n}) + \frac{1}{n(1-\alpha)} \, \log(1/\zeta)
\]
\[
= \frac{\alpha}{n(1-\alpha)} \bigg[ - \int_\Theta \sum_{i=1}^{n} \sum_{s_i} \widetilde{q}_{S_i}(s_i) \, \log \frac{p(Y_i \mid \mu, s_i) \, \pi_{s_i}}{p(Y_i \mid \mu^*, s_i) \, \pi^*_{s_i}} \, q_\theta(d\theta) + \frac{D(q_\theta \,\|\, p_\theta)}{\alpha} + \frac{\log(1/\zeta)}{\alpha} \bigg], \tag{3.2}
\]
where q̃_{S^n} is the probability distribution over S^n defined as
\[
\widetilde{q}_{S^n}(s^n) = \prod_{i=1}^{n} \widetilde{q}_{S_i}(s_i) = \prod_{i=1}^{n} \frac{\pi^*_{s_i} \, p(Y_i \mid \mu^*, s_i)}{p(Y_i \mid \theta^*)}, \qquad s^n \in \{1, 2, \ldots, K\}^n. \tag{3.3}
\]
The second line of (3.2) follows from the first since
\[
\Delta_J(q_\theta, \widetilde{q}_{S^n}) = - \int_\Theta \sum_{i=1}^{n} \sum_{s_i} \widetilde{q}_{S_i}(s_i) \, \log \frac{p(Y_i \mid \mu, s_i) \, \pi_{s_i}}{p(Y_i \mid \mu^*, s_i) \, \pi^*_{s_i}} \, q_\theta(d\theta) + \int \ell_n(\theta, \theta^*) \, q_\theta(d\theta).
\]
After choosing q̃_{S^n} as in (3.3) in Corollary 3.2, we can make the first term on the r.h.s. of (3.2) small by choosing the variational factor q_θ of θ concentrated around θ*.
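The role of the full conditional choice above is transparent numerically: for a toy two-component mixture (all numbers made up), the per-observation lower bound matches the exact log marginal only at the posterior responsibilities, and is strictly smaller for any other choice of q_{S_i}.

```python
import numpy as np
from scipy.stats import norm

pi, mu = np.array([0.3, 0.7]), np.array([-1.0, 2.0])   # toy mixture parameters
y = 0.5

comp = pi * norm.pdf(y, loc=mu, scale=1.0)             # pi_s * p(y | mu_s)
marg = comp.sum()                                      # p(y | theta)

def lower_bound(q):
    # sum_s q(s) log[ pi_s p(y | mu_s) / q(s) ]: per-observation version of l-hat
    return np.sum(q * (np.log(comp) - np.log(q)))

resp = comp / marg                                     # full conditional q(s) = p(s | y, theta)
assert np.isclose(lower_bound(resp), np.log(marg))       # Jensen gap is exactly zero
assert lower_bound(np.array([0.5, 0.5])) < np.log(marg)  # any other q leaves a positive gap
print(lower_bound(resp), np.log(marg))
```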
In the rest of this subsection, we will apply Corollary 3.2 to derive more concrete variational Bayes risk bounds under some further simplifying assumptions.

As a first application, assume there is no latent variable in the model, that is, W^n = θ = µ. As discussed before, the α-VB solution in this case coincides with the nearest KL point to the α-fractional posterior of the parameter. A reviewer pointed out a recent preprint by Alquier and Ridgeway [2] where they exploit risk bounds for fractional posteriors developed in [4] to analyze tempered posteriors and their variational approximations, which coincide with the α-VB solution when W^n = θ. The following Theorem 3.3 arrives at a similar conclusion to Corollary 2.3 of [2]. We reiterate here that our main motivation is models with latent variables not considered in [2], and Theorem 3.3 follows as a corollary of our general result in Theorem 3.1.

Theorem 3.3 (No latent variable). It holds with P_{θ*}^{(n)} probability at least (1 − ζ) that for any probability measure q_θ ∈ Γ_θ with q_θ ≪ p_θ,
\[
\int \Big\{ D_\alpha\big[ p(\cdot \mid \theta) \,\big\|\, p(\cdot \mid \theta^*) \big] \Big\} \, \widehat{q}_{\theta,\alpha}(\theta) \, d\theta
\le \frac{\alpha}{n(1-\alpha)} \bigg[ - \int_\Theta \log \frac{p(Y^n \mid \theta)}{p(Y^n \mid \theta^*)} \, q_\theta(\theta) \, d\theta + \frac{D(q_\theta \,\|\, p_\theta)}{\alpha} + \frac{\log(1/\zeta)}{\alpha} \bigg]. \tag{3.4}
\]
We will illustrate some particular choices of q_θ for typical variational families Γ_θ in the examples in Section 4.

As a second application, we consider the special case when Γ_θ is restriction-free, which is an ideal example for conveying the general idea of how to choose q_θ to control the upper bound in (3.2). To that end, define two KL neighborhoods around (π*, µ*) with radii (ε_π, ε_µ) as
\[
B_n(\pi^*, \varepsilon_\pi) = \Big\{ D(\pi^* \,\|\, \pi) \le \varepsilon_\pi^2, \; V(\pi^* \,\|\, \pi) \le \varepsilon_\pi^2 \Big\},
\]
\[
B_n(\mu^*, \varepsilon_\mu) = \Big\{ \sup_s D\big[ p(\cdot \mid \mu^*, s) \,\big\|\, p(\cdot \mid \mu, s) \big] \le \varepsilon_\mu^2, \; \sup_s V\big[ p(\cdot \mid \mu^*, s) \,\big\|\, p(\cdot \mid \mu, s) \big] \le \varepsilon_\mu^2 \Big\}, \tag{3.5}
\]
where we used the shorthand D(π* ‖ π) = ∑_s π*_s log(π*_s / π_s) to denote the KL divergence between categorical distributions with parameters π* ∈ S_K and π ∈ S_K in the K-dim simplex S_K. By choosing q_θ as the restriction of p_θ to B_n(π*, ε_π) × B_n(µ*, ε_µ), we obtain the following theorem. Here, we make the assumption of independent priors on µ and π, i.e., p_θ = p_µ ⊗ p_π, to simplify the presentation.

Theorem 3.4 (Parameter restriction-free). For any fixed (ε_π, ε_µ) ∈ (0, 1)² and D > 1, with P_{θ*}^{(n)} probability at least 1 − 2/{(D − 1)² n (ε_π² + ε_µ²)}, it holds that
\[
\int \Big\{ D_\alpha\big[ p(\cdot \mid \theta) \,\big\|\, p(\cdot \mid \theta^*) \big] \Big\} \, \widehat{q}_{\theta,\alpha}(d\theta)
\le \frac{D\alpha}{1-\alpha} \, (\varepsilon_\pi^2 + \varepsilon_\mu^2)
+ \Big\{ - \frac{1}{n(1-\alpha)} \, \log P_\pi\big[ B_n(\pi^*, \varepsilon_\pi) \big] \Big\}
+ \Big\{ - \frac{1}{n(1-\alpha)} \, \log P_\mu\big[ B_n(\mu^*, \varepsilon_\mu) \big] \Big\}. \tag{3.6}
\]
Although the results in this section assume discrete latent variables, similar results can be seamlessly obtained for continuous latent variables; see the supplemental document for more details. We will apply this theorem to analyse the mean-field approximations for the Gaussian mixture model and latent Dirichlet allocation in Section 4.

Observe that the variational risk bound in Theorem 3.4 depends only on the prior mass assigned to appropriate KL neighborhoods of the truth.
This renders an application of Theorem 3.4 to various practical problems particularly straightforward. As we shall see in the next subsection, the α = 1 case, i.e. the regular VB, requires more stringent conditions involving the existence of exponentially consistent tests to separate points in the parameter space. The testing condition is even necessary for the actual posterior to contract; see, e.g., [4], and hence one cannot avoid the testing assumption for its usual variational approximation. Nevertheless, we show below that once the existence of such tests can be verified, the regular VB approximation can also be shown to contract optimally.

3.2 The α = 1 case

We consider any loss function r(θ, θ*) satisfying the following assumption.

Assumption T (Statistical identifiability):
For some ε_n > 0 and any ε ≥ ε_n, there exists a sieve set F_{n,ε} ⊂ Θ and a test function φ_{n,ε}: 𝒴^n → [0, 1] such that
\[
P_\theta( \mathcal{F}^c_{n,\varepsilon} ) \le e^{ - c_1 n \varepsilon^2 }, \tag{3.7}
\]
\[
E_{\theta^*}[ \varphi_{n,\varepsilon} ] \le e^{ - c_2 n \varepsilon_n^2 }, \tag{3.8}
\]
\[
E_\theta[ 1 - \varphi_{n,\varepsilon} ] \le e^{ - c_2 n \, r(\theta, \theta^*) }, \quad \forall \, \theta \in \mathcal{F}_{n,\varepsilon} \text{ satisfying } r(\theta, \theta^*) \ge \varepsilon^2. \tag{3.9}
\]
Roughly speaking, the sieve set F_{n,ε} can be viewed as the effective support of the prior distribution at sample size n, and ε_n the contraction rate of the usual posterior distribution. The first condition (3.7) allows us to restrict attention to this important region in the parameter space that is not too large, but still possesses most of the prior mass. The last two conditions (3.8) and (3.9) ensure the statistical identifiability of the parameter under the loss r(·, ·) through the existence of a test function φ_{n,ε}, and require a sufficiently fast decay of the Type I/II errors. In the case when Θ is compact and r(θ, θ′) = h²(θ ‖ θ*) is the squared Hellinger distance between p_θ and p_{θ*}, such a test φ_{n,ε} always exists [17]. A similar set of assumptions is used for showing the concentration of the usual posterior (for example, see [18] and [17]), with the existence of such sieve sets and test functions verified for numerous model-prior combinations. The only difference in our case is that Assumption T requires the existence of the pair (F_{n,ε}, φ_{n,ε}) for all ε ≥ ε_n, not just at ε = ε_n. However, this extra requirement appears mild since in most cases a construction of (F_{n,ε}, φ_{n,ε}) at ε = ε_n naturally extends to any ε ≥ ε_n.

Our main result for the usual VB (α = 1) provides a finite-sample upper bound to the variational Bayes risk for any loss function r(θ, θ*) satisfying Assumption T. Here, we use Q_θ to denote the probability distribution associated with any member q_θ of the variational density family Γ.

Theorem 3.5.
Under Assumption T, for any ε ≥ ε_n, it holds with P_{θ*}^{(n)} probability at least 1 − 2 e^{−c_2 n ε_n²/2} that for any probability measure q_θ ∈ Γ_θ with q_θ ≪ p_θ and any probability measure q_{S^n} ∈ Γ_{S^n} on S^n,
\[
\frac{1}{n} \bigg[ \widehat{Q}_\theta( \mathcal{F}^c_{n,\varepsilon} ) \, \log \frac{ \widehat{Q}_\theta( \mathcal{F}^c_{n,\varepsilon} ) }{ P_\theta( \mathcal{F}^c_{n,\varepsilon} ) } + \big( 1 - \widehat{Q}_\theta( \mathcal{F}^c_{n,\varepsilon} ) \big) \, \log \frac{ 1 - \widehat{Q}_\theta( \mathcal{F}^c_{n,\varepsilon} ) }{ 1 - P_\theta( \mathcal{F}^c_{n,\varepsilon} ) } \bigg]
+ c_2 \int_{ \theta \in \mathcal{F}_{n,\varepsilon}, \; r(\theta, \theta^*) \ge \varepsilon^2 } r(\theta, \theta^*) \, \widehat{q}_\theta(\theta) \, d\theta
\le \frac{1}{n} \, \Psi_n(q_\theta, q_{S^n}) + c \, \varepsilon_n^2. \tag{3.10}
\]
The first term on the l.h.s. of inequality (3.10) relates the variational complementary probability Q̂_θ(F^c_{n,ε}) to the prior complementary probability P_θ(F^c_{n,ε}). As a consequence, an upper bound on this term controls the remaining variational probability mass outside the sieve F_{n,ε}. The second term ∫_{θ∈F_{n,ε}, r(θ,θ*)≥ε²} r(θ, θ*) q̂_θ(θ) dθ in (3.10) is the variational Bayes risk over the intersection between F_{n,ε} and the set of all θ such that the loss r(θ, θ*) is at least ε².

In [32], we proved a risk bound for the α = 1 case under the much stronger assumptions of a compact parameter space and the existence of a global test φ_n with Type I and II error rates bounded above by e^{−C n ε_n²}. Under those assumptions, the result in [32] can be recovered from our more general result in Theorem 3.5 by setting F_{n,ε} = Θ and φ_{n,ε} = φ_n, the global test, for all ε. Such stronger assumptions usually hold when the parameter space Θ is a compact subset of the Euclidean space — however, in other cases, such as unbounded parameter spaces or infinite dimensional functional spaces, such a global test function φ_n may not exist, signifying the necessity of Theorem 3.5. We also point out the preprint [46], which appeared while this manuscript was in revision, where they consider the usual VB and their analysis is based on a direct application of the variational lemma in the proof of Theorem 3.1 in the supplementary document. However, their results require a stronger prior concentration condition and their analysis does not involve latent variable models.

Similar to the development for α <
1, we can further simplify Ψ_n by introducing more assumptions. Due to the space constraint, we only provide a counterpart of Theorem 3.4 under the assumptions made therein. Recall the definition of the two KL neighbourhoods B_n(π*, ε) and B_n(µ*, ε) defined in (3.5).

Assumption P (Prior concentration): There exists some constant C > 0 such that
\[
P_\theta\big( B_n(\pi^*, \varepsilon_n) \big) \ge \exp\big( - C n \varepsilon_n^2 \big) \quad \text{and} \quad P_\theta\big( B_n(\mu^*, \varepsilon_n) \big) \ge \exp\big( - C n \varepsilon_n^2 \big).
\]
Under Assumptions T and P, Theorem 3.5 leads to a high probability bound on the variational Bayes risk for the loss r(θ, θ*), as summarized in the following theorem.

Theorem 3.6 (Parameter restriction-free). Under Assumptions T and P, it holds with P_{θ*}^{(n)} probability at least 1 − 2/{(D − 1)² n ε_n²} that for any ε ∈ [ε_n, e^{c′ n ε_n²}] (for some constant c′ > 0),
\[
\widehat{Q}_\theta\Big( r(\theta, \theta^*) \ge \varepsilon^2 \Big) \le \frac{C \, \varepsilon_n^2}{\varepsilon^2}.
\]
In particular, this implies that for any R < e^{c′ n ε_n²},
\[
\int_{\theta : \; r(\theta, \theta^*) \le R} r(\theta, \theta^*) \, \widehat{q}_\theta(\theta) \, d\theta \le C \, \varepsilon_n^2 \, \log\big( R / \varepsilon_n^2 \big).
\]
In particular, if the sequence {ε_n : n ≥ 1} satisfies n ε_n² → ∞, then selecting ε² = M_n ε_n² for M_n → ∞ (M_n ≤ ε_n^{−2}) leads to the asymptotic variational posterior concentration:
\[
\widehat{Q}_\theta\Big( r(\theta, \theta^*) \le M_n \varepsilon_n^2 \Big) \to 1 \quad \text{as } n \to \infty.
\]
The extra truncation r(θ, θ*) ≤ R in the variational risk bound in the theorem is due to the quadratic decay of our upper bound on Q̂_θ(r(θ, θ*) ≥ ε²). Since the risk upper bound only has a logarithmic dependence on the truncation level R, one can simply set it at a fixed large number to ensure a risk bound of order O(ε_n²) in practice. In fact, this truncation can be eliminated under a stronger assumption (as in [32]) that there is a global test function φ_n: 𝒴^n → [0, 1] such that for all θ ∈ Θ satisfying r(θ, θ*) ≥ ε_n²,
\[
E_\theta[ 1 - \varphi_n ] \le e^{ - c_2 n \, r(\theta, \theta^*) }.
\]
This can be seen from Theorem 3.5 by setting F_{n,ε} = Θ and ε = ε_n in inequality (3.10), which implies
\[
c_2 \int_\Theta r(\theta, \theta^*) \, \widehat{q}_\theta(\theta) \, d\theta \le c_2 \, \varepsilon_n^2 + c_2 \int_{ r(\theta, \theta^*) \ge \varepsilon_n^2 } r(\theta, \theta^*) \, \widehat{q}_\theta(\theta) \, d\theta \le \frac{1}{n} \, \Psi_n(q_\theta, q_{S^n}) + 3 c \, \varepsilon_n^2.
\]

3.3 α-VB using stronger divergences

In this subsection, we consider an extension of our theoretical development for α-VB where the KL divergence in the objective function is replaced by a stronger divergence D̄[p ‖ q] ≥ D[p ‖ q], for example the χ² divergence or a Rényi divergence [29], and the corresponding variational approximation
\[
\bar{q}_{W^n} := \operatorname*{argmin}_{q_{W^n} \in \Gamma} \bar{D}\big[ q_{W^n}(\cdot) \,\big\|\, p(\cdot \mid Y^n) \big].
\]
As another example, in some applications of variational inference [47], the minimization of the KL divergence from the variational density q_{W^n} to the conditional density p(W^n | Y^n) may not admit a closed-form updating formula, and some surrogate ELBO L̄(q_{W^n}), a lower bound to the ELBO L(q_{W^n}), is employed. From the perspective of the ELBO decomposition (1.4), this replacement is equivalent to using a stronger metric
\[
\bar{D}\big[ q_{W^n} \,\big\|\, p(\cdot \mid Y^n) \big] := \log p(Y^n) - \bar{L}(q_{W^n}) \ge D\big[ q_{W^n} \,\big\|\, p(\cdot \mid Y^n) \big].
\]
The following theorem provides a variational Bayes risk upper bound for q̄_θ. To simplify the presentation, the theorem is stated for the α < 1 case; the extension to α = 1 is straightforward. Define the equivalent objective function
\[
\bar{\Psi}_\alpha(q_\theta, q_{S^n}) = \Psi_{n,\alpha}(q_\theta, q_{S^n}) + \Big( \bar{D}\big[ q_{W^n} \,\big\|\, p(\cdot \mid Y^n) \big] - D\big[ q_{W^n} \,\big\|\, p(\cdot \mid Y^n) \big] \Big),
\]
and the corresponding α-VB solution q̄_{θ,α} = argmin_{q_{W^n} ∈ Γ} Ψ̄_α(q_θ, q_{S^n}). When D̄ is the KL divergence D, the objective function Ψ̄_α reduces to Ψ_{n,α} in (2.6).

Theorem 3.7.
For any ζ ∈ (0, 1), it holds with P_{θ*}^{(n)} probability at least (1 − ζ) that for any probability measure q_θ ∈ Γ_θ with q_θ ≪ p_θ and any probability measure q_{S^n} ∈ Γ_{S^n} on S^n,
\[
\int \frac{1}{n} D_\alpha^{(n)}(\theta, \theta^*) \, \bar{q}_{\theta,\alpha}(d\theta) \le \frac{\alpha}{n(1-\alpha)} \, \bar{\Psi}_\alpha(q_\theta, q_{S^n}) + \frac{1}{n(1-\alpha)} \, \log(1/\zeta).
\]

This theorem provides a simple replacement rule for α-VB—if the α-VB objective function Ψ_{n,α} is replaced with an upper bound Ψ̄_α, then a variational Bayes risk bound obtained by replacing Ψ_{n,α} with the upper bound Ψ̄_α holds. We will apply this replacement rule to obtain a minimax variational risk bound for the mixture of Gaussian approximation in Section 4.

4 Applications

In this section, we apply our theory from Section 3 to concrete examples: the mean field approximation to (low and high dimensional) Bayesian linear regression, the (mixture of) Gaussian approximation to regular parametric models, the mean field approximation to Gaussian mixture models, and the mean field approximation to latent Dirichlet allocation. To simplify the presentation, all results are stated for α-VB with α < 1, and the α subscript in q̂_{θ,α} is dropped. Extensions to the α = 1 case are discussed in the supplement.

Example: (Mean field approximation to low-dimensional Bayesian linear regression)
Consider the following Bayesian linear model
\[
Y^n = X\beta + w, \qquad w \sim N(0, \sigma^2 I_n), \tag{4.1}
\]
where Y^n ∈ R^n is the n-dim response vector, X ∈ R^{n×d} the design matrix, β ∈ R^d the unknown regression coefficient vector of interest, and σ² the noise level. In this example, we consider the low-dimensional regime where d ≪ n, and focus on an independent prior p_β ⊗ p_{σ²} for the parameter pair θ = (β, σ²) for technical convenience (the result also applies to non-independent priors).

We apply the mean-field approximation by using the variational family q(β, σ²) = q_β(β) q_{σ²}(σ²) to approximate the joint α-fractional posterior distribution of θ = (β, σ²), with q̂_θ = q̂_β ⊗ q̂_{σ²}. This falls into our framework when there is no latent variable and W^n = θ. Computation-wise, a normal prior for β and an inverse gamma prior for σ² are attractive since they are "conjugate" priors — the resulting variational densities q̂_β and q̂_{σ²} still fall into the same parametric families. An application of Theorem 3.3 leads to the following result.

Corollary 4.1.
Assume that the prior density is continuous and thick around the truth θ* = (β*, σ*²), that is, p_β(β*) > 0 and p_{σ²}(σ*²) > 0. If d/n → 0 as n → ∞, then with probability tending to one as n → ∞,
\[
\bigg\{ \int h^2\big[ p(\cdot \mid \theta) \,\big\|\, p(\cdot \mid \theta^*) \big] \, \widehat{q}_\theta(\theta) \, d\theta \bigg\}^{1/2} \lesssim \sqrt{ \frac{d}{n \, \min\{\alpha, 1-\alpha\}} \, \log(dn) }.
\]

The convergence rate of O(√(n^{-1} d log(dn))) under the Hellinger distance implies that the α-VB estimator β̂_{VB,α} = ∫ β q̂_β(β) dβ converges towards β* relative to the ℓ₂ norm at rate √(n^{-1} d log(dn)) (under the condition that n^{-1} Xᵀ X has minimal eigenvalue bounded away from zero), which is the minimax rate up to logarithmic factors. A similar n^{-1/2} convergence rate has been obtained in [45] by directly analyzing the stationary point of an alternating minimization algorithm. However, their analysis requires the closed-form updating formula based on a conjugate normal prior for β and an inverse gamma prior for σ², and may not be applicable to other priors. On the other hand, Corollary 4.1 only requires the minimal conditions of prior thickness and continuity.
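For the conjugate choices mentioned above, the α-VB mean-field solution can be computed by a two-block CAVI whose updates simply carry the temperature α on the data terms, with α = 1 recovering the usual VB. The following is a hedged sketch under assumed priors β ∼ N(0, τ₀² I) and σ² ∼ InvGamma(a₀, b₀); the update algebra is standard conjugate VB, not the paper's verbatim algorithm, and all numbers are made up.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, alpha = 200, 5, 0.8
X = rng.standard_normal((n, d))
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(0, 0.5, size=n)

tau0_sq, a0, b0 = 10.0, 2.0, 2.0
a = a0 + alpha * n / 2.0                      # shape of q(sigma^2) is fixed across iterations
b = b0
XtX, Xty = X.T @ X, X.T @ y
for _ in range(100):
    e_inv_sig2 = a / b                        # E_q[1 / sigma^2]
    S = np.linalg.inv(alpha * e_inv_sig2 * XtX + np.eye(d) / tau0_sq)
    m = S @ (alpha * e_inv_sig2 * Xty)        # q(beta) = N(m, S)
    resid = y - X @ m
    e_rss = resid @ resid + np.trace(XtX @ S)   # E_q ||y - X beta||^2
    b = b0 + 0.5 * alpha * e_rss              # q(sigma^2) = InvGamma(a, b)
print("alpha-VB point estimate:", np.round(m, 2))
```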
Example: (Mean field approximation to high-dimensional Bayesian linear regression with spike-and-slab priors) In this example we continue to consider the Bayesian linear model (4.1), but we are interested in the high-dimensional regime where d ≫ n. Following the standard practice of making sparsity assumptions in the d ≫ n regime, let s ≪ n denote the sparsity level, i.e., the number of non-zero coefficients, of the true regression parameter β*. We consider the popularly used spike-and-slab priors [16] on β. Following [16], we introduce a latent indicator variable z_j = I(β_j ≠ 0) for each β_j to indicate whether the j-th covariate X_j is included in the model, and call z = (z_1, ..., z_d) ∈ {0, 1}^d the latent indicator vector. We use the notation β_z to denote the vector of nonzero components of β selected by z, that is, β_z = (β_j : z_j = 1). Consider the following sparsity inducing hierarchical prior p_{β,z} over (β, z):
\[
z_j \overset{iid}{\sim} \frac{1}{d}\,\delta_1 + \Big(1 - \frac{1}{d}\Big)\,\delta_0, \quad j = 1, \ldots, d, \qquad \beta_z \mid z \sim p_{\beta \mid z}, \quad \text{and} \quad \sigma^2 \sim p_{\sigma^2}, \tag{4.2}
\]
where the prior probability of {z_j = 1} is chosen as d^{-1} so that on average only O(1) covariates are included in the model. Let z* denote the indicator vector associated with the truth β*.

By viewing the latent indicator vector z as a parameter, we apply the block mean-field approximation [13] by using the family
\[
q(\beta, \sigma^2, z) = q_{\sigma^2}(\sigma^2) \, \prod_{j=1}^{d} q_{z_j, \beta_j}(z_j, \beta_j)
\]
to approximate the joint α-fractional posterior distribution of θ = (β, σ², z) with q̂_θ(θ) = q̂_{σ²}(σ²) ∏_{j=1}^d q̂_{z_j,β_j}(z_j, β_j). Although we have a high-dimensional latent variable vector z, the latent variable is associated with the parameter β, and not with the observation Y^n. Consequently, this variational approximation still falls into our framework without latent variables, that is, W^n = θ = (z, β) and ∆_J ≡
0. It turns out that the spike and slab prior with Gaussian slab is particularly convenient for computation — it is "conjugate" in that the resulting variational approximation falls into the same spike and slab family [13]. An application of Theorem 3.3 leads to the following result.
Corollary 4.2.
Suppose p_{β|z*} is continuous and thick at β*_{z*}, and p_{σ²} is continuous and thick at σ*². If s log d / n → 0 as n → ∞, then it holds with probability tending to one as n → ∞ that
\[
\bigg\{ \int h^2\big[ p(\cdot \mid \theta) \,\big\|\, p(\cdot \mid \theta^*) \big] \, \widehat{q}_\theta(\theta) \, d\theta \bigg\}^{1/2} \lesssim \sqrt{ \frac{s}{n \, \min\{\alpha, 1-\alpha\}} \, \log(dn) }.
\]

Corollary 4.2 implies a convergence rate √(n^{-1} s log(dn)) for the variational-Bayes estimator β̂_{VB,α} under the restricted eigenvalue condition [5], which is the minimax rate up to log terms for high-dimensional sparse linear regression. To our knowledge, [31] is the only work in the literature that studies the mean-field approximation to high-dimensional Bayesian linear regression with spike and slab priors. They show estimation consistency by directly analyzing an iterative algorithm for solving the variational optimization problem with α = 1 and a specific prior. As before, Corollary 4.2 holds under very mild conditions on the prior and does not rely on having closed-form updates of any particular algorithm.

Here, we considered the block mean-field instead of the full mean-field approximation, which further decomposes q_{z_j,β_j} into q_{z_j} ⊗ q_{β_j}. In fact, the latter resembles a ridge regression estimator, and the KL term α^{-1} D(q_θ ‖ p_θ) appearing in the upper bound in (3.2) cannot attain the minimax order √(n^{-1} s log d).
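With a Gaussian slab and known noise variance, the block mean-field factors q(z_j = 1) = γ_j and q(β_j | z_j = 1) = N(µ_j, s_j²) admit closed-form coordinate updates via standard conditionally conjugate algebra (Carbonetto–Stephens-style), with the likelihood terms tempered by α. The sketch below is a hedged illustration under assumed hyperparameters, not the paper's exact algorithm, which also places a prior on σ².

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, alpha = 100, 200, 0.95
X = rng.standard_normal((n, d))
beta_true = np.zeros(d); beta_true[:3] = [2.0, -3.0, 1.5]
y = X @ beta_true + rng.normal(0, 1.0, size=n)

sigma2, v = 1.0, 1.0                           # assumed known noise variance; slab N(0, v)
logit_prior = np.log(1.0 / (d - 1.0))          # logit of the prior inclusion probability 1/d
gamma, mu = np.full(d, 1.0 / d), np.zeros(d)   # q(z_j = 1) = gamma_j, q(beta_j | z_j = 1) = N(mu_j, s2)
xnorm2 = (X ** 2).sum(axis=0)
for _ in range(30):
    for j in range(d):
        # partial residual excluding coordinate j (coordinates enter via gamma_k * mu_k)
        r = y - X @ (gamma * mu) + X[:, j] * gamma[j] * mu[j]
        s2 = 1.0 / (alpha * xnorm2[j] / sigma2 + 1.0 / v)
        mu[j] = s2 * alpha * (X[:, j] @ r) / sigma2
        log_odds = logit_prior + 0.5 * np.log(s2 / v) + 0.5 * mu[j] ** 2 / s2
        gamma[j] = 1.0 / (1.0 + np.exp(-log_odds))
print("selected covariates:", np.where(gamma > 0.5)[0])
```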
Example: (Gaussian approximation to regular parametric models) Consider a family of regular parametric models 𝒫 = {P_θ^{(n)} : θ ∈ Θ}, where n is the sample size and the likelihood function p_θ^{(n)} is indexed by a parameter θ belonging to the parameter space Θ ⊂ R^d, which we assume to be compact. Let p_θ denote the prior density over Θ, and Y^n = (Y_1, ..., Y_n) be the observations from P_{θ*}^{(n)}, with θ* being the truth. We apply the Gaussian approximation by using the Gaussian family Γ^G (restricted to Θ),
\[
q(\theta) \propto N(\theta; \mu, \Sigma) \, I_\Theta(\theta), \qquad \mu \in \mathbb{R}^d \text{ and } \Sigma \text{ a } d \times d \text{ positive definite matrix}.
\]
Details are postponed to the supplemental document for space constraints.
Example: (Mixture of Gaussian variational approximation to regular parametric models):
Still consider the regular parametric model 𝒫 = {P_θ^{(n)} : θ ∈ Θ} as in the previous example. Now we consider the more flexible variational family Γ^MG composed of
\[
q(\theta) = \sum_{j=1}^{J} w_j \, N(\theta; \mu_j, \Sigma_j), \qquad \sum_{j=1}^{J} w_j = 1, \quad \mu_j \in \mathbb{R}^d, \quad \Sigma_j \in \mathbb{R}^{d \times d} \text{ p.d.},
\]
that is, all mixtures of Gaussians with J components, where J is a pre-specified number. Let q_j denote the j-th component N(θ; µ_j, Σ_j) of the variational density function q, and let E_p denote the expectation under a density function p. Since any probability distribution can be approximated by a mixture of Gaussians within arbitrarily small error given a sufficiently large number J of components, this enlarged variational family may reduce the approximation error from using Γ^G and become capable of capturing multimodality and heavy tail behaviour of the posterior [47]. However, this additional flexibility in shape comes at the price of intractability of the entropy term E_q[−log q(θ)]. To facilitate computation, [47] conducted an additional application of Jensen's inequality,
\[
E_q[ - \log q(\theta) ] = \sum_{j=1}^{J} w_j \, E_{q_j}[ - \log q(\theta) ] \ge - \sum_{j=1}^{J} w_j \, \log E_{q_j}[ q(\theta) ],
\]
yielding a lower bound to the ELBO:
\[
L(q) = E_q\bigg[ \log \frac{p(Y^n, \theta)}{q(\theta)} \bigg] \ge \underbrace{ E_q[ \log p(Y^n, \theta) ] - \sum_{j=1}^{J} w_j \, \log E_{q_j}[ q(\theta) ] }_{\bar{L}(q)}.
\]
L̄(q) is more convenient to work with since its second term admits a simple analytic form. Using such a surrogate ELBO places us directly in the framework of §3.3.

Corollary 4.3.
For any measure q_θ = ∑_{j=1}^J w_j q_j ∈ Γ^MG, it holds with probability at least 1 − ε that
\[
\int \Big\{ \frac{1}{n} D_\alpha^{(n)}(\theta, \theta^*) \Big\} \, \widehat{q}_\theta(\theta) \, d\theta
\le - \frac{\alpha}{n(1-\alpha)} \int_\Theta q_\theta(\theta) \, \log \frac{p(Y^n \mid \theta)}{p(Y^n \mid \theta^*)} \, d\theta
+ \frac{1}{n(1-\alpha)} \sum_{j=1}^{J} w_j \Big( \log E_{q_j}[ q_\theta(\theta) ] - E_{q_j}[ \log p_\theta(\theta) ] \Big)
+ \frac{1}{n(1-\alpha)} \, \log(1/\varepsilon). \tag{4.3}
\]
In particular, under Assumption P, it holds with probability tending to one as n → ∞ that
\[
\bigg\{ \int h^2\big[ p(\cdot \mid \theta) \,\big\|\, p(\cdot \mid \theta^*) \big] \, \widehat{q}_\theta(\theta) \, d\theta \bigg\}^{1/2} \lesssim \sqrt{ \frac{d}{n \, \min\{\alpha, 1-\alpha\}} \, \log(dn) }.
\]

As a concrete example of the development in Section 3.3, this corollary shows that the additional Jensen gap due to the E_q[−log q(θ)] term is reflected in the new variational inequality (4.3). More precisely, the KL divergence term D[q_θ ‖ p_θ] is replaced by its upper bound ∑_{j=1}^J w_j (log E_{q_j}[q_θ(θ)] − E_{q_j}[log p_θ(θ)]), which can be bounded by reducing it to the single Gaussian component case (w_1 = 1, and w_j = 0 for j = 2, ..., J).
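The "simple analytic form" of the surrogate term comes from the Gaussian convolution identity E_{q_j}[q] = ∑_l w_l N(µ_j; µ_l, Σ_j + Σ_l). The following sketch, with made-up one-dimensional numbers, evaluates this closed form and confirms numerically that −∑_j w_j log E_{q_j}[q] never exceeds the exact mixture entropy E_q[−log q].

```python
import numpy as np
from scipy.stats import norm

w = np.array([0.4, 0.6]); mu = np.array([-1.0, 2.0]); sd = np.array([0.8, 1.2])

def q_pdf(x):
    return np.sum(w * norm.pdf(x[..., None], loc=mu, scale=sd), axis=-1)

# closed form: E_{q_j}[q] = sum_l w_l N(mu_j; mu_l, sd_j^2 + sd_l^2)
e_qj_q = np.array([np.sum(w * norm.pdf(mu[j], loc=mu, scale=np.sqrt(sd[j]**2 + sd**2)))
                   for j in range(len(w))])
surrogate_entropy = -np.sum(w * np.log(e_qj_q))

rng = np.random.default_rng(6)
labels = rng.choice(len(w), size=200_000, p=w)
samples = rng.normal(loc=mu[labels], scale=sd[labels])
mc_entropy = -np.mean(np.log(q_pdf(samples)))       # Monte Carlo estimate of E_q[-log q]
assert surrogate_entropy <= mc_entropy + 1e-2       # Jensen: the surrogate is a lower bound
print(surrogate_entropy, mc_entropy)
```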
Example: (Mean field approximation to Gaussian mixture model) Suppose the true data generating model is the d-dimensional Gaussian mixture model with K components,
\[
Y \sim \sum_{k=1}^{K} \pi_k \, N(\mu_k, I_d),
\]
where µ_k ∈ R^d is the mean vector associated with the k-th component and π = (π_1, ..., π_K) ∈ S_K is the mixing probability. Here, for simplicity, we assume the covariance matrix of each Gaussian component to be I_d. µ = (µ_1, ..., µ_K) and π together form the parameter θ = (µ, π) of interest. By data augmentation, we can rewrite the model in the following hierarchical form by introducing the latent class variable S,
\[
S \sim \text{Categorical}(\pi_1, \pi_2, \ldots, \pi_K), \qquad Y \mid S = s \sim N(\mu_s, I_d).
\]
Let Y^n = (Y_1, ..., Y_n) be n i.i.d. copies of Y with parameter θ* = (µ*, π*), and let S^n = (S_1, ..., S_n) ∈ {1, ..., K}^n denote the corresponding latent variables. For simplicity, we assume that independent priors p_µ ⊗ p_π are specified for (µ, π).

We apply the mean field approximation by using the family of density functions of the form
\[
q(\pi, \mu, S^n) = q_\pi(\pi) \, q_\mu(\mu) \, q_{S^n}(s^n) = q_\pi(\pi) \, q_\mu(\mu) \, \prod_{i=1}^{n} q_{S_i}(s_i)
\]
to approximate the joint α-fractional posterior distribution of (π, µ, S^n), producing the α-mean-field approximation q̂_θ ⊗ q̂_{S^n}, where (q̂_θ, q̂_{S^n}) are defined in (2.7). This variational approximation fits into the framework of Theorem 3.4. Therefore, an application of this theorem leads to the following result.

Assumption R: (regularity condition)
There exists some constant δ > 0 such that each component of π* ∈ S_K is at least δ.

Corollary 4.4.
Suppose Assumption R holds, and the prior densities p_µ and p_π are thick and continuous at µ* and π* respectively. If d K / n → 0 as n → ∞, then it holds with probability tending to one as n → ∞ that
\[
\bigg\{ \int h^2\big[ p(\cdot \mid \theta) \,\big\|\, p(\cdot \mid \theta^*) \big] \, \widehat{q}_\theta(\theta) \, d\theta \bigg\}^{1/2} \lesssim \sqrt{ \frac{d K}{n \, \min\{\alpha, 1-\alpha\}} \, \log(dn) }.
\]

As a related result, [42] show that with proper initialization, the coordinate descent algorithm for solving the variational optimization problem (2.7) with α = 1 under conjugate priors converges to a local minimum that is O(n^{-1}) away from the maximum likelihood estimate of (µ, π), by directly analyzing the algorithm using the contraction mapping theorem. In contrast, our proof does not require any structural assumptions on the priors, and can be easily extended to a broader class of mixture models beyond Gaussians.
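A minimal implementation of the α-mean-field approximation above for one-dimensional data with unit component variances is sketched below, under assumed Dir(a₀) prior on π and N(0, τ²) priors on the µ_k. The updates follow standard conjugate CAVI algebra, with the data terms in the parameter updates tempered by α as dictated by (2.6); note that the latent-label update itself carries no tempering, since the role of q_{S^n} is confined to the Jensen gap. All numbers are made up.

```python
import numpy as np
from scipy.special import digamma, logsumexp

rng = np.random.default_rng(8)
y = np.concatenate([rng.normal(-2, 1, 150), rng.normal(3, 1, 250)])
n, K, alpha, a0, tau2 = len(y), 2, 0.9, 1.0, 25.0

r = rng.dirichlet(np.ones(K), size=n)                 # responsibilities q(S_i)
for _ in range(100):
    nk = r.sum(axis=0)
    a = a0 + alpha * nk                               # q(pi) = Dir(a)
    prec = alpha * nk + 1.0 / tau2
    m = alpha * (r * y[:, None]).sum(axis=0) / prec   # q(mu_k) = N(m_k, 1/prec_k)
    s2 = 1.0 / prec
    e_log_pi = digamma(a) - digamma(a.sum())
    e_sq = (y[:, None] - m) ** 2 + s2                 # E_q (y_i - mu_k)^2
    log_r = e_log_pi - 0.5 * e_sq
    r = np.exp(log_r - logsumexp(log_r, axis=1, keepdims=True))
print("component means:", np.round(m, 2), " weights:", np.round(a / a.sum(), 2))
```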
Example: (Mean field approximation to latent Dirichlet allocation) As our final example, we consider latent Dirichlet allocation (LDA, [11]), a conditionally conjugate probabilistic topic model [9] for uncovering the latent "topics" contained in a collection of documents. LDA treats documents as containing multiple topics, where a topic is a distribution over words in a vocabulary. Following the notation of [21], let K be a specified number of topics and V the size of the vocabulary. LDA defines the following generative process:

1. For each topic k = 1, ..., K,
   (a) draw a distribution over words β_k ∼ Dir_V(η_β).
2. For each document d = 1, ..., D,
To facilitate adaptation to sparsity using Dirichlet distributions when $V, K \gg 1$, we choose $\eta_\beta = 1/V^c$ and $\eta_\gamma = 1/K^c$ for some fixed number $c > 0$. We treat $N$ as the sample size, and $D$ as the "dimension" of the parameters in the model. Under our vanilla notation, we are interested in learning the parameters $\theta = (\pi, \mu)$, with $\pi = \{\gamma_d : d = 1, \dots, D\}$ and $\mu = \{\beta_k : k = 1, \dots, K\}$, from the posterior distribution $P(\pi, \mu, z \,|\, Y^N)$, where $S^N = \{S_n : n = 1, \dots, N\}$ with $S_n = \{z_{dn} : d = 1, \dots, D\}$ are the latent variables, $Y^N = \{Y_n : n = 1, \dots, N\}$ with $Y_n = \{w_{dn} : d = 1, \dots, D\}$ are the data, and the priors for $(\pi, \mu)$ are independent Dirichlet distributions $\mathrm{Dir}_K(\eta_\gamma)$ and $\mathrm{Dir}_V(\eta_\beta)$, whose densities are denoted by $p_\pi$ and $p_\mu$. The conditional distribution $p(Y^N \,|\, \mu, S^N)$ of the observations given the latent variables is
$$ \big( w_{dn} \,|\, \mu, z_{dn} \big) \sim \mathrm{Multi}(\beta_{z_{dn}}), \qquad d = 1, \dots, D \text{ and } n = 1, \dots, N. $$
Finally, the $\alpha$-mean-field approximation uses the family of probability density functions of the form
$$ q(\mu, \pi, S^N) = q_\pi(\pi)\, q_\mu(\mu) \prod_{n=1}^N q_{S_n}(S_n) = \prod_{k=1}^K q_{\beta_k}(\beta_k) \prod_{d=1}^D \Big( q_{\gamma_d}(\gamma_d) \prod_{n=1}^N q_{z_{dn}}(z_{dn}) \Big) $$
to approximate the joint $\alpha$-fractional posterior of $(\mu, \pi, S^N)$. Since for LDA each observation $Y_n$ is composed of $D$ independent observations, it is natural to present the variational inequality with the original loss function $D_\alpha\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big] = \sum_{d=1}^D D_\alpha\big[ p_d(\cdot|\theta) \,\|\, p_d(\cdot|\theta^*) \big]$ re-scaled by a factor of $D^{-1}$, where $p_d(\cdot|\theta)$ denotes the likelihood function of the $d$th observation $w_{dn}$ in $Y_n$. We make the following assumption.

Assumption S: (sparsity and regularity condition) Suppose that for each $k$, $\beta^*_k$ is $d_k \ll V$ sparse, and for each $d$, $\gamma^*_d$ is $e_d \ll K$ sparse. Moreover, there exists some constant $\delta > 0$ such that each non-zero component of $\beta^*_k$ or $\gamma^*_d$ is at least $\delta$.

Corollary 4.5.
Under Assumption S, it holds with probability at least $1 - C/\big\{ N \sum_{d=1}^D \varepsilon_{\gamma_d}^2 + N \sum_{k=1}^K \varepsilon_{\beta_k}^2 \big\}$ that
$$ \int \Big\{ D^{-1} D_\alpha\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big] \Big\}\, \widehat q_\theta(\theta)\, d\theta \;\lesssim\; \frac{\alpha}{1-\alpha} \Big\{ \frac{1}{D} \sum_{d=1}^D \varepsilon_{\gamma_d}^2 + \frac{1}{D} \sum_{k=1}^K \varepsilon_{\beta_k}^2 \Big\} + \frac{1}{N(1-\alpha)} \Big\{ \frac{1}{D} \sum_{d=1}^D e_d \log \frac{K}{\varepsilon_{\gamma_d}} + \frac{1}{D} \sum_{k=1}^K d_k \log \frac{V}{\varepsilon_{\beta_k}} \Big\}, $$
for any $\varepsilon_\gamma = (\varepsilon_{\gamma_1}, \dots, \varepsilon_{\gamma_D})$ and $\varepsilon_\beta = (\varepsilon_{\beta_1}, \dots, \varepsilon_{\beta_K})$. Therefore, if $\big( \sum_{d=1}^D e_d + \sum_{k=1}^K d_k \big)/(DN) \to 0$ as $N \to \infty$, then it holds with probability tending to one as $N \to \infty$ that
$$ \Big\{ \int D^{-1} h^2\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big]\, \widehat q_\theta(\theta)\, d\theta \Big\}^{1/2} \;\lesssim\; \sqrt{ \frac{\sum_{d=1}^D e_d}{DN \min\{\alpha, 1-\alpha\}} \log(DKN) + \frac{\sum_{k=1}^K d_k}{DN \min\{\alpha, 1-\alpha\}} \log(KVN) }. $$

Corollary 4.5 implies estimation consistency as long as the "effective" dimensionality $\sum_{d=1}^D e_d + \sum_{k=1}^K d_k$ of the model is $o(DN)$ as the "effective sample size" $DN \to \infty$. In addition, the upper bound depends only logarithmically on the vocabulary size $V$, due to the sparsity assumption.

Discussion

The primary motivation behind this work is to investigate whether point estimates obtained from mean-field or other variational approximations to a Bayesian posterior enjoy the same statistical accuracy as those obtained from the true posterior, and we answer the question in the affirmative for a wide range of statistical models. To that end, we have analyzed a class of variational objective functions indexed by a temperature parameter $\alpha \in (0, 1]$, with $\alpha = 1$ corresponding to the usual VB, and obtained risk bounds for the variational solution which can be used to show (near) minimax optimality of variational point estimates. Our theory was applied to a number of examples, including the mean-field approximation to Bayesian linear regression with and without variable selection, Gaussian mixture models, latent Dirichlet allocation, and (mixture of) Gaussian variational approximation in regular parametric models. This broader class of objective functions can be fitted in practice with no additional difficulty compared to the usual VB. Hence, the proposed framework leads to a class of efficient variational algorithms with statistical guarantees.

The theory for the $\alpha < 1$ case and the $\alpha = 1$ (usual VB) case leads to interesting contrasts.
For $\alpha < 1$, a prior mass condition suffices to establish the risk bounds for the Hellinger (and, more generally, Rényi) divergences. However, the $\alpha = 1$ case requires additional conditions to be verified. When all conditions are met, there is no difference in terms of the rate of convergence between $\alpha < 1$ and $\alpha = 1$. Hence, from a practical standpoint, the procedure with $\alpha < 1$ may be preferred, as it enjoys the same statistical guarantees under weaker assumptions.

Convention
As a convention, all equations defined in this supplementary document are numbered by section, e.g., (C.1), (E.1), while equations cited from the main document retain their numbering in the main document. The same convention applies to theorems, corollaries, lemmas, etc.

In §B, we provide an empirical study comparing the α-VB approach for α < 1 with the usual VB (α = 1) in the examples of §
4. In §C, we illustrate the application of our theory to continuous latent variable models. §D contains the Gaussian approximation example whose details were skipped in §4. §E provides proofs of all theoretical results.
B Numerical Examples
In this section, we illustrate the α-VB procedure through several representative simulation examples. Since the objective functions $\Psi(q_\theta)$ and $\Psi(q_\theta, q_{S^n})$ differ from the usual VB objectives only through the presence of α, standard coordinate ascent variational inference (CAVI) algorithms [7, 10] can be implemented with simple modifications to the iterative updates. We implemented α-VB for several choices of $\alpha \in (0, 1]$.

B.1 Bayesian high-dimensional linear regression
Consider the sparsity-inducing hierarchical prior $p_{\beta, z}$ over $(\beta, z)$ given by $\prod_{j=1}^d p_{\beta_j, z_j}$ with
$$ p_{\beta_j, 1} = N(\beta_j;\, 0, \nu^2 \sigma^2), \qquad p_{\beta_j, 0} = \delta_0(\beta_j); \qquad z_j \sim \mathrm{Bernoulli}(1/d), $$
where $\delta_0$ denotes the point mass measure at 0. We apply the variational approximation using the family
$$ q(\beta, \sigma^2, z) = q_{\sigma^2}(\sigma^2) \prod_{j=1}^d q_{z_j, \beta_j}(z_j, \beta_j), $$
where $q_{z_j, \beta_j}(z_j, \beta_j) = N(\beta_j;\, \mu_j, \sigma_j^2)^{z_j}\, \delta_0(\beta_j)^{1 - z_j}\, [q_{z_j}(1)]^{z_j} [q_{z_j}(0)]^{1 - z_j}$. Let $\phi_j = q_{z_j}(1)$ for $j = 1, \dots, d$.

An implementation of the α-VB algorithm for Bayesian high-dimensional linear regression (α-VB-HDR) is described in Algorithm 1, and follows the batch-wise variational Bayes algorithm in Algorithm 2 of [22].

Algorithm 1: α-VB-HDR
Set $\tilde\sigma^2 = \sigma^2/\alpha$. Initialize $(\mu_1, \dots, \mu_d)$, $(\sigma_1^2, \dots, \sigma_d^2)$, $(\phi_1, \dots, \phi_d) = (1, \dots, 1)'$, and $\Phi = \mathrm{Diag}(\phi_1, \dots, \phi_d)$.
while the ELBO has not converged do
  $(\mu_1, \dots, \mu_d)' = (X'X + \Phi/\nu^2)^{-1} X'Y$
  for $j = 1, \dots, d$ do
    $\sigma_j^2 = \tilde\sigma^2 / \big( \mathrm{Diag}(X'X)_j + \phi_j/\nu^2 \big)$
    $\phi_j = \mathrm{Logit}^{-1}\Big\{ \mathrm{Logit}(1/d) + \frac12 \log\Big( \frac{\sigma_j^2}{\nu^2 \tilde\sigma^2} \Big) + \frac{\mu_j^2}{2\sigma_j^2} \Big\}$
  end
end

We sample $n = 100$ observations from the linear regression model with $d = 500$, with the entries of the covariate matrix $X$ sampled independently from a centered normal distribution, and error standard deviation $\sigma = 1$. The first 4 coefficients are non-zero (the first equals 5 and the next two are negative), and the remaining coefficients are zero. Figure 1 shows the results of α-VB-HDR for different values of α. In all the cases, the ELBO converges within fewer than 20 iterations.

Figure 1: Plot of the 500 coefficients; the true β in red and the estimated means of β using α-VB-HDR in black, for several values of α (the rightmost panel corresponds to α = 1).
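For concreteness, here is a minimal Python sketch of the α-VB-HDR updates following Algorithm 1. The data-generation settings and the non-zero coefficient values are illustrative placeholders, the update for $\sigma_j^2$ reflects our reading of the display above, and the update for $q_{\sigma^2}$ is omitted; treat this as a sketch rather than a definitive implementation.

```python
import numpy as np

def alpha_vb_hdr(X, Y, alpha=0.5, nu2=1.0, sigma2=1.0, n_iter=50):
    """Sketch of the alpha-VB coordinate ascent of Algorithm 1
    (spike-and-slab linear regression); sigma^2 is held fixed here."""
    n, d = X.shape
    sigma2_tilde = sigma2 / alpha          # fractional likelihood inflates noise variance
    phi = np.ones(d)                       # phi_j = q_{z_j}(1), inclusion probabilities
    XtX, XtY = X.T @ X, X.T @ Y
    diag_XtX = np.diag(XtX)
    logit = lambda p: np.log(p / (1 - p))
    inv_logit = lambda t: 1 / (1 + np.exp(-t))
    for _ in range(n_iter):
        mu = np.linalg.solve(XtX + np.diag(phi) / nu2, XtY)
        s2 = sigma2_tilde / (diag_XtX + phi / nu2)
        phi = inv_logit(logit(1 / d)
                        + 0.5 * np.log(s2 / (nu2 * sigma2_tilde))
                        + mu**2 / (2 * s2))
    return mu, s2, phi

# Illustrative simulation (placeholder settings)
rng = np.random.default_rng(0)
n, d = 100, 500
X = rng.normal(size=(n, d))
beta_true = np.zeros(d)
beta_true[:4] = [5.0, -4.0, -3.0, 2.0]     # hypothetical non-zero values
Y = X @ beta_true + rng.normal(size=n)
mu, s2, phi = alpha_vb_hdr(X, Y, alpha=0.5)
```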
B.2 Gaussian mixture models

We sample $n = 1000$ bivariate observations from
$$ Y \sim \sum_{k=1}^K \pi_k\, N(\mu_k, I), $$
with $\pi_k = 1/3$ for $k = 1, \dots, K = 3$, and the $\mu_k$ drawn from $N(0, I)$ for $k = 1, \dots, 3$. Let $\pi(\mu_k) = N(\mu_0, \sigma_0^2 I)$ denote the prior; we use $\mu_0 = (0, 0)'$ and $\sigma_0^2 = 50$. For simplicity, we assume the $\pi_k$ to be known in this study. We apply the mean-field approximation using the family of density functions of the form
$$ q(\mu, S^n) = q_\mu(\mu)\, q_{S^n}(s^n) = q_\mu(\mu) \prod_{i=1}^n q_{S_i}(s_i). $$
Following [10], we develop an α-VB algorithm for Gaussian mixture models (α-VB-GMM), described in Algorithm 2; a Python sketch follows below. The derivation follows very closely the case $\alpha = 1$, and hence the details are omitted. Numerical results are shown in Figure 2. In all the cases, the ELBO converges within fewer than 10 iterations. It is evident that for α close to 1, α-VB-GMM recovers the true density almost perfectly.

Algorithm 2: α-VB-GMM
Initialize $\tilde\mu_k, \tilde\sigma_k^2$ for $k = 1, \dots, K$, and $s_i$ for $i = 1, \dots, n$.
while the ELBO has not converged do
  for $i = 1, \dots, n$ do
    $q_{S_i}(s_i) \propto \exp\big\{ \alpha \log \pi_{s_i} + \alpha\, y_i^\top E(\mu_{s_i}) - \alpha\, E(\|\mu_{s_i}\|^2)/2 \big\}$
  end
  for $k = 1, \dots, K$ do
    Update
    $\tilde\mu_k = \dfrac{\mu_0/\sigma_0^2 + \alpha \sum_{i=1}^n q_{S_i}(s_i = k)\, y_i}{1/\sigma_0^2 + \alpha \sum_{i=1}^n q_{S_i}(s_i = k)}, \qquad \tilde\sigma_k^2 = \dfrac{1}{1/\sigma_0^2 + \alpha \sum_{i=1}^n q_{S_i}(s_i = k)}$
    Set $q_{\mu_k} = N(\mu_k;\, \tilde\mu_k, \tilde\sigma_k^2 I)$
  end
end
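The following minimal Python sketch implements the Algorithm 2 updates under the stated simplifications (known, equal mixing weights; identity component covariances). The α-scaling of the data terms is our reconstruction of the partly garbled display above, so it should be read as a sketch under those assumptions.

```python
import numpy as np

def alpha_vb_gmm(Y, K=3, alpha=0.95, mu0=None, s0sq=50.0, n_iter=25):
    """Sketch of the alpha-VB-GMM coordinate ascent of Algorithm 2."""
    n, d = Y.shape
    mu0 = np.zeros(d) if mu0 is None else mu0
    mu_t = np.random.default_rng(0).normal(size=(K, d))  # initialize tilde-mu_k
    s_t = np.ones(K)                                     # initialize tilde-sigma_k^2
    log_pi = np.log(np.full(K, 1.0 / K))                 # mixing weights assumed known
    for _ in range(n_iter):
        # Update q_{S_i}: alpha-tempered responsibilities
        E_norm2 = np.sum(mu_t**2, axis=1) + d * s_t      # E||mu_k||^2 under q_{mu_k}
        logits = alpha * (log_pi + Y @ mu_t.T - 0.5 * E_norm2)
        logits -= logits.max(axis=1, keepdims=True)      # numerical stabilization
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # Update q_{mu_k} = N(tilde-mu_k, tilde-sigma_k^2 I)
        Nk = resp.sum(axis=0)
        prec = 1.0 / s0sq + alpha * Nk
        mu_t = (mu0 / s0sq + alpha * (resp.T @ Y)) / prec[:, None]
        s_t = 1.0 / prec
    return mu_t, s_t, resp
```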
Figure 2: Contour plots of the true and predicted densities using α-VB-GMM; panel (a) shows the true mixture density (the colors representing the different cluster components), and the remaining panels show the predicted densities for increasing values of α, with α = 0.95 in panel (c) and α = 1 in panel (d).
B.3 Latent Dirichlet Allocation

We implemented a version of LDA that is exactly the same as that in Section 5.2 of [11]. The approach is the same as the one described here, with one minor difference: the parameter $\eta_\gamma$ is set to $1/K$, but $\eta_\beta$ is estimated using the empirical Bayes approach described in Section 5.3 of [11] instead of being fixed at $1/V$. To implement α-VB, we note that the only change required is to Equation (6) of Section 5.2, where we replace $\phi_{ni} \propto \beta_{i w_n} \exp\{ E_q[\log \theta_i \,|\, \gamma] \}$ with $\phi_{ni} \propto \beta_{i w_n} \exp\{ \alpha\, E_q[\log \theta_i \,|\, \gamma] \}$ (a code sketch of this one-line change is given below). We provide an illustrative example of the use of an LDA model on real data comprising the first 5 of 16,000 documents from a subset of the TREC AP corpus [19]. The maximum number of topics is set to 10. The top words from some of the resulting multinomial distributions $p(w \,|\, z)$ are shown in Table 1. The distributions seem to capture some of the underlying topics in the corpus, with decreasing word similarity as α decreases.
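As an illustration of the tempered update, here is a hedged NumPy sketch; the array names and shapes are our own conventions, and the expectation formula is the standard Dirichlet identity used in [11].

```python
import numpy as np
from scipy.special import digamma

def update_phi(beta, gamma, words, alpha=0.5):
    """Tempered version of the phi-update in Eq. (6) of Section 5.2 of [11]:
    phi_{ni} proportional to beta_{i, w_n} * exp(alpha * E_q[log theta_i | gamma]).
    beta: (K, V) topic-word probabilities; gamma: (K,) Dirichlet parameters;
    words: (N,) integer word indices for one document."""
    # E_q[log theta_i | gamma] = digamma(gamma_i) - digamma(sum_j gamma_j)
    E_log_theta = digamma(gamma) - digamma(gamma.sum())
    log_phi = np.log(beta[:, words].T) + alpha * E_log_theta   # shape (N, K)
    log_phi -= log_phi.max(axis=1, keepdims=True)              # stabilize
    phi = np.exp(log_phi)
    return phi / phi.sum(axis=1, keepdims=True)
```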
Table 1: Top 5 words for each of the 10 extracted topics; each successive block of 5 rows corresponds to a different value of α.

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10

history police year peres liberace school classroom i peres first
ago shot people israel back teacher teacher mrs official year
york gun get bechtel mrs guns boy jewelry rappaport minister
president students first offer museum boys shot museum pipeline new
todays door just memo man saturday baptist bloomberg offer invasion

history police year peres liberace school shot i peres year
president students people offer mrs teacher classroom police official first
ago school get bechtel bloomberg guns teacher mrs rappaport new
york gun volunteers memo back shot baptist museum pipeline invasion
todays yearold mail israel door boys marino jewelry offer minister

history school first memo liberace first shot mrs peres people
president police year effect door year baptist i offer get
ago teacher just wage back day marino police official year
first students died quoted mrs died teacher museum rappaport thompson
year boys day bechtel bloomberg people kids bloomberg bechtel program

ago police get memo liberace teacher shot i peres people
president school volunteers bechtel mrs school police police official year
history students year peres bloomberg shot baptist mrs offer thompson
first teacher mail offer back guns teacher museum rappaport program
year boys people israel door students classroom jewelry pipeline get
C Extension to continuous latent variables
As discussed in the main draft, we extend our results on mean-field approximations from models with discrete latent variables to models with continuous latent variables. For simplicity, we focus only on the case $\alpha < 1$. Relative to the objective with temperature $\alpha$ in (2.6), the only change takes place in the quantity $\Delta_J$, where the approximation to the likelihood is now made with continuous latent variables. We present a version where i.i.d. copies $T^n = (T_1, \dots, T_n)$ of the latent variable $T \in \mathcal T$ are continuous and there are no restrictions on the variational factor $q_{T_i}$ for each latent variable $T_i$. In this setting, the α-VB objective function simplifies to
$$ \Psi_\alpha(q_\theta, q_{T^n}) = -\int_\Theta q_\theta(\theta) \sum_{i=1}^n \int_{\mathcal T} q_{T_i}(t_i) \log \frac{p(Y_i \,|\, \mu, t_i)\, p_{T_i}(t_i \,|\, \pi)}{p(Y_i \,|\, \theta^*)\, q_{T_i}(t_i)}\, dt_i\, d\theta + \alpha^{-1} D(q_\theta \,\|\, p_\theta), \tag{C.1} $$
where we assume that the distribution family $p_T(\cdot \,|\, \pi)$ of the latent variable is indexed by its own parameter $\pi$; recall that $\mu$ is the parameter in the likelihood function $p(Y \,|\, \mu, T)$ of the response $Y$ given the latent variable $T$, and $\theta = (\pi, \mu)$ collects the parameters.

Similar to the discrete case, for continuous latent variables we define the following two KL neighborhoods of $\pi^*$ and $\mu^*$:
$$ B^{\mathrm{con}}_n(\pi^*, \varepsilon_\pi) = \Big\{ D\big[ p_T(\cdot|\pi^*) \,\|\, p_T(\cdot|\pi) \big] \le \varepsilon_\pi^2, \; V\big[ p_T(\cdot|\pi^*) \,\|\, p_T(\cdot|\pi) \big] \le \varepsilon_\pi^2 \Big\}, $$
$$ B^{\mathrm{con}}_n(\mu^*, \varepsilon_\mu) = \Big\{ \sup_t D\big[ p(\cdot|\mu^*, t) \,\|\, p(\cdot|\mu, t) \big] \le \varepsilon_\mu^2, \; \sup_t V\big[ p(\cdot|\mu^*, t) \,\|\, p(\cdot|\mu, t) \big] \le \varepsilon_\mu^2 \Big\}. $$
We now state a theorem with the same combined conclusions as Corollary 3.2 and Theorem 3.4 for the continuous case. The proof is similar and hence omitted.
Theorem C.1.
For any measure $q_\theta$ over $\theta$ satisfying $q_\theta \ll p_\theta$, it holds with probability at least $(1 - \zeta)$ that
$$ \int \Big\{ D_\alpha\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big] \Big\}\, \widehat q_{\theta,\alpha}(\theta)\, d\theta \le -\frac{\alpha}{n(1-\alpha)} \int_\Theta q_\theta(\theta) \sum_{i=1}^n \Big\{ \int_{\mathcal T} \widetilde q_{T_i}(t_i) \log \frac{p(Y_i|\mu, t_i)\, p_{T_i}(t_i|\pi)}{p(Y_i|\mu^*, t_i)\, p_{T_i}(t_i|\pi^*)}\, dt_i \Big\}\, d\theta + \frac{1}{n(1-\alpha)} D(q_\theta \,\|\, p_\theta) + \frac{1}{n(1-\alpha)} \log(1/\zeta), \tag{C.2} $$
where $\widetilde q_{T_i}$ is the probability distribution over $\mathcal T$ satisfying
$$ \widetilde q_{T_i}(t_i) = \frac{p_{T_i}(t_i|\pi^*)\, p(Y_i|\mu^*, t_i)}{p(Y_i|\theta^*)}, \qquad t_i \in \mathcal T. \tag{C.3} $$
Moreover, for any fixed $(\varepsilon_\pi, \varepsilon_\mu) \in (0, 1)^2$, with $P_{\theta^*}$ probability at least $1 - 2/\{(D-1)^2\, n\, (\varepsilon_\pi^2 + \varepsilon_\mu^2)\}$, it holds that
$$ \int \Big\{ D_\alpha\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big] \Big\}\, \widehat q_{\theta,\alpha}(\theta)\, d\theta \le \frac{D\alpha}{1-\alpha}(\varepsilon_\pi^2 + \varepsilon_\mu^2) + \Big\{ -\frac{1}{n(1-\alpha)} \log P_\pi\big[ B^{\mathrm{con}}_n(\pi^*, \varepsilon_\pi) \big] \Big\} + \Big\{ -\frac{1}{n(1-\alpha)} \log P_\mu\big[ B^{\mathrm{con}}_n(\mu^*, \varepsilon_\mu) \big] \Big\}. \tag{C.4} $$

In the presence of continuous latent variables, if the mean-field variational family is further constrained by restricting each factor $q_{T_i}$ corresponding to the latent variable $T_i$ to belong to a parametric family $\Gamma_{T_i}$, such as an exponential family, then the Bayes risk bound of Theorem C.1 still applies as long as the family $\Gamma_{T_i}$ contains densities of the form (C.3); this is the case if the conditional distribution $p(T_i \,|\, \pi)$ also belongs to $\Gamma_{T_i}$ and the model $p(Y_i \,|\, \mu, T_i)$ is conjugate with respect to the family $\Gamma_{T_i}$.

D Gaussian approximation to regular parametric models
We discuss the details of this example from §4 of the main document. Consider a regular parametric model $\mathcal P = \{P^{(n)}_\theta : \theta \in \Theta\}$, where $n$ is the sample size and the likelihood function $p^{(n)}_\theta$ is indexed by a parameter $\theta$ belonging to the parameter space $\Theta \subset \mathbb R^d$, which we assume to be compact. Let $p_\theta$ denote the prior density over $\Theta$, and let $Y^n = (Y_1, \dots, Y_n)$ be the observations from $P^{(n)}_{\theta^*}$, with $\theta^*$ being the truth. We apply the Gaussian approximation using the Gaussian family $\Gamma^G$ (restricted to $\Theta$),
$$ q(\theta) \propto N(\theta;\, \mu, \Sigma)\, I_\Theta(\theta), \qquad \mu \in \mathbb R^d \text{ and } \Sigma \text{ a } d \times d \text{ positive definite matrix}. $$
The Gaussian variational approximation $\widehat q_\theta$ is defined as
$$ \widehat q_\theta := \operatorname*{argmin}_{q_\theta \in \Gamma^G} \Big\{ -\alpha \int_\Theta q_\theta(\theta) \log p^{(n)}_\theta(Y^n)\, d\theta + D(q_\theta \,\|\, p_\theta) \Big\}. $$
We make the following assumption.
The prior density p θ satisfies inf θ ∈ Θ p θ ( θ ) >
0, and there exists some constant C such that D (cid:2) p ( · | θ ) (cid:12)(cid:12)(cid:12)(cid:12) p ( · | θ ) (cid:3) ≤ (cid:107) θ − θ (cid:107) and V (cid:2) p ( · | θ ) (cid:12)(cid:12)(cid:12)(cid:12) p ( · | θ ) (cid:3) ≤ C (cid:107) θ − θ (cid:107) holds for all ( θ , θ ) ∈ Θ . Corollary D.1.
Under Assumption P, it holds with probability tending to one as $n \to \infty$ that
$$ \Big\{ \int h^2\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big]\, \widehat q_\theta(\theta)\, d\theta \Big\}^{1/2} \lesssim \sqrt{ \frac{d}{n \min\{\alpha, 1-\alpha\}} \log(dn) }. $$

Under the model identifiability condition $h^2\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big] \gtrsim \|\theta - \theta^*\|^2$, Corollary D.1 implies a convergence rate of $\sqrt{n^{-1} d \log(dn)}$ for the variational-Bayes point estimator $\widehat\theta_B$ of $\theta$. By examining the proof of the corollary, we find that the normality of the variational family does not play a critical role in the proof: similar Bayes risk upper bounds hold, under some additional conditions, for a broader class of variational distributions as well, such as any location-scale family with sub-exponential tails. It is a well-known fact [41, 43] that the covariance matrices from variational approximations are typically "too small" compared with those of the sampling distribution of the maximum likelihood estimator, which, combined with the Bernstein-von Mises theorem, implies that the variational approximation $\widehat q_\theta$ may not converge to the true posterior distribution. This fact, combined with the result in Corollary D.1, indicates that: 1. minimizing the KL divergence over the variational family forces the variational distribution $\widehat q_\theta$ to concentrate around the truth $\theta^*$ at the optimal rate (due to the heavy penalty on the tails in the KL divergence); 2. however, the local shape of $\widehat q_\theta$ around $\theta^*$ can be far from that of the true posterior, due to the mismatch between the distributions in the variational family and the true posterior.
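To make the Gaussian variational approximation concrete, here is a hedged Python sketch that fits $q(\theta) = N(m, \mathrm{diag}(s^2))$ by gradient ascent on the α-weighted ELBO for a simple Gaussian location model, so that the expected log-likelihood and the KL term are available in closed form. The model, step size, and iteration count are illustrative choices of ours, not the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: Y_i ~ N(theta, I_d), prior theta ~ N(0, tau2 * I_d)
d, n, tau2, alpha = 2, 200, 10.0, 0.5
theta_star = np.array([1.0, -0.5])                  # hypothetical truth
Y = theta_star + rng.normal(size=(n, d))

# Variational family q = N(m, diag(exp(2*log_s))); maximize
#   alpha * E_q[log p(Y^n | theta)] - KL(q || prior)
m, log_s = np.zeros(d), np.zeros(d)
lr = 1e-3
for _ in range(2000):
    s2 = np.exp(2 * log_s)
    grad_m_lik = (Y - m).sum(axis=0)                # d/dm of E_q[log-lik]
    grad_s2_lik = -0.5 * n * np.ones(d)             # d/ds2 of E_q[log-lik]
    # KL(N(m, s2) || N(0, tau2)) = 0.5 * sum((s2 + m^2)/tau2 - 1 - log(s2/tau2))
    grad_m_kl = m / tau2
    grad_s2_kl = 0.5 * (1 / tau2 - 1 / s2)
    m += lr * (alpha * grad_m_lik - grad_m_kl)
    log_s += lr * (alpha * grad_s2_lik - grad_s2_kl) * 2 * s2  # chain rule: ds2/dlog_s = 2*s2
print("variational mean:", m, "variational sd:", np.sqrt(np.exp(2 * log_s)))
```

At the optimum this recovers the fractional-posterior moments $m = \alpha n \bar y / (\alpha n + 1/\tau^2)$ and $s^2 = 1/(\alpha n + 1/\tau^2)$ coordinate-wise, illustrating the $n^{-1/2}$-scale concentration discussed above.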
E Proofs

In this section, we present the proofs of all technical results in the main document.
E.1 Proof of Theorem 3.1
We first state a key variational lemma that plays a critical role in the proof.
Lemma E.1.
Let $\mu_\theta$ be a probability measure over $\theta$ and $\mu_{S^n}$ a probability measure over $S^n$, and let $h(\theta, S^n)$ be a measurable function such that for any fixed $S^n$, $e^{h(\cdot, S^n)} \in L_1(\mu_\theta)$. Then
$$ \log \int \sum_{s^n} e^{h(\theta, s^n)}\, \mu_{S^n}(s^n)\, \mu_\theta(d\theta) = \sup_{\rho^n(\theta, S^n)} \Big[ \int \sum_{s^n} h(\theta, s^n)\, \rho^n(d\theta, s^n) - D\big( \rho^n(\theta, S^n) \,\|\, \mu_\theta \otimes \mu_{S^n} \big) \Big], $$
where the supremum is over all probability measures $\rho^n(\theta, S^n) \ll \mu_\theta \otimes \mu_{S^n}$. Further, the supremum on the right-hand side is attained when
$$ \frac{\rho^n(d\theta, s^n)}{\mu_\theta(d\theta)\, \mu_{S^n}(s^n)} = \frac{e^{h(\theta, s^n)}}{\int \sum_{s^n} e^{h(\theta, s^n)}\, \mu_{S^n}(s^n)\, \mu_\theta(d\theta)}. $$

Proof.
Use the well-known variational dual representation of the KL divergence (see, e.g., Corollary 4.15 of [12]), which states that for any probability measure $\mu$ and any measurable function $h$ with $e^h \in L_1(\mu)$, one has
$$ \log \int e^{h(\eta)}\, \mu(d\eta) = \sup_{\rho \ll \mu} \Big[ \int h(\eta)\, \rho(d\eta) - D(\rho \,\|\, \mu) \Big], $$
where the supremum is over all probability distributions $\rho \ll \mu$, and equality is attained when $d\rho/d\mu \propto e^h$. This fact simply follows upon an application of Jensen's inequality. In the current context, set $\eta = (\theta, s^n)$, $\mu = \mu_\theta \otimes \mu_{S^n}$, and $\rho(d\eta) = \rho^n(d\theta, s^n)$ to obtain the conclusion of Lemma E.1.

We now return to the proof of the theorem. By applying Jensen's inequality to the concave function $x \mapsto x^\alpha$ ($\alpha < 1$), we obtain, for any $q_{S^n}$,
$$ \begin{aligned} E_{\theta^*}\Big[ \sum_{s^n} q_{S^n}(s^n) \exp\Big\{ \alpha \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} \Big\} \Big] &= \int_{\mathbb R^n} \sum_{s^n} q_{S^n}(s^n) \Big\{ \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} \Big\}^\alpha p(Y^n|\theta^*)\, dY^n \\ &\le \int_{\mathbb R^n} \Big\{ \sum_{s^n} q_{S^n}(s^n)\, \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} \Big\}^\alpha p(Y^n|\theta^*)\, dY^n \\ &= \int \Big\{ \frac{p(Y^n|\theta)}{p(Y^n|\theta^*)} \Big\}^\alpha p(Y^n|\theta^*)\, dY^n = e^{-(1-\alpha)\, D^{(n)}_\alpha(\theta, \theta^*)}, \end{aligned} $$
with $D^{(n)}_\alpha(\theta, \theta^*)$ as defined in the main document. Hence, for any $\zeta \in (0, 1)$,
$$ E_{\theta^*}\Big[ \sum_{s^n} q_{S^n}(s^n) \exp\Big\{ \alpha \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + (1-\alpha)\, D^{(n)}_\alpha(\theta, \theta^*) - \log(1/\zeta) \Big\} \Big] \le \zeta. $$
Integrating both sides of this inequality with respect to $p_\theta$ and interchanging the integrals using Fubini's theorem, we obtain
$$ E_{\theta^*}\Big[ \int_\Theta \sum_{s^n} p_\theta(\theta)\, q_{S^n}(s^n) \exp\Big\{ \alpha \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + (1-\alpha)\, D^{(n)}_\alpha(\theta, \theta^*) - \log(1/\zeta) \Big\}\, d\theta \Big] \le \zeta. $$
Now, apply Lemma E.1 with $\mu_\theta = p_\theta$, $\mu_{S^n} = q_{S^n}$, and
$$ h(\theta, s^n) = \alpha \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + (1-\alpha)\, D^{(n)}_\alpha(\theta, \theta^*) - \log(1/\zeta), $$
to obtain
$$ E_{\theta^*} \exp \sup_{\rho(\theta, S^n)} \Big[ \int_\Theta \sum_{s^n} \Big\{ \alpha \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + (1-\alpha)\, D^{(n)}_\alpha(\theta, \theta^*) - \log(1/\zeta) \Big\}\, \rho(d\theta, s^n) - D(\rho \,\|\, p_\theta \otimes q_{S^n}) \Big] \le \zeta. $$
If we choose $\rho = q_\theta \otimes q_{S^n}$ in the preceding display, for any (possibly data-dependent) $q_\theta \ll p_\theta$, then
$$ E_{\theta^*} \exp \Big[ \int_\Theta \sum_{s^n} \Big\{ \alpha \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + (1-\alpha)\, D^{(n)}_\alpha(\theta, \theta^*) - \log(1/\zeta) \Big\}\, q_\theta(d\theta)\, q_{S^n}(s^n) - D(q_\theta \,\|\, p_\theta) \Big] \le \zeta. $$
By applying Markov's inequality, we further obtain that with $P_{\theta^*}$ probability at least $(1 - \zeta)$,
$$ (1-\alpha) \int D^{(n)}_\alpha(\theta, \theta^*)\, q_\theta(d\theta) \le -\alpha \int_\Theta \sum_{s^n} \Big\{ \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} \Big\}\, q_\theta(d\theta)\, q_{S^n}(s^n) + D(q_\theta \,\|\, p_\theta) + \log(1/\zeta) = \alpha\, \Psi_\alpha(q_\theta, q_{S^n}) + \log(1/\zeta), $$
since, from (2.2)–(2.3) and (2.6),
$$ \Psi_\alpha(q_\theta, q_{S^n}) = -\int \sum_{s^n} \Big[ q_{S^n}(s^n) \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} \Big]\, q_\theta(d\theta) + \alpha^{-1} D(q_\theta \,\|\, p_\theta). $$
Since the inequality in the penultimate display holds for any (possibly data-dependent) $q_\theta \ll p_\theta$ and $q_{S^n}$, we obtain, in particular,
$$ (1-\alpha) \int D^{(n)}_\alpha(\theta, \theta^*)\, \widehat q_{\theta,\alpha}(d\theta) \le \alpha\, \Psi_\alpha(\widehat q_{\theta,\alpha}, \widehat q_{S^n,\alpha}) + \log(1/\zeta). $$
The conclusion of the theorem follows since $\Psi_\alpha(\widehat q_{\theta,\alpha}, \widehat q_{S^n,\alpha}) \le \Psi_\alpha(q_\theta, q_{S^n})$ for any $q_\theta \ll p_\theta$ and $q_{S^n}$.
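As a quick sanity check of the variational dual representation used above (in its basic form, as in Corollary 4.15 of [12]), the following small Python sketch verifies numerically, on a finite state space, that $\log \sum_\eta e^{h(\eta)} \mu(\eta) = \sum_\eta h(\eta)\, \rho^*(\eta) - D(\rho^* \,\|\, \mu)$ at the optimizer $\rho^* \propto \mu\, e^h$. This is an illustration we added, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Finite state space: mu a probability vector, h an arbitrary function
m = 6
mu = rng.dirichlet(np.ones(m))
h = rng.normal(size=m)

lhs = np.log(np.sum(np.exp(h) * mu))   # log of the integral of e^h against mu

rho = mu * np.exp(h)                   # optimizer rho* proportional to mu * e^h
rho /= rho.sum()
kl = np.sum(rho * np.log(rho / mu))    # D(rho* || mu)
rhs = np.sum(rho * h) - kl             # dual objective evaluated at rho*

assert np.isclose(lhs, rhs)            # equality is attained at rho*
```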
E.2 Proof of Theorem 3.4

We choose $q_\theta$ to be the probability density function $q^*_\theta$ of
$$ Q^*_\theta = \frac{P_\pi\big[\, \cdot \cap B_n(\pi^*, \varepsilon_\pi) \big] \otimes P_\mu\big[\, \cdot \cap B_n(\mu^*, \varepsilon_\mu) \big]}{P_\pi\big[ B_n(\pi^*, \varepsilon_\pi) \big] \cdot P_\mu\big[ B_n(\mu^*, \varepsilon_\mu) \big]}, $$
the product measure of the restrictions of the priors $(P_\pi, P_\mu)$ for $(\pi, \mu)$ to the two KL neighborhoods $B_n(\pi^*, \varepsilon_\pi)$ and $B_n(\mu^*, \varepsilon_\mu)$ around $(\pi^*, \mu^*)$.

Next, we characterize the first two moments of the first term on the right-hand side of inequality (3.2) under this choice of $q^*_\theta$. By applying Fubini's theorem, we have
$$ E_{\theta^*}\Big[ \int_\Theta q^*_\theta(\theta) \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}}\, d\theta \Big] = \int_\Theta E_{\theta^*}\Big[ \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}} \Big]\, q^*_\theta(\theta)\, d\theta. $$
By plugging in the expression for $\widetilde q_{S_i}(s_i)$ and applying Fubini's theorem, we obtain
$$ E_{\theta^*}\Big[ \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}} \Big] = n\, E_{\theta^*}\Big[ \widetilde q_{S_1}(s_1) \log \frac{p(Y_1|\mu, s_1)\, \pi_{s_1}}{p(Y_1|\mu^*, s_1)\, \pi^*_{s_1}} \Big] = -n\, D(\pi^* \,\|\, \pi) - n \sum_s \pi^*_s\, D\big[ p(\cdot|\mu^*, s) \,\|\, p(\cdot|\mu, s) \big], $$
where we recall the shorthand $D(\pi^* \,\|\, \pi) = \sum_s \pi^*_s \log(\pi^*_s/\pi_s)$ for the KL divergence between categorical distributions with parameters $\pi^*$ and $\pi$. Combining the two preceding displays and invoking the definitions of $B_n(\pi^*, \varepsilon_\pi)$ and $B_n(\mu^*, \varepsilon_\mu)$, we obtain
$$ E_{\theta^*}\Big[ -\int_\Theta q^*_\theta(\theta) \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}}\, d\theta \Big] \le n\varepsilon_\pi^2 + n\varepsilon_\mu^2. $$
Similarly, by applying Fubini's theorem, we have
$$ \begin{aligned} \mathrm{Var}_{\theta^*}\Big[ \int_\Theta q^*_\theta(\theta) \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}}\, d\theta \Big] &= n\, \mathrm{Var}_{\theta^*}\Big[ \int_\Theta q^*_\theta(\theta) \sum_s \widetilde q_{S_1}(s) \log \frac{p(Y_1|\mu, s)\, \pi_s}{p(Y_1|\mu^*, s)\, \pi^*_s}\, d\theta \Big] \\ &\le n\, E_{\theta^*}\Big[ \int_\Theta q^*_\theta(\theta) \sum_s \widetilde q_{S_1}(s) \log \frac{p(Y_1|\mu, s)\, \pi_s}{p(Y_1|\mu^*, s)\, \pi^*_s}\, d\theta \Big]^2 \\ &\stackrel{(i)}{\le} n \int_\Theta E_{\theta^*}\Big[ \sum_s \widetilde q_{S_1}(s) \log \frac{p(Y_1|\mu, s)\, \pi_s}{p(Y_1|\mu^*, s)\, \pi^*_s} \Big]^2 q^*_\theta(\theta)\, d\theta, \end{aligned} $$
where step (i) follows from Jensen's inequality and Fubini's theorem. By plugging in the expression for $\widetilde q_{S_i}(s_i)$ and applying Fubini's theorem, we obtain
$$ \int_\Theta E_{\theta^*}\Big[ \sum_s \widetilde q_{S_1}(s) \log \frac{p(Y_1|\mu, s)\, \pi_s}{p(Y_1|\mu^*, s)\, \pi^*_s} \Big]^2 q^*_\theta(\theta)\, d\theta \le 2\, V(\pi^* \,\|\, \pi) + 2 \sum_s \pi^*_s\, V\big[ p(\cdot|\mu^*, s) \,\|\, p(\cdot|\mu, s) \big], $$
where we recall the shorthand $V(\pi^* \,\|\, \pi) = \sum_s \pi^*_s \log^2(\pi^*_s/\pi_s)$ for the V-divergence between categorical distributions with parameters $\pi^*$ and $\pi$, and we applied the inequality $(x+y)^2 \le 2x^2 + 2y^2$. By combining the two preceding displays and invoking the definitions of $B_n(\pi^*, \varepsilon_\pi)$ and $B_n(\mu^*, \varepsilon_\mu)$, we obtain
$$ \mathrm{Var}_{\theta^*}\Big[ \int_\Theta q^*_\theta(\theta) \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}}\, d\theta \Big] \le 2n\varepsilon_\pi^2 + 2n\varepsilon_\mu^2. $$
An application of Chebyshev's inequality then yields
$$ \begin{aligned} & P_{\theta^*}\Big\{ \int_\Theta q^*_\theta(\theta) \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}}\, d\theta \le -Dn(\varepsilon_\pi^2 + \varepsilon_\mu^2) \Big\} \\ &\stackrel{(i)}{\le} P_{\theta^*}\Big\{ \int_\Theta q^*_\theta(\theta) \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}}\, d\theta - E_{\theta^*}\Big[ \int_\Theta q^*_\theta(\theta) \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}}\, d\theta \Big] \le -(D-1)\, n(\varepsilon_\pi^2 + \varepsilon_\mu^2) \Big\} \\ &\le \frac{ \mathrm{Var}_{\theta^*}\big[ \int_\Theta q^*_\theta(\theta) \sum_{i=1}^n \sum_{s_i} \widetilde q_{S_i}(s_i) \log \frac{p(Y_i|\mu, s_i)\, \pi_{s_i}}{p(Y_i|\mu^*, s_i)\, \pi^*_{s_i}}\, d\theta \big] }{ (D-1)^2 n^2 (\varepsilon_\pi^2 + \varepsilon_\mu^2)^2 } \stackrel{(ii)}{\le} \frac{2}{(D-1)^2\, n\, (\varepsilon_\pi^2 + \varepsilon_\mu^2)}, \end{aligned} $$
where in steps (i) and (ii) we have respectively used the derived first and second moment bounds. Finally, we have
$$ D(q^*_\theta \,\|\, p_\theta) = -\Big[ \log P_\pi\big[ B_n(\pi^*, \varepsilon_\pi) \big] + \log P_\mu\big[ B_n(\mu^*, \varepsilon_\mu) \big] \Big], $$
since for any probability measure $\mu$, a measurable set $A$ with $\mu(A) >$
0, and $\widetilde\mu(\cdot) = \mu(\cdot \cap A)/\mu(A)$ the restriction of $\mu$ to $A$, we have $D(\widetilde\mu \,\|\, \mu) = -\log \mu(A)$.

The claimed bound in the theorem is now a direct consequence of the two preceding displays and Corollary 3.2 with the choice $q_\theta = q^*_\theta$.

E.3 Proof of Theorem 3.5
Recall that $\ell_n(\theta) = \log p(Y^n|\theta)$ is the marginal log-likelihood function (after marginalizing out the latent variables), and $\ell_n(\theta, \theta^*) = \ell_n(\theta) - \ell_n(\theta^*)$ is the log-likelihood ratio. Clearly, $E_{\theta^*} \exp\{\ell_n(\theta, \theta^*)\} = 1$. The type II error bound (3.9) in Assumption T implies, for fixed $\varepsilon > \varepsilon_n$, any $\theta \in \mathcal F_{n,\varepsilon}$, and any (possibly data-dependent) probability measure $q_{S^n}$,
$$ E_{\theta^*}\Big[ \sum_{s^n} q_{S^n}(s^n) \exp\Big\{ \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} \Big\}\, (1 - \phi_{n,\varepsilon}) \Big] = E_{\theta^*}\Big[ \exp\{\ell_n(\theta, \theta^*)\}\, (1 - \phi_{n,\varepsilon}) \Big] \le \exp\big\{ -c\, n\, r(\theta, \theta^*)\, I\big[ r(\theta, \theta^*) \ge \varepsilon^2 \big] \big\}. $$
Consequently, for any $\eta \in (0, 1)$,
$$ E_{\theta^*}\Big[ \sum_{s^n} q_{S^n}(s^n) \exp\Big\{ \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + c\, n\, r(\theta, \theta^*)\, I\big[ r(\theta, \theta^*) \ge \varepsilon^2 \big] - \log(1/\eta) \Big\}\, (1 - \phi_{n,\varepsilon}) \Big] \le \eta. $$
Let $P_{\theta, \mathcal F_{n,\varepsilon}}(\cdot) = P_\theta(\cdot \cap \mathcal F_{n,\varepsilon})/P_\theta(\mathcal F_{n,\varepsilon})$ denote the restriction of the prior $P_\theta$ to $\mathcal F_{n,\varepsilon}$. Integrating both sides of this inequality with respect to $P_{\theta, \mathcal F_{n,\varepsilon}}$ over $\mathcal F_{n,\varepsilon}$ and interchanging the integrals using Fubini's theorem, we obtain
$$ E_{\theta^*}\Big[ (1 - \phi_{n,\varepsilon}) \int_{\mathcal F_{n,\varepsilon}} \sum_{s^n} q_{S^n}(s^n) \exp\Big\{ \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + c\, n\, r(\theta, \theta^*)\, I\big[ r(\theta, \theta^*) \ge \varepsilon^2 \big] - \log(1/\eta) \Big\}\, P_{\theta, \mathcal F_{n,\varepsilon}}(d\theta) \Big] \le \eta. $$
Now, Lemma E.1 implies that for any $\rho \ll P_{\theta, \mathcal F_{n,\varepsilon}}$,
$$ E_{\theta^*}\Big[ (1 - \phi_{n,\varepsilon}) \exp\Big\{ \int_{\mathcal F_{n,\varepsilon}} \sum_{s^n} q_{S^n}(s^n) \Big( \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + c\, n\, r(\theta, \theta^*)\, I\big[ r(\theta, \theta^*) \ge \varepsilon^2 \big] - \log(1/\eta) \Big)\, \rho(d\theta) - D(\rho \,\|\, P_{\theta, \mathcal F_{n,\varepsilon}}) \Big\} \Big] \le \eta. $$
Taking $\rho$ to be the restriction $\widehat Q_{\mathcal F_{n,\varepsilon}}$ of $\widehat Q$ to $\mathcal F_{n,\varepsilon}$, we obtain
$$ E_{\theta^*}\Big[ (1 - \phi_{n,\varepsilon}) \exp\Big\{ \frac{1}{\widehat Q(\mathcal F_{n,\varepsilon})} \int_{\mathcal F_{n,\varepsilon}} \sum_{s^n} q_{S^n}(s^n) \Big( \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + c\, n\, r(\theta, \theta^*)\, I\big[ r(\theta, \theta^*) \ge \varepsilon^2 \big] - \log(1/\eta) \Big)\, \widehat Q(d\theta) - D(\widehat Q_{\mathcal F_{n,\varepsilon}} \,\|\, P_{\theta, \mathcal F_{n,\varepsilon}}) \Big\} \Big] \le \eta. $$
By applying Markov's inequality, we further obtain that with $P_{\theta^*}$ probability at least $(1 - \sqrt\eta)$,
$$ (1 - \phi_{n,\varepsilon}) \exp\Big\{ \frac{1}{\widehat Q(\mathcal F_{n,\varepsilon})} \int_{\mathcal F_{n,\varepsilon}} \sum_{s^n} q_{S^n}(s^n) \Big( \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + c\, n\, r(\theta, \theta^*)\, I\big[ r(\theta, \theta^*) \ge \varepsilon^2 \big] - \log(1/\eta) \Big)\, \widehat Q(d\theta) - D(\widehat Q_{\mathcal F_{n,\varepsilon}} \,\|\, P_{\theta, \mathcal F_{n,\varepsilon}}) \Big\} \le \eta^{-1/2}. $$
Denote the big exponential term in the above display by $A_n$, so that the display is equivalent to $(1 - \phi_{n,\varepsilon}) A_n \le \eta^{-1/2}$. The type I error bound in Assumption T implies that $\phi_{n,\varepsilon} \le e^{-c\, n\, \varepsilon_n^2/2}$ holds with $P_{\theta^*}$ probability at least $(1 - e^{-c\, n\, \varepsilon_n^2/2})$, implying $\phi_{n,\varepsilon} A_n \le e^{-c\, n\, \varepsilon_n^2/2} A_n$. Combining the two preceding displays, we obtain that with $P_{\theta^*}$ probability at least $(1 - 2 e^{-c\, n\, \varepsilon_n^2/2})$ (taking $\eta = e^{-c\, n\, \varepsilon_n^2}$),
$$ A_n = (1 - \phi_{n,\varepsilon}) A_n + \phi_{n,\varepsilon} A_n \le e^{c\, n\, \varepsilon_n^2/2} + e^{-c\, n\, \varepsilon_n^2/2} A_n, $$
leading to the following bound for $A_n$:
$$ A_n \le \frac{e^{c\, n\, \varepsilon_n^2/2}}{1 - e^{-c\, n\, \varepsilon_n^2/2}} \le 2\, e^{c\, n\, \varepsilon_n^2/2}. $$
Consequently, using the definition of $A_n$, we get
$$ \frac{1}{\widehat Q(\mathcal F_{n,\varepsilon})} \int_{\mathcal F_{n,\varepsilon}} \sum_{s^n} q_{S^n}(s^n) \Big( \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} + c\, n\, r(\theta, \theta^*)\, I\big[ r(\theta, \theta^*) \ge \varepsilon^2 \big] \Big)\, \widehat Q(d\theta) - D(\widehat Q_{\mathcal F_{n,\varepsilon}} \,\|\, P_{\theta, \mathcal F_{n,\varepsilon}}) \le 2\, c\, n\, \varepsilon_n^2. $$
Rearranging terms, we obtain
$$ c\, n \int_{\theta \in \mathcal F_{n,\varepsilon},\, r(\theta, \theta^*) \ge \varepsilon^2} r(\theta, \theta^*)\, \widehat Q(d\theta) - \widehat Q(\mathcal F_{n,\varepsilon})\, D(\widehat Q_{\mathcal F_{n,\varepsilon}} \,\|\, P_{\theta, \mathcal F_{n,\varepsilon}}) \le \int_{\mathcal F_{n,\varepsilon}} -\sum_{s^n} q_{S^n}(s^n) \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)}\, \widehat Q(d\theta) + \big( 2\, c\, n\, \varepsilon_n^2 \big)\, \widehat Q(\mathcal F_{n,\varepsilon}). \tag{E.1} $$
Similarly, for each $\theta \in \mathcal F^c_{n,\varepsilon}$, from the identity $E_{\theta^*}\big[ \exp\{\ell_n(\theta, \theta^*)\} \big] = 1$ and Lemma E.1, we can obtain that for any measure $\rho \ll P_{\theta, \mathcal F^c_{n,\varepsilon}}$,
$$ E_{\theta^*}\Big[ \exp\Big\{ \int_{\mathcal F^c_{n,\varepsilon}} \sum_{s^n} q_{S^n}(s^n) \Big( \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)} - \log(1/\eta) \Big)\, \rho(d\theta) - D(\rho \,\|\, P_{\theta, \mathcal F^c_{n,\varepsilon}}) \Big\} \Big] \le \eta. $$
Taking $\rho$ to be the restriction $\widehat Q_{\mathcal F^c_{n,\varepsilon}}$ of $\widehat Q$ to $\mathcal F^c_{n,\varepsilon}$ and $\eta = e^{-c\, n\, \varepsilon_n^2}$, we get that with $P_{\theta^*}$ probability at least $(1 - e^{-c\, n\, \varepsilon_n^2/2})$,
$$ \frac{1}{\widehat Q(\mathcal F^c_{n,\varepsilon})} \Big\{ \int_{\mathcal F^c_{n,\varepsilon}} \sum_{s^n} q_{S^n}(s^n) \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)}\, \widehat Q(d\theta) - D(\widehat Q_{\mathcal F^c_{n,\varepsilon}} \,\|\, P_{\theta, \mathcal F^c_{n,\varepsilon}}) \Big\} \le 2\, c\, n\, \varepsilon_n^2, $$
which implies
$$ 0 \le \int_{\mathcal F^c_{n,\varepsilon}} -\sum_{s^n} q_{S^n}(s^n) \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)}\, \widehat Q(d\theta) + \widehat Q(\mathcal F^c_{n,\varepsilon})\, D(\widehat Q_{\mathcal F^c_{n,\varepsilon}} \,\|\, P_{\theta, \mathcal F^c_{n,\varepsilon}}) + \big( 2\, c\, n\, \varepsilon_n^2 \big)\, \widehat Q(\mathcal F^c_{n,\varepsilon}). \tag{E.2} $$
Finally, by combining equations (E.1) and (E.2), and using the identity
$$ D(\widehat Q \,\|\, P_\theta) = \int \widehat q(\theta) \log \frac{\widehat q(\theta)}{\pi(\theta)}\, d\theta = \widehat Q(\mathcal F_{n,\varepsilon}) \int_{\mathcal F_{n,\varepsilon}} \widehat q_{\mathcal F_{n,\varepsilon}}(\theta) \log \frac{\widehat q_{\mathcal F_{n,\varepsilon}}(\theta)}{\pi_{\mathcal F_{n,\varepsilon}}(\theta)}\, d\theta + \widehat Q(\mathcal F^c_{n,\varepsilon}) \int_{\mathcal F^c_{n,\varepsilon}} \widehat q_{\mathcal F^c_{n,\varepsilon}}(\theta) \log \frac{\widehat q_{\mathcal F^c_{n,\varepsilon}}(\theta)}{\pi_{\mathcal F^c_{n,\varepsilon}}(\theta)}\, d\theta + \widehat Q(\mathcal F^c_{n,\varepsilon}) \log \frac{\widehat Q(\mathcal F^c_{n,\varepsilon})}{P_\theta(\mathcal F^c_{n,\varepsilon})} + \big( 1 - \widehat Q(\mathcal F^c_{n,\varepsilon}) \big) \log \frac{1 - \widehat Q(\mathcal F^c_{n,\varepsilon})}{1 - P_\theta(\mathcal F^c_{n,\varepsilon})}, $$
we have that with $P_{\theta^*}$ probability at least $(1 - 3 e^{-c\, n\, \varepsilon_n^2/2})$,
$$ \begin{aligned} & c\, n \int_{\theta \in \mathcal F_{n,\varepsilon},\, r(\theta, \theta^*) \ge \varepsilon^2} r(\theta, \theta^*)\, \widehat Q(d\theta) + \widehat Q(\mathcal F^c_{n,\varepsilon}) \log \frac{\widehat Q(\mathcal F^c_{n,\varepsilon})}{P_\theta(\mathcal F^c_{n,\varepsilon})} + \big( 1 - \widehat Q(\mathcal F^c_{n,\varepsilon}) \big) \log \frac{1 - \widehat Q(\mathcal F^c_{n,\varepsilon})}{1 - P_\theta(\mathcal F^c_{n,\varepsilon})} \\ &\qquad \le \int -\sum_{s^n} q_{S^n}(s^n) \log \frac{p(Y^n|\mu, s^n)\, \pi_{s^n}}{p(Y^n|\theta^*)\, q_{S^n}(s^n)}\, \widehat Q(d\theta) + D(\widehat Q \,\|\, P_\theta) + 2\, c\, n\, \varepsilon_n^2 = \Psi(\widehat q_\theta, q_{S^n}) + 2\, c\, n\, \varepsilon_n^2. \tag{E.4} \end{aligned} $$
As a consequence, the first claimed bound follows by taking $q_{S^n} = \widehat q_{S^n}$ and using the fact that $(\widehat q_\theta, \widehat q_{S^n})$ minimizes $\Psi(q_\theta, q_{S^n})$ over the variational family.

E.4 Proof of Theorem 3.6

Similar to the proof of Theorem 3.4, under Assumption P there exist an event $A_n$ satisfying $P_{\theta^*}(A_n) \ge 1 - 2/\{(D-1)^2\, n\, \varepsilon_n^2\}$ and measures $(Q^*_\theta, q^*_{S^n})$ such that, under this event,
$$ \Psi(Q^*_\theta, q^*_{S^n}) \le D\, n\, \varepsilon_n^2. $$
For any fixed $\varepsilon \ge \varepsilon_n$, denote the event on which the result of Theorem 3.5 holds by $B_\varepsilon$; consequently, $P_{\theta^*}(B_\varepsilon) \ge 1 - 3 e^{-c\, n\, \varepsilon_n^2/2}$, and under the event $A_n \cap B_\varepsilon$ we have
$$ \Big\{ \widehat Q(\mathcal F^c_{n,\varepsilon}) \log \frac{\widehat Q(\mathcal F^c_{n,\varepsilon})}{P_\theta(\mathcal F^c_{n,\varepsilon})} + \big( 1 - \widehat Q(\mathcal F^c_{n,\varepsilon}) \big) \log \frac{1 - \widehat Q(\mathcal F^c_{n,\varepsilon})}{1 - P_\theta(\mathcal F^c_{n,\varepsilon})} \Big\} + c\, n \int_{\theta \in \mathcal F_{n,\varepsilon},\, r(\theta, \theta^*) \ge \varepsilon^2} r(\theta, \theta^*)\, \widehat Q(d\theta) \le C\, n\, \varepsilon_n^2, $$
where $C > 0$ is a constant independent of $n$ and $\varepsilon$. Since both terms on the l.h.s.
of the above display are nonnegative, we obtain that
$$ \widehat Q\big( \theta \in \mathcal F_{n,\varepsilon},\, r(\theta, \theta^*) \ge \varepsilon^2 \big) \le \varepsilon^{-2} \int_{\theta \in \mathcal F_{n,\varepsilon},\, r(\theta, \theta^*) \ge \varepsilon^2} r(\theta, \theta^*)\, \widehat Q(d\theta) \le \frac{C' \varepsilon_n^2}{\varepsilon^2}, \qquad \text{and} \qquad \widehat Q(\mathcal F^c_{n,\varepsilon}) \le \frac{C'' \varepsilon_n^2}{\varepsilon^2}, $$
for some constants $C', C'' > 0$. Here, the second inequality holds by using $P_\theta(\mathcal F^c_{n,\varepsilon}) \le e^{-c\, n\, \varepsilon^2}$ (Assumption T) and the inequality $x \log x + (1-x) \log(1-x) \ge -\log 2$ for $x \in (0, 1)$. Applying these bounds with $\varepsilon^2 = k\, \varepsilon_n^2$ for $k = 1, 2, \dots, \lfloor e^{c n \varepsilon_n^2/2} \rfloor$, and using a union bound, we obtain that the following holds with probability at least $1 - D^{-2} n^{-1} \varepsilon_n^{-2} - e^{-c n \varepsilon_n^2/4} \ge 1 - 2 D^{-2} n^{-1} \varepsilon_n^{-2}$:
$$ \widehat Q\big( \theta \in \mathcal F_{n,\varepsilon},\, r(\theta, \theta^*) \ge \varepsilon^2 \big) \le \frac{C' \varepsilon_n^2}{\varepsilon^2} \quad \text{and} \quad \widehat Q(\mathcal F^c_{n,\varepsilon}) \le \frac{C'' \varepsilon_n^2}{\varepsilon^2}, \qquad \text{for all } \varepsilon^2 = k\, \varepsilon_n^2,\; k = 1, 2, \dots, \lfloor e^{c n \varepsilon_n^2/2} \rfloor. $$
Note that the preceding display implies
$$ \widehat Q\big( r(\theta, \theta^*) \ge \varepsilon^2 \big) \le \widehat Q\big( \theta \in \mathcal F_{n,\varepsilon},\, r(\theta, \theta^*) \ge \varepsilon^2 \big) + \widehat Q(\mathcal F^c_{n,\varepsilon}) \le \frac{(C' + C'')\, \varepsilon_n^2}{\varepsilon^2}. $$
For general $\varepsilon^2 \in \big[ \varepsilon_n^2,\; e^{c n \varepsilon_n^2/2}\, \varepsilon_n^2 \big)$, we can always find an integer $k^*$ such that $k^* \varepsilon_n^2 \le \varepsilon^2 < (k^* + 1)\, \varepsilon_n^2$. Using the monotonicity of $\widehat Q(r(\theta, \theta^*) \ge \varepsilon^2)$ in $\varepsilon$, we obtain
$$ \widehat Q\big( r(\theta, \theta^*) \ge \varepsilon^2 \big) \le \widehat Q\big( r(\theta, \theta^*) \ge k^* \varepsilon_n^2 \big) \le (C' + C'')\, \frac{1}{k^*} \le \frac{C_3\, \varepsilon_n^2}{\varepsilon^2}. $$
The second claimed bound follows from
$$ \int_{\theta:\, r(\theta, \theta^*) \le R} r(\theta, \theta^*)\, \widehat Q(d\theta) = \int_0^R \widehat Q\big( r(\theta, \theta^*) \ge t \big)\, dt \le \varepsilon_n^2 + \int_{\varepsilon_n^2}^R \widehat Q\big( r(\theta, \theta^*) \ge t \big)\, dt \le \varepsilon_n^2 \Big( 1 + C_3 \int_{\varepsilon_n^2}^R t^{-1}\, dt \Big) \le C_4\, \varepsilon_n^2 \log\big( R/\varepsilon_n^2 \big). $$

E.5 Proof of Theorem 3.7
According to the proof of Theorem 3.1, we have that for any $\zeta \in (0, 1)$ and any $(q_\theta, q_{S^n})$ in the variational family, with probability at least $(1 - \zeta)$,
$$ \int \frac{1}{n}\, D^{(n)}_\alpha(\theta, \theta^*)\, q_\theta(\theta)\, d\theta \le \frac{\alpha}{n(1-\alpha)}\, \Psi_\alpha(q_\theta, q_{S^n}) + \frac{1}{n(1-\alpha)} \log(1/\zeta). $$
Since $\bar\Psi_\alpha$ is an upper bound on $\Psi_\alpha$, the above implies
$$ \int \frac{1}{n}\, D^{(n)}_\alpha(\theta, \theta^*)\, q_\theta(\theta)\, d\theta \le \frac{\alpha}{n(1-\alpha)}\, \bar\Psi_\alpha(q_\theta, q_{S^n}) + \frac{1}{n(1-\alpha)} \log(1/\zeta). $$
Now choosing $(q_\theta, q_{S^n})$ as $(\bar q_\theta, \bar q_{S^n})$ in the above, the claimed bound follows since
$$ (\bar q_\theta, \bar q_{S^n}) = \operatorname*{argmin}_{q_\theta,\, q_{S^n}} \bar\Psi_\alpha(q_\theta, q_{S^n}). $$

E.6 Proof of Corollary 4.1
For the linear model, we have
$$ B_n(\theta^*, \varepsilon) \supset \Big\{ \theta = (\beta, \sigma^2) : (2n\sigma^2)^{-1} \|X(\beta - \beta^*)\|^2 + \big( (\sigma^*)^2/\sigma^2 - 1 - \log[(\sigma^*)^2/\sigma^2] \big)/2 \le \varepsilon^2 \Big\}. $$
Therefore, we may take the neighborhood $N_n(\theta^*, \varepsilon)$ to be the product set $\{\beta : n^{-1}(\sigma^*)^{-2} \|X(\beta - \beta^*)\|^2 \le c_1 \varepsilon^2\} \times \{\sigma^2 : |\sigma^2 - (\sigma^*)^2| \le c_2 \varepsilon^2\}$, for some sufficiently small constants $(c_1, c_2)$ such that $N_n(\theta^*, \varepsilon) \subset B_n(\theta^*, \varepsilon)$. In addition, due to the product form of $N_n(\theta^*, \varepsilon)$, the probability density function $q^*_\theta$ defined as $q^*_\theta \propto I_{N_n(\theta^*, \varepsilon)}$ belongs to the mean-field approximation family $\Gamma$. Consequently, we may apply Theorem 3.3 to obtain (noting that the volume of the neighborhood $\{\beta : n^{-1}(\sigma^*)^{-2} \|X(\beta - \beta^*)\|^2 \le c_1 \varepsilon^2\}$ is of order $(d/\varepsilon)^{-d}$)
$$ \int \Big\{ \frac{1}{n}\, D^{(n)}_\alpha(\theta, \theta^*) \Big\}\, \widehat q_\theta(\theta)\, d\theta \lesssim \frac{\alpha}{1-\alpha}\, \varepsilon^2 + \frac{d}{n(1-\alpha)} \log \frac{d}{\varepsilon}. $$
Setting $\varepsilon = \sqrt{d/n}$ in the preceding inequality and using the fact that $\min\{\alpha/(1-\alpha),\, 1\}\, h^2(p \,\|\, q) \le D_\alpha(p \,\|\, q)$ for any densities $p$ and $q$ yields the claimed bound.

E.7 Proof of Corollary 4.2
Similar to the proof of Corollary 4.1, we choose $N_n(\theta^*, \varepsilon)$ to be the product set $\{\beta : \beta_{(z^*)^c} = 0,\; n^{-1}(\sigma^*)^{-2} \|X(\beta - \beta^*)\|^2 \le c_1 \varepsilon^2\} \times \{\sigma^2 : |\sigma^2 - (\sigma^*)^2| \le c_2 \varepsilon^2\}$, and define the joint measure
$$ q^*_\theta \otimes q^*_{z^*} \propto I_{N_n(\theta^*, \varepsilon)} \otimes \delta_{z^*}, $$
which belongs to the mean-field approximation family $\Gamma$. Now, by applying Theorem 3.3 with parameter $\theta = (\beta, \sigma^2, z)$, we obtain (by replacing $d$ with $s$ in the proof of Corollary 4.1 for the $\beta$ part)
$$ \int \Big\{ \frac{1}{n}\, D^{(n)}_\alpha(\theta, \theta^*) \Big\}\, \widehat q_\theta(\theta)\, d\theta \lesssim \frac{\alpha}{1-\alpha}\, \varepsilon^2 + \frac{s}{n(1-\alpha)} \log \frac{s}{\varepsilon} + \frac{1}{n(1-\alpha)}\, s \log d, $$
where the last term is due to $-\log p_z(z^*) \asymp s \log d$. Setting $\varepsilon = \sqrt{s/n}$ leads to the claimed bound.

E.8 Proof of Corollary 4.3
The first claimed bound is a direct consequence of applying Theorem 3.7 (with no latent variables) to the new ELBO $\bar L(q)$. The second bound can be obtained by applying the first claimed inequality (4.3) (taking $w_1 = 1$, $w_2 = \dots = w_J = 0$ to reduce the bound to that of a single Gaussian variational approximation) and the arguments in Corollary D.1 (for a single Gaussian variational approximation).

E.9 Proof of Corollary 4.4
It is easy to verify that under Assumption R, there exists some constant $C$ depending only on $\delta$ such that $B_n(\pi^*, \sqrt K\, \varepsilon) \supset \{\pi : \max_k |\pi_k - \pi^*_k| \le C\varepsilon\}$ (by using the inequality $D(p \,\|\, q) \ge h^2(p \,\|\, q)$). In addition, for the Gaussian mixture model, it is easy to verify that the KL neighborhood $B_n(\mu^*, \varepsilon)$ defined before Theorem 3.4 contains the set $\{\mu : \max_k \|\mu_k - \mu^*_k\| \le \varepsilon\}$. As a consequence, a direct application of Theorem 3.4 with $\varepsilon_\pi = \sqrt K\, \varepsilon$ and $\varepsilon_\mu = \varepsilon$ yields (using the prior thickness assumption and the fact that the volumes of $\{\pi : \max_k |\pi_k - \pi^*_k| \le C\varepsilon\}$ and $\{\mu : \max_k \|\mu_k - \mu^*_k\| \le C\varepsilon\}$ are at least of order $\varepsilon^K$ and $(\varepsilon/\sqrt d)^{dK}$ respectively) that with probability tending to one as $n \to \infty$,
$$ \int \Big\{ D_\alpha\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big] \Big\}\, \widehat q_\theta(\theta)\, d\theta \lesssim \frac{\alpha}{1-\alpha}\, K\varepsilon^2 + \frac{dK}{n(1-\alpha)} \log \frac{d}{\varepsilon}. $$
Choosing $\varepsilon = \sqrt{d/n}$ in the above display yields the claimed bound.

E.10 Proof of Corollary 4.5
Under the notation of Theorem 3.1 and Corollary 3.2, for each $n = 1, \dots, N$ the latent variable is $S_n = \{z_{dn} : d = 1, \dots, D\}$, and we use an extended version of Corollary 3.2, from one latent variable per observation to $D$ independent latent variables per observation. In fact, arguments similar to the proofs of Theorem 3.1 and Corollary 3.2 yield that, for the ensemble of KL neighborhoods $\{B_N(\gamma^*_d; \varepsilon_{\gamma_d}) : d = 1, \dots, D\}$ of $\{\gamma^*_d : d = 1, \dots, D\}$, where
$$ B_N(\gamma^*_d; \varepsilon_{\gamma_d}) := \big\{ D(\gamma^*_d \,\|\, \gamma_d) \le \varepsilon_{\gamma_d}^2, \; V(\gamma^*_d \,\|\, \gamma_d) \le \varepsilon_{\gamma_d}^2 \big\}, \qquad d = 1, \dots, D, $$
it holds with probability tending to one as $N \to \infty$ that
$$ \int \Big\{ D_\alpha\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big] \Big\}\, \widehat q_\theta(\theta)\, d\theta \le \frac{D\alpha}{1-\alpha} \Big( \sum_{d=1}^D \varepsilon_{\gamma_d}^2 + \varepsilon_\mu^2 \Big) + \Big\{ -\frac{1}{N(1-\alpha)} \sum_{d=1}^D \log P_{\gamma_d}\big[ B_N(\gamma^*_d, \varepsilon_{\gamma_d}) \big] \Big\} + \Big\{ -\frac{1}{N(1-\alpha)} \log P_\mu\big[ B_N(\mu^*, \varepsilon_\mu) \big] \Big\}, $$
where $B_N(\mu^*, \varepsilon_\mu) = \big\{ \max_{S_n} D\big[ p(\cdot|\mu^*, S_n) \,\|\, p(\cdot|\mu, S_n) \big] \le \varepsilon_\mu^2, \; \max_{S_n} V\big[ p(\cdot|\mu^*, S_n) \,\|\, p(\cdot|\mu, S_n) \big] \le \varepsilon_\mu^2 \big\}$.

Recall that each observation $Y_n$ is composed of i.i.d. observations $\{w_{dn} : d = 1, \dots, D\}$, where the conditional distribution of $w_{dn}$ given the latent variable $\{z_{dn} = k\}$ only depends on $\beta_k$, for $d = 1, \dots, D$ and $k = 1, \dots, K$. Therefore, when applied to LDA, the preceding display can be further simplified to
$$ \int \Big\{ D_\alpha\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big] \Big\}\, \widehat q_\theta(\theta)\, d\theta \le \frac{D\alpha}{1-\alpha} \Big( \sum_{d=1}^D \varepsilon_{\gamma_d}^2 + \sum_{k=1}^K \varepsilon_{\beta_k}^2 \Big) + \Big\{ -\frac{1}{N(1-\alpha)} \sum_{d=1}^D \log P_{\gamma_d}\big[ B_N(\gamma^*_d, \varepsilon_{\gamma_d}) \big] \Big\} + \Big\{ -\frac{1}{N(1-\alpha)} \sum_{k=1}^K \log P_{\beta_k}\big[ B_N(\beta^*_k, \varepsilon_{\beta_k}) \big] \Big\}, \tag{E.5} $$
where $B_N(\beta^*_k, \varepsilon_{\beta_k}) = \big\{ \max_k D\big[ p(\cdot|\beta^*_k, k) \,\|\, p(\cdot|\beta_k, k) \big] \le \varepsilon_{\beta_k}^2, \; \max_k V\big[ p(\cdot|\beta^*_k, k) \,\|\, p(\cdot|\beta_k, k) \big] \le \varepsilon_{\beta_k}^2 \big\}$.

Returning to the proof of the corollary, let $S_{\beta_k}$ denote the index set corresponding to the non-zero components of $\beta_k$ for $k = 1, \dots, K$, and $S_{\gamma_d}$ the index set corresponding to the non-zero components of $\gamma_d$ for $d = 1, \dots, D$. Under Assumption S, it is easy to verify that for some sufficiently small constants $c_1, c_2 >$
0, it holds for all $d = 1, \dots, D$ that
$$ B_N(\gamma^*_d, \varepsilon_{\gamma_d}) \supset \big\{ \|(\gamma_d)_{(S_{\gamma_d})^c}\|_1 \le c_1 \varepsilon_{\gamma_d}^2, \; \|(\gamma_d)_{S_{\gamma_d}} - (\gamma^*_d)_{S_{\gamma_d}}\|_\infty \le c_2 \varepsilon_{\gamma_d}^2 \big\}, $$
and for all $k = 1, \dots, K$ that
$$ B_N(\beta^*_k, \varepsilon_{\beta_k}) \supset \big\{ \|(\beta_k)_{(S_{\beta_k})^c}\|_1 \le c_1 \varepsilon_{\beta_k}^2, \; \|(\beta_k)_{S_{\beta_k}} - (\beta^*_k)_{S_{\beta_k}}\|_\infty \le c_2 \varepsilon_{\beta_k}^2 \big\}. $$
Applying Theorem 2.1 in [44], we obtain the following prior concentration bounds for high-dimensional Dirichlet priors:
$$ P_{\gamma_d}\big\{ \|(\gamma_d)_{(S_{\gamma_d})^c}\|_1 \le c_1 \varepsilon_{\gamma_d}^2, \; \|(\gamma_d)_{S_{\gamma_d}} - (\gamma^*_d)_{S_{\gamma_d}}\|_\infty \le c_2 \varepsilon_{\gamma_d}^2 \big\} \gtrsim \exp\Big\{ -C\, e_d \log \frac{K}{\varepsilon_{\gamma_d}} \Big\}, \quad d = 1, \dots, D; $$
$$ P_{\beta_k}\big\{ \|(\beta_k)_{(S_{\beta_k})^c}\|_1 \le c_1 \varepsilon_{\beta_k}^2, \; \|(\beta_k)_{S_{\beta_k}} - (\beta^*_k)_{S_{\beta_k}}\|_\infty \le c_2 \varepsilon_{\beta_k}^2 \big\} \gtrsim \exp\Big\{ -C\, d_k \log \frac{V}{\varepsilon_{\beta_k}} \Big\}, \quad k = 1, \dots, K. $$
Putting the pieces together, we obtain
$$ \int \Big\{ D_\alpha\big[ p(\cdot|\theta) \,\|\, p(\cdot|\theta^*) \big] \Big\}\, \widehat q_\theta(\theta)\, d\theta \lesssim \frac{\alpha}{1-\alpha} \Big( \sum_{d=1}^D \varepsilon_{\gamma_d}^2 + \sum_{k=1}^K \varepsilon_{\beta_k}^2 \Big) + \frac{1}{N(1-\alpha)} \sum_{d=1}^D e_d \log \frac{K}{\varepsilon_{\gamma_d}} + \frac{1}{N(1-\alpha)} \sum_{k=1}^K d_k \log \frac{V}{\varepsilon_{\beta_k}}, $$
which is the desired result.

F Extension of examples to α = 1

In this section, we briefly discuss the verification of Assumption T (the choice of loss function and the constructions of the test function $\phi_{n,\varepsilon}$ and sieve $\mathcal F_{n,\varepsilon}$) in the examples of the paper, which implies the variational risk bound through an application of Theorem 3.6.

Mean field approximation to low-dimensional Bayesian linear regression:
To simplify the presentation, we assume the priors on $\beta$ and $\sigma^2$ satisfy $P_\beta(\|\beta\|_2 \ge R) \le C R^{-c d}$ and $P_{\sigma^2}(\sigma^2 \in [a, b]) = 1$, where $[a, b]$ contains the truth $(\sigma^*)^2$. In addition, the design matrix $X$ satisfies that the minimal eigenvalue of $n^{-1} X^T X$ is bounded away from zero. Recall that in this example, $\theta = (\beta, \sigma^2)$. Under these two assumptions, it can be proved that Assumption T holds with $\phi_{n,\varepsilon}$ the likelihood ratio test $\phi_{n,\varepsilon} = I(\ell(\beta, \beta^*) \ge C' n \varepsilon^2)$, sieve $\mathcal F_{n,\varepsilon} = \{\|\beta\|_2 \le \exp(C'' d^{-1} n \varepsilon^2)\} \times [a, b]$, and loss function $r(\beta, \beta^*) = \|\beta - \beta^*\|_2^2$, for all $\varepsilon \ge \varepsilon_n$ with $\varepsilon_n^2 = d \log n / n$, and sufficiently large constants $C', C'' > 0$.
Similar to the previous example, we assume that, given $z$, the conditional prior of $\beta$ satisfies $P_{\beta|z}(\|\beta\|_2 \ge R \,|\, z) \le C R^{-c|z|}$, where $|z|$ is the size of the binary vector $z$, and that the prior on $\sigma^2$ satisfies $P_{\sigma^2}(\sigma^2 \in [a, b]) = 1$. In addition, we make the sparse eigenvalue assumption that there exists some sufficiently large $C >$
0, such that for any $Cs$-sparse vector $u$, $n^{-1} \|Xu\|^2 / \|u\|^2 \ge \mu >$
0. Recall that in this example, $\theta = (z, \beta, \sigma^2)$. Under these assumptions, it can be verified that Assumption T holds with $\phi_{n,\varepsilon}$ the likelihood ratio test $\phi_{n,\varepsilon} = I(\ell(\beta, \beta^*) \ge C' n \varepsilon^2)$, sieve
$$ \mathcal F_{n,\varepsilon} = \bigcup_{z :\, |z| \le C'' s} \Big[ \{z\} \times \big\{ \beta_{z^c} = 0,\; \|\beta - \beta^*\|_2 \le \exp(C''' s^{-1} n \varepsilon^2) \big\} \times [a, b] \Big], $$
and loss function $r(\beta, \beta^*) = \|\beta - \beta^*\|_2^2$, for all $\varepsilon \ge \varepsilon_n$ with $\varepsilon_n^2 = s \log(nd)/n$, and sufficiently large constants $C', C'', C''' > 0$.

Remaining examples:
In all the remaining examples, the parameter space $\Theta$ of $\theta$ is compact. In this case, we can simply take $\mathcal F_{n,\varepsilon} = \Theta$ for all $\varepsilon$ (so that $P_\theta(\mathcal F^c_{n,\varepsilon}) = 0$), and apply a general recipe [18] to construct the tests: (i) construct an $\varepsilon^2/2$-net $N = \{\theta_1, \dots, \theta_N\}$ such that for any $\theta$ with $r(\theta, \theta^*) > \varepsilon^2$ there exists $\theta_j \in N$ with $r(\theta, \theta_j) < \varepsilon^2/2$; (ii) construct a test $\phi_{n,j}$ for $H_0 : \theta = \theta^*$ versus $H_1 : \theta = \theta_j$ with type I and type II error rates as in Assumption T; and (iii) set $\phi_n = \max_{1 \le j \le N} \phi_{n,j}$. The type II error of $\phi_n$ retains the same upper bound, while the type I error can be bounded by $N e^{-n\varepsilon^2}$. Since $N$ can be further bounded by $N(\Theta, \varepsilon^2/2, r)$, the covering number of $\Theta$ by $r$-balls of radius $\varepsilon^2/2$, it suffices to show that $N(\Theta, \varepsilon^2/2, r) \lesssim e^{n\varepsilon^2}$. When $\Theta$ is a compact subset of $\mathbb R^d$ and $r(\theta, \theta^*) \gtrsim \|\theta - \theta^*\|^2$ (the squared Euclidean metric), then $N(\Theta, \varepsilon^2, r) \lesssim \varepsilon^{-d} \lesssim e^{n\varepsilon^2}$ as long as $\varepsilon \gtrsim \sqrt{\log n / n}$. More generally, if $\Theta$ is a space of densities and $r$ the squared Hellinger/$L_1$ metric, then the construction of the point-by-point tests in (i) from likelihood ratio test statistics follows from the classical Birgé–LeCam testing theory [6, 28]; see also [18].

To summarize, in these examples with compact parameter space, Assumption T holds with $\mathcal F_{n,\varepsilon} = \Theta$, $r(\theta, \theta^*) = h^2(p(\cdot|\theta), p(\cdot|\theta^*))$, the squared Hellinger distance between $p(\cdot|\theta)$ and $p(\cdot|\theta^*)$, and $\phi_{n,\varepsilon}$ the likelihood ratio test function, for all $\varepsilon \gtrsim \sqrt{\log n / n}$. Moreover, the rate $\varepsilon_n$ is determined by the respective prior concentration Assumption P.

References

[1] Amr Ahmed, Mohamed Aly, Joseph Gonzalez, Shravan Narayanamurthy, and Alexander Smola. Scalable inference in latent variable models. In International Conference on Web Search and Data Mining (WSDM), volume 51, pages 1257–1264, 2012.

[2] Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. arXiv preprint arXiv:1706.09293, 2017.

[3] Hagai Attias. A variational Bayesian framework for graphical models. In Advances in Neural Information Processing Systems, pages 209–215, 2000.
[4] Anirban Bhattacharya, Debdeep Pati, and Yun Yang. Bayesian fractional posteriors. arXiv preprint arXiv:1611.01125, 2016.

[5] Peter J. Bickel, Ya'acov Ritov, and Alexandre B. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.

[6] Lucien Birgé. Approximation dans les espaces métriques et théorie de l'estimation. Probability Theory and Related Fields, 65(2):181–237, 1983.

[7] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.

[8] P. G. Bissiri, C. C. Holmes, and S. G. Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2016.

[9] David M. Blei. Probabilistic topic models. Communications of the ACM, 55(4):77–84, 2012.

[10] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, (just-accepted), 2017.

[11] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.

[12] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. OUP Oxford, 2013.

[13] Peter Carbonetto and Matthew Stephens. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association studies. Bayesian Analysis, 7(1):73–108, 2012.

[14] Adrian Corduneanu and Christopher M. Bishop. Variational Bayesian model selection for mixture distributions. In Artificial Intelligence and Statistics, volume 2001, pages 27–34. Waltham, MA: Morgan Kaufmann, 2001.

[15] Alan E. Gelfand and Adrian F. M. Smith. Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409, 1990.

[16] Edward I. George and Robert E. McCulloch. Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.

[17] S. Ghosal and A. van der Vaart. Convergence rates of posterior distributions for non-i.i.d. observations. Ann. Statist., 35(1):192–223, 2007.

[18] Subhashis Ghosal, Jayanta K. Ghosh, and Aad W. van der Vaart. Convergence rates of posterior distributions. Ann. Statist., 28(2):500–531, 2000.

[19] D. Harmon. Overview of the first Text REtrieval Conference (TREC-1). NIST Special Publication 500-207.

[20] W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

[21] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.

[22] Xichen Huang, Jin Wang, and Feng Liang. A variational algorithm for Bayesian variable selection. arXiv preprint arXiv:1602.07640, 2016.

[23] K. Humphreys and D. M. Titterington. Approximate Bayesian inference for simple mixtures. In Proc. Computational Statistics 2000, pages 331–336, 2000.

[24] W. Jiang and M. A. Tanner. Gibbs posterior for variable selection in high-dimensional classification and data mining. Ann. Statist., pages 2207–2231, 2008.

[25] Michael I. Jordan, Zoubin Ghahramani, Tommi S. Jaakkola, and Lawrence K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.

[26] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[27] Harold J. Kushner and G. George Yin. Stochastic Approximation Algorithms and Applications. 1997.

[28] Lucien LeCam. Convergence of estimates under dimensionality restrictions. The Annals of Statistics, pages 38–53, 1973.

[29] Yingzhen Li and Richard E. Turner. Rényi divergence variational inference. In Advances in Neural Information Processing Systems, 2016.

[30] David J. C. MacKay. Ensemble learning for hidden Markov models.

[31] John T. Ormerod, Chong You, and Samuel Müller. A variational Bayes approach to variable selection.

[32] Debdeep Pati, Anirban Bhattacharya, and Yun Yang. On statistical optimality of variational Bayes. AISTATS, to appear, 2017.

[33] Christian P. Robert. Monte Carlo Methods. Wiley Online Library, 2004.

[34] Judith Rousseau. On the frequentist properties of Bayesian nonparametric methods. Annual Review of Statistics and Its Application, 3:211–231, 2016.

[35] Naonori Ueda and Zoubin Ghahramani. Bayesian model search for mixture models based on optimizing variational bounds. Neural Networks, 15(10):1223–1241, 2002.

[36] Aad W. van der Vaart. Asymptotic Statistics, volume 3. Cambridge University Press, 2000.

[37] T. van Erven and P. Harremos. Rényi divergence and Kullback–Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.

[38] Martin J. Wainwright and Michael I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.

[39] S. Walker and N. L. Hjort. On Bayesian consistency. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(4):811–821, 2001.

[40] Bo Wang and D. M. Titterington. Lack of consistency of mean field and variational Bayes approximations for state space models. Neural Processing Letters, 20(3):151–170, 2004.

[41] Bo Wang and D. M. Titterington. Inadequacy of interval estimates corresponding to variational Bayesian approximations. In AISTATS, 2005.

[42] Bo Wang and D. M. Titterington. Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Analysis, 1(3):625–650, 2006.

[43] Ted Westling and Tyler H. McCormick. Establishing consistency and improving uncertainty estimates of variational inference through M-estimation. arXiv preprint arXiv:1510.08151, 2015.

[44] Yun Yang and David B. Dunson. Minimax optimal Bayesian aggregation. arXiv preprint arXiv:1403.1345, 2014.

[45] Chong You, John T. Ormerod, and Samuel Müller. On variational Bayes estimation and variational information criteria for linear regression models. Australian & New Zealand Journal of Statistics, 56(1):73–87, 2014.

[46] Fengshuo Zhang and Chao Gao. Convergence rates of variational posterior distributions. arXiv preprint arXiv:1712.02519, 2017.

[47] O. Zobay et al. Variational Bayesian inference with Gaussian-mixture approximations.