Evidence bounds in singular models: probabilistic and variational perspectives
EEvidence bounds in singular models: probabilistic and variationalperspectives
Anirban Bhattacharya ∗ , Debdeep Pati † , and Sean Plummer ‡ Department of Statistics, Texas A&M University, College Station, Texas, 77843, USAAugust 12, 2020
Abstract
The marginal likelihood or evidence in Bayesian statistics contains an intrinsic penalty forlarger model sizes and is a fundamental quantity in Bayesian model comparison. Over the pasttwo decades, there has been steadily increasing activity to understand the nature of this penaltyin singular statistical models, building on pioneering work by Sumio Watanabe. Unlike regularmodels where the Bayesian information criterion (BIC) encapsulates a first-order expansion ofthe logarithm of the marginal likelihood, parameter counting gets trickier in singular modelswhere a quantity called the real log canonical threshold (RLCT) summarizes the effective modeldimensionality. In this article, we offer a probabilistic treatment to recover non-asymptotic ver-sions of established evidence bounds as well as prove a new result based on the Gibbs variationalinequality. In particular, we show that mean-field variational inference correctly recovers theRLCT for any singular model in its canonical or normal form. We additionally exhibit sharp-ness of our bound by analyzing the dynamics of a general purpose coordinate ascent algorithm(CAVI) popularly employed in variational inference.
Keywords:
Bayesian; Coordinate ascent; Gibbs variational inequality; Laplace approximation;Mean-field approximation; Real log canonical threshold ∗ [email protected] † [email protected] ‡ [email protected] a r X i v : . [ m a t h . S T ] A ug Introduction
Let X ( n ) = ( X , . . . , X n ) T denote n independent and identically distributed observations from aprobability density function f ( · | θ (cid:63) ). A Bayesian analysis in this setting proceeds by setting up(i) a statistical model consisting of a family of probability distributions { p ( · | ξ ) : ξ ∈ Ω } for theindividual observations, indexed by a parameter ξ taking values in the parameter space Ω ⊆ R d ,and (ii) a prior (probability) distribution ϕ ( · ) on Ω. The posterior distribution is given byΠ( ξ | X ( n ) ) = e (cid:96) n ( ξ ) ϕ ( ξ ) m ( X ( n ) ) , (cid:96) n ( ξ ) : = n (cid:88) i =1 log p ( X i | ξ ) , (1.1)with (cid:96) n ( ξ ) the log-likelihood function. The marginal likelihood or evidence m ( X ( n ) ) = (cid:90) Ω e (cid:96) n ( ξ ) ϕ ( ξ ) dξ (1.2)is a fundamental object in Bayesian model comparison [Robert, 2007], which encapsulates an in-trinsic penalty for model complexity, and can be readily used to compare models with differentparameter dimensions. However, barring conjugate settings this integral is rarely available inclosed-form, necessitating approximate methods.A classical approach is to make analytic approximations, of which the Laplace approximation[Kass et al., 1990, Schwarz, 1978, Tierney and Kadane, 1986] is the most prominent. In regular parametric models, under mild assumptions, the Laplace approximation to the marginal likelihoodtakes the form log m ( X ( n ) ) = (cid:96) n ( (cid:98) ξ n ) − d log n R n , (1.3)where (cid:98) ξ n is the maximum likelihood estimate for ξ based on X ( n ) , d is the parameter dimension,and the remainder term R n is bounded in magnitude by a constant free of n with high probability.The quantity 2(log (cid:96) n ( (cid:98) ξ n ) − d log n/
2) is the celebrated Bayesian information criterion (BIC).The usual notion of a regular statistical model entails ξ (cid:55)→ p ( · | ξ ) is one-one and the Fisherinformation matrix E [( ∂ /∂ξ ) log f ( · | ξ )] is positive definite for all ξ ∈ Ω. In this article, ourfocus will be on singular statistical models, where at least one of the conditions for regularity2re not met. Some common examples of singular models include mixture models, factor models,hidden Markov models, latent class analysis, neural networks etc. to name a few; see Drtonand Plummer [2017] for a more comprehensive list. As a simple concrete illustration, suppose p ( x | ξ ) = α N ( x ; 0 ,
1) + (1 − α ) N ( x ; µ,
1) with ξ = ( α, µ ) ∈ [0 , × R . The map ξ (cid:55)→ p ( · | ξ ) isclearly not one-one as the entire region Ω : = { } × R ∪ [0 , × { } inside the parameter space getmapped to the N (0 ,
1) distribution. The Fisher information matrix is also not positive definite onΩ . The derivation of the Laplace approximation proceeds by localizing the integral (1.3) to a neigh-borhood of the maximum likelihood estimate (or the posterior mode) and subsequently applying asecond-order Taylor series expansion of the log-likelihood around (cid:98) ξ n to reduce the integral Eq. (1.3)to a Gaussian integral. It should perhaps then be intuitive that this approximation will face diffi-culties for singular models where the Hessian matrix can be singular. This is indeed the case andcan be verified via simulation in a straightforward manner; see, e.g., the instructive Example 1of Drton and Plummer [2017]. However, finding the precise asymptotic behavior of the marginallikelihood for general singular models is a highly non-trivial exercise. The foundational groundworkfor a general theory has been laid in a series of seminal contributions by Watanabe [Watanabe,1999, 2001a,b], with much of the subsequent development condensed into book-level treatmentsin Watanabe [2009, 2018]. We also refer the reader to Shaowei Lin’s thesis [Lin, 2011] and thebackground section of Drton and Plummer [2017] for lucid summaries of this beautiful theory.Watanabe shows that in singular settings, a more general version of the Laplace approximationis given by log m ( X ( n ) ) = (cid:96) n ( ξ (cid:63) ) − λ log n + ( m −
1) log(log n ) + R n , (1.4)assuming that the data is generated from p ( · | ξ (cid:63) ). The stochastic error term R n is O P (cid:63) (1) as before.The quantity λ ∈ (0 , d/
2] is called the real log-canonical threshold (RLCT) and the integer m ≥ multiplicity . Only when λ = d/ m = 1, one recovers the usual Laplace approximation as aspecial case of the expansion (1.4). However, in general, the usual Laplace approximation no longerprovides a correct approximation to the log evidence. For more on model selection in singularsettings, we refer the reader to Drton and Plummer [2017], Watanabe [2013]. Over the years, there3as been a growing literature on determining (or bounding) λ for specific singular statistical models;see Aoyagi [2010, 2019], Aoyagi and Watanabe [2005], Drton et al. [2017], Hayashi and Watanabe[2017], Rusakov and Geiger [2005], Yamazaki and Watanabe [2003] for a flavor of this literature.Watanabe’s derivation of Eq. (1.4) has two major ingredients. First, the parameter space ispartitioned and parameter transformations are performed to express the integrand in Eq. (1.2) overeach partition to a more manageable normal crossing (or simply, normal) form. The existenceof such partitions and parameter transformations is guaranteed by a famous result in algebraicgeometry due to Hironaka [Hironaka, 1964] on the resolution of singularities. Watanabe thenanalyzes the asymptotic order of a generic integrand in normal form using complex analytic toolsand Schwartz distribution theory [Friedlander and Joshi, 1998]. The RLCT and its multiplicityhave simple analytically tractable expressions for an integral in normal form; see § §
2. We follow standard practice to first analyze a de-terministic version of the problem, replacing the log-likelihood ratio with its expectation under thedata generating model, and then proceed to handle the stochastic component. Interestingly, theRCLT and its multiplicity appear as the rate and shape parameters of a certain Gamma distributionin our analysis.Variational approaches [Bishop, 2006, MacKay, 2003, Wainwright and Jordan, 2008] have in-creasingly grown in popularity in Bayesian statistics as a different set of probabilistic tools toapproximate the evidence. Variational Bayes (VB) aims to find the best approximation to theposterior (or another target) distribution from a class of tractable probability distributions, withthe approximation error most commonly measured in terms of a Kullback–Leibler divergence. Thisscheme equivalently produces a lower bound to the log-marginal likelihood, commonly known asthe evidence lower bound (ELBO). One of the most popular choices for the approximating class4f distributions is the mean-field family constituting of product distributions, whose origins can betraced back to statistical physics [Parisi, 1988]. The mean-field approximation has seen enormousapplications in Bayesian statistics due to its simplicity as well as availability of general purposecoordinate ascent algorithms (CAVI; Bishop [2006]) to approximate the optimal ELBO.In §
3, we show that mean-field variational inference correctly recovers the RLCT for normalforms, even though the posterior distribution itself has strong dependence and is far from a productstructure (see Figure 1 for an example). To show this result, we first produce a candidate solutionfrom the mean-field class which provides the correct order of the ELBO up to log(log n ) terms. Next,by analyzing the dynamics of the aforesaid coordinate ascent algorithms in the 2d case, we establishthat the order of the ELBO at the candidate solution can not be globally improved, hence showingour bound is sharp. Studying the dynamics of the algorithm was also instrumental in guiding ustowards an analytic form of the candidate solution. While asymptotics of the ELBO for mean-field VB have been studied in specific models such as mixture models [Watanabe and Watanabe,2004, Watanabe and Watanabe, 2005, 2006, 2007a,b], hidden Markov models [Hosino et al., 2005],stochastic context-free grammars [Hosino et al., 2006], and Boltzmann machines [Watanabe et al.,2009], the general result proven here is new to the best of our knowledge. Our analysis adds to theemerging literature on algorithmic behavior of mean field VB [Ghorbani et al., 2018, Mukherjeeet al., 2018, Plummer et al., 2020, Zhang and Zhou, 2020]. Beyond the Bayesian statistics literature,we were also inspired by the recent success of mean-field approximations to estimate the normalizingconstant for probabilistic graphical models [Austin, 2019, Basak and Mukherjee, 2017, Chatterjeeand Dembo, 2016, Yan, 2020]. We begin with introducing some notation. We reserve the notations E (cid:63) and P (cid:63) to respectivelydenote expectation and probability under (the n -fold product of) p ( · | ξ (cid:63) ), where ξ (cid:63) denotes thetrue data generating parameter. Let K n ( ξ ) = n − [ (cid:96) n ( ξ (cid:63) ) − (cid:96) n ( ξ )] be the negative log-likelihoodratio scaled by a factor of n − , so that its E (cid:63) -expectation is the Kullback–Leibler divergence, K ( ξ ) := E (cid:63) (cid:0) K n ( ξ ) (cid:1) = D (cid:0) p ( · | ξ (cid:63) ) (cid:107) p ( · | ξ ) (cid:1) . { ξ : K ( ξ ) = 0 } contains more than one point for singular models.Define Z ( n ) = (cid:90) Ω e − nK n ( ξ ) ϕ ( ξ ) , Z K ( n ) = (cid:90) Ω e − nK ( ξ ) ϕ ( ξ ) . (2.1)It is immediate that Z ( n ) = log m ( X ( n ) ) − (cid:96) n ( ξ (cid:63) ), and it is equivalent to study the asymptoticbehavior of Z ( n ) with n . The deterministic quantity Z K ( n ) is closely related to Z ( n ) as it isobtained by replacing the stochastic quantity K n ( ξ ) with its expectation K ( ξ ) under the truedistribution. Let us denote Z i ( ξ ) = log (cid:26) p ( X i | ξ (cid:63) ) p ( X i | ξ ) (cid:27) − E (cid:63) log (cid:26) p ( X i | ξ (cid:63) ) p ( X i | ξ ) (cid:27) , i = 1 , . . . , n, (2.2)so that n − (cid:80) ni =1 Z i ( ξ ) = [ K n ( ξ ) − K ( ξ )] characterizes the difference between K n and K as anaverage of i.i.d. random variables. Normal-crossing form.
Throughout the paper, we assume K ( ξ ) = ξ : = ξ k . . . ξ k d d is amonomial with k = ( k , . . . , k d ) T ∈ N d a multi-index having at least one positive entry; and thatthe prior density ϕ ( ξ ) = b ( ξ ) ξ h , where h = ( h , . . . , h d ) T ∈ N d is another multi-index and b ( · ) > normal crossing form or simply normal form . While these choices may seem very specific, they in fact completely encapsulate thecomplexity of the general problem. This impactful observation was made by Watanabe based on adeep result in algebraic geometry due to Hironaka on the resolution of singularities, and played amajor role in the development of singular learning theory. A simplified form of Hironaka’s theoremfrom Chapter 6 of Watanabe [2018] is quoted below with minor notational changes. Theorem 2.1 (Hironaka’s theorem) . Assume that K ( ξ ) ≥ is a nonzero analytic function on Ω and that the set { ξ ∈ Ω : K ( ξ ) = 0 } is not empty. Then there exist (cid:15) > , sets { Ξ j ; Ξ j ⊂ Ω } and { U j ; U j ⊂ R d } such that { ξ ∈ Ω : K ( ξ ) < (cid:15) } = (cid:83) j Ξ j , and, for each pair (Ξ j , U j ) , there existanalytic maps g j : U j (cid:55)→ Ξ j satisfying K ( g j ( u )) = u j , | g T j ( u ) | = b j ( u ) | u h j | , here | g T j ( u ) | denotes the absolute value of the determinant of the Jacobian matrix J = ( ∂ξ i /∂u l ) il of the transformation ξ = g j ( u ) . Moreover, b j ( u ) > for all j and k j , h j are multi-indices. For a given K ( · ), the theorem guarantees the existence of the coordinate maps { g j } underwhich K can be locally identified with a monomial on each U j . Hence, the overall integral is firstexpressed as the sum of integrals over each Ξ j , and within each Ξ j , a parameter transformation ismade using the map g j to the reduce the integral to a normal form.The normal form offers a convenient reduction since the real log canonical threshold λ and itsmultiplicity m for normal forms are determined by the multi-indices k and h in a particularly simplefashion: λ is the minimum of the numbers { ( h j + 1) / (2 k j ) } dj =1 and m is the number of indices j which assume the minimum value. For example, in the d = 2 case, the general theory implies (cid:90) [0 , e − nξ ξ dξ (cid:16) C log n √ n , (cid:90) [0 , e − nξ ξ dξ (cid:16) Cn / , since in the first case, k = (1 , T and h = (0 , T , implying λ = min { / , / } = 1 / m = 2; while in the second one, k = (1 , T and h = (0 , T , implying λ = min { / , / } = 1 / m = 1.Figure 1: Contour plot of exp( − nx y ) (left) and exp( − nx y ) (right) on [0 , for n = 100As a concrete statistical example that we shall repeatedly return to in this article, we considerExample 27 in Watanabe [2018] pertaining to a single-layer neural network model, where responseand covariate pair ( y, x ) ∈ R × [0 ,
1] have a joint density modeled as p ( y, x | θ ) = p ( y | x, θ ) p ( x ) = N ( y ; θ tanh( θ x ) , [0 , ( x ), with θ = ( θ , θ ) T ∈ [0 , . Suppose the true parameter is (0 , T ,7nd assume a uniform prior on θ . Watanabe shows that K ( θ , θ ) = θ θ K ( θ ) , K ( t ) = (cid:90) tanh ( tx ) t dx. Moreover, under the transformation, ξ = θ , ξ = θ ( K ( θ ) / / : = g ( θ ) , (2.3)the model reduces to the normal form with ( k , k ) = (1 ,
1) and ( h , h ) = (0 , λ = 1 / , d , K ( ξ ) = ξ , and ϕ ( ξ ) = b ( ξ ) ξ h for the rest of the article. We now proceed to derive non-asymptotic bounds to theintegrals in Eq. (2.1) using probabilistic arguments. We first analyze the non-stochastic quantity Z K ( n ) in § Z ( n ) in § Z K ( n ) In this subsection, we take up the analysis of the non-stochastic quantity Z K ( n ). Watanabe used anumber of powerful complex analytic tools to study the asymptotic behavior of Z K ( n ) as n → ∞ .The asymptotic behavior of Z K ( n ) is dictated by the Laurent expansion of the associated complex-valued zeta function ζ K ( z ) = (cid:90) K ( ξ ) − z ϕ ( ξ ) dξ, z ∈ C . In particular, if ( λ, m ) is the smallest pole and its multiplicity of the meromorphic function ζ K ,then Z K ( n ) ≈ Cn − λ (log n ) m . Chapter 3 of Lin [2011] contains an exposition on the Laurent expansion of ζ K . Alternatively, onemay recognize Z k ( n ) as the Laplace transform of a quantity called the state-density function, whichis a generalized function in the parlance of Schwartz distribution theory. The state-density function8nd the zeta function are inter-related, with the zeta function being the Mellin transform of thestate-density function. See Chapter 4 of Watanabe [2009] and Chapter 5 of Watanabe [2018] for aderivation of the asymptotics of Z K ( n ) based on the state-density function.Our goal here is to provide a non-asymptotic two-sided bound to Z K ( n ) for normal forms basedentirely on basic probabilistic arguments. Interestingly, the quantities λ and m turn out to berelated to the rate and shape parameters of a collection of gamma densities, as we shall see below.In the first result, we assume b ( ξ ) = 1 and treat the general case as a corollary. Theorem 2.2.
Let K ( ξ ) = ξ for ξ ∈ Ω = [0 , d and k = ( k , . . . , k d ) T ∈ N d with at least onepositive entry, and let ϕ ( · ) be a probability density on Ω with ϕ ( ξ ) ∝ ξ h , where h = ( h , . . . , h d ) T ∈ (0 , ∞ ) d . Then, there exists positive constants C and C independent of n such that C (log n ) m − n λ < Z K ( n ) < C (log n ) m − n λ , where λ = min j h j + 12 k j , m = d (cid:88) j =1 (cid:18) h j + 12 k j = λ (cid:19) . Proof.
The main idea behind our proof is to exploit the natural representation of Z K ( n ) as theexpectation of a random variable with respect to the prior measure. Specifically, let T = K ( ξ ),where ξ ∼ ϕ is a random variable distributed according to the prior measure. Then, it immediatelyfollows that the real random variable T takes values in the unit interval [0 ,
1] and Z K ( n ) = Ee − nT .Before proceeding to simplify this expectation, we note some conventions and notation. Let ¯ d = (cid:80) dj =1 ( k j (cid:54) = 0), and without loss of generality assume that k j > j = 1 , . . . , ¯ d and k j = 0 for j > ¯ d . Define λ j : = ( h j + 1) / (2 k j ) for j = 1 , . . . , ¯ d , and without loss of generality, further assumethat these are sorted in non-decreasing order λ ≤ λ . . . ≤ λ ¯ d . By definition, ¯ d ≥ m , and the first m of the λ j s all equal λ . Throughout, we use the convention that an Expo( β ) distribution hasdensity βe − βx (0 , ∞ ) ( x ), that is, β denotes the rate parameter of the distribution.The random variable Z : = − log T can be expressed as Z = (cid:80) ¯ dj =1 Z j with Z j = − log( ξ k j j )for j = 1 , . . . , ¯ d . An application of the change of measure formula yields that Z j ∼ Expo( λ j ) with λ j = ( h j + 1) / (2 k j ) as defined above; interestingly, observe the quantities ( h j + 1) / (2 k j )s appearas the exponential rate parameters. Moreover, since the prior measure ϕ has a product form, the9 j s are independent across j . Letting Φ K ( · ) denote the cumulative distribution function of T , wethen have, for any t ∈ (0 , K ( t ) = P ( T ≤ t ) = P ( − log T ≥ − log t ) = P (cid:18) ¯ d (cid:88) j =1 Z j ≥ log(1 /t ) (cid:19) . (2.4)It follows from the above display that lim t ↓ Φ K ( t ) = 0, lim t ↑ Φ K ( t ) = 1, and Φ K is an absolutelycontinuous cdf that admits a density ϕ K ( · ) with respect to the Lebesgue measure, given by, ϕ K ( t ) = 1 t g Z (cid:0) log(1 /t ) (cid:1) (0 , ( t ) , (2.5)where g Z is the density of Z with respect to the Lebesgue measure. Our object of interest, Z K ( n ) = (cid:90) e − nt ϕ K ( t ) dt = (cid:90) n e − t t g Z (cid:0) log( n/t ) (cid:1) dt. (2.6)Before proceeding to prove the theorem in its entire generality, we consider two special cases whichare instructive in themselves and also help build towards the general proof.First, consider the special case where λ j = λ for all j = 1 , . . . , ¯ d . Then, m = ¯ d and Z ∼ Gamma( m, λ ), where a Gamma( α, β ) distribution has density (cid:0) β α / Γ( α ) (cid:1) e − βx x α − (0 , ∞ ) ( x ). Itfollows that for any t ∈ (0 , n ), g Z (cid:0) log( n/t ) (cid:1) = λ m Γ( m ) n − λ t λ (cid:0) log( n/t ) (cid:1) m − . Substituting in equation (2.6), we obtain that Z K ( n ) = Cn − λ (cid:90) n t λ − e − t (cid:0) log( n/t ) (cid:1) m − dt (cid:124) (cid:123)(cid:122) (cid:125) I m ( n ) (cid:16) n − λ (log n ) m − . (2.7)The proof of the assertion that I m ( n ) (cid:16) (log n ) m − for any m ≥ λ < . . . < λ ¯ d , which implies that λ = λ and m = 1.The distribution of Z isn’t recognizable as a standard density any longer, although an analyticexpression for its density is available in the literature as quoted below. Theorem 2.3 ([Bibinger, 2013, Mathai, 1982]) . Let Z k ind. ∼ Expo ( λ k ) for k = 1 , . . . , K , with λ < . . . < λ K . Then, the density g Z of Z = (cid:80) Kk =1 Z k is g Z ( z ) = K (cid:88) k =1 (cid:18) (cid:89) r (cid:54) = k λ r λ r − λ k (cid:19)(cid:124) (cid:123)(cid:122) (cid:125) b k g k ( z ) , where g k ( z ) = λ k e − λ k z is the density of Z k . The coefficients { b k } can be both positive and negative, and thus the above is not a mixture ofexponential densities. However, the coefficient b corresponding to the smallest rate parameter λ is positive. We have, for any t ∈ (0 , n ), g Z (cid:0) log( n/t ) (cid:1) = ¯ d (cid:88) j =1 b j λ j n − λ j t λ j . 
Substituting this expression in equation (2.6), we get Z K ( n ) = ¯ d (cid:88) j =1 b j λ j n − λ j (cid:90) n e − t t λ j − dt (cid:16) ¯ d (cid:88) j =1 b j n − λ j (cid:16) n − λ . (2.8)This proves the theorem for this special case. The fact that b > n − λ > n − λ T for λ T > λ > g Z is of secondaryimportance, and the focus should be on extracting the most significant contribution in terms of n .This is our strategy for the most general case.In the general case, assume that there are d ∗ ≤ ¯ d unique λ -values λ ∗ < λ ∗ . . . < λ ∗ d ∗ among { λ j } ¯ dj =1 with corresponding multiplicities m , . . . , m d ∗ . It is then immediate that (cid:80) d ∗ s =1 m s = ¯ d .Also, ( λ ∗ , m ) = ( λ, m ) from the theorem statement. Exploiting the independence of the Z j s, wewrite Z = (cid:80) d ∗ s =1 W s , with W s ind. ∼ Gamma( m s , λ ∗ s ) for s = 1 , . . . , d ∗ . While there exist expressions11or the density of sum of independent Gamma random variables [Mathai, 1982], they are muchmore cumbersome than the simpler case of exponentials in Theorem 2.3. Hence, we do not attemptto work with the density g Z and instead aim to bound Z K ( n ) from both sides. To that end, wecrucially use the idea of stochastic ordering of random variables.Recall that for real random variables X , X , X is said to be stochastically smaller than X iffor every x ∈ R , P ( X > x ) ≥ P ( X > x ). We use the notation X < st X to denote this stochasticordering. We now record a useful result. Lemma 2.1.
Consider the random variable Z = (cid:80) d ∗ s =1 W s , with W s ind. ∼ Gamma( m s , λ ∗ s ) . Assume λ ∗ < . . . < λ ∗ d ∗ and let ¯ d = (cid:80) d ∗ s =1 m s . Define Z (cid:96) = W and Z c = (cid:80) d ∗ s =2 (cid:102) W s , where (cid:102) W s ind. ∼ Gamma( m s , λ ∗ ) are also independent of W . Then, Z (cid:96) ∼ Gamma( m , λ ∗ ) , Z c ∼ Gamma( ¯ d − m , λ ∗ ) , Z (cid:96) and Z c are independent, and with Z u : = Z (cid:96) + Z c , Z (cid:96) < st Z < st Z u . With this result in place, we now aim to bound Z K ( n ) = Ee − nT . Since e − nT is a non-negativerandom variable taking values in (0 , Ee − nT = (cid:90) u =0 P (cid:0) e − nT > u (cid:1) du = (cid:90) u =0 P (cid:0) T < log(1 /u ) /n (cid:1) du = (cid:90) u =0 P (cid:0) Z > log n − log(log 1 /u ) (cid:1) du. (2.9)For any z >
0, we have the following two-sided inequality from Lemma 2.1, P ( Z (cid:96) > z ) < P ( Z > z ) < P ( Z u > z ) < P ( Z (cid:96) > z ) + P ( Z c > z ) . Here, the last inequality follows from an application of the union bound. Substituting this inequalityat the end of equation (2.9) for every u and working backwards, we obtain Ee − nT (cid:96) < Ee − nT < Ee − nT (cid:96) + Ee − nT c , where T (cid:96) = e − Z (cid:96) and T c = e − Z c . Since Z (cid:96) and Z c are both gamma random variables, it follows12rom equation (2.7) that Ee − nT (cid:96) > Cn − λ (log n ) m − and Ee − nT (cid:96) + Ee − nT c < C n − λ (log n ) m − + C n − λ (log n ) m − < C n − λ (log n ) m − . This delivers the desired bound.We now state a corollary to Theorem 2.2 relaxing the assumption on the prior Corollary 2.1.
Assume the setup of Theorem 2.2. Let b : U → R be an analytic function with b ( ) (cid:54) = 0 , where U is any open subset of R d containing Ω . Then, (cid:90) Ω b ( ξ ) e − nK ( ξ ) ϕ ( ξ ) dξ (cid:16) n − λ (log n ) m − . Corollary 2.1 shows that the assumption of a product prior in Theorem 2.2 can be relaxed tomore general priors of the form (cid:101) ϕ ( ξ ) ∝ b ( ξ ) ϕ ( ξ ), with the same asymptotic order of the normalizingconstant as before. Z ( n ) We now extend our probabilistic analysis from the previous subsection to analyze the stochasticquantity Z ( n ). Write Z ( n ) = (cid:90) Ω e − nK ( ξ ) −√ n ξ k W n ( ξ ) ϕ ( ξ ) dξ, where W n ( ξ ) = 1 √ n n (cid:88) i =1 ξ − k Z i ( ξ ) . (2.10)We make some simplifying assumptions to keep the presentation from getting notationally tooheavy. We shall assume ϕ ( ξ ) ∝ ξ h , and also that λ j = λ for all j = 1 , . . . , ¯ d < d . Recall from § m = ¯ d in this case. Denote by I the set { , . . . , m } and J = { m + 1 , . . . , d } . Clearly, under ϕ ( · ), the common distribution of the independent random variables ξ j for j ∈ I is Beta( h j + 1 , ξ k j j is Beta( λ,
1) distributed for j ∈ I .We now state a stochastic approximation result for Z ( n ) in Theorem 2.4 that can be consideredas a non-asymptotic version of Theorem 11 in Watanabe [2018]. The main idea lies in decoupling13he effect of the singular part ξ I controlled by K ( ξ ) from the non-singular ξ J part of ξ . As we shallsee in Theorem 2.4, our proof relies heavily on the properties of the conditional density of ξ I given K ( ξ ).Since the distribution of Z = − log T = − (cid:80) mj =1 k j log ξ j is Gamma( m, λ ), the conditionaldistribution ( − k log ξ , . . . , − k m log ξ m ) := ( Z , . . . , Z m ) | Z is given by Z × Dirichlet( m ) andis expressed as f Z ,...,Z m | Z ( z , . . . , z m ) = Γ( m ) Z m − , ≤ m − (cid:88) i =1 z i ≤ Z, z m = Z − m − (cid:88) i =1 z i . Hence, the conditional density of ξ | Z = ( e − Z / (2 k ) , . . . , e − Z m / (2 k m ) ) | Z is given by ϕ ξ | Z ( ξ I ) = 2 m (cid:81) mj =1 k j (cid:81) mj =1 ξ j Γ( m ) Z m − , e − Z ≤ ( ξ I − m ) k − m ≤ , ξ kI = Z, (2.11) where I − m = I \{ m } , k − m = k \{ k m } . Also, ϕ ξ | Z = z ( ξ I ) is the same as the density ϕ ξ | T = e − z ( ξ I ).Define a sequence of stochastic processes D n ( t, ξ ) with index set R + × [0 , d as D n ( t, ξ ) = t λ − e − t −√ tW n ( ξ ) ϕ ξ | Z = − log( t/n ) ( ξ I ) ϕ ( ξ J ) . Further, define an integrated version of D n ( t, ξ ) as D n ( t ) = (cid:82) Ω D n ( t, ξ ) dξ for t ∈ R + . Theorem 2.4.
The following expression provides a non-asymptotic stochastic expansion of Z ( n ) with n − λ (log n ) m − as the leading term, Z ( n ) d (cid:89) j =1 ( h j + 1) = n − λ (log n ) m − λ m Γ( m ) (cid:90) n D n ( t ) dt + R n where R n = (cid:90) nt =0 r n ( t ) D n ( t ) dt, r n ( t ) = λ m Γ( m ) n − λ m − (cid:88) j =1 (cid:18) m − j (cid:19)(cid:0) log n (cid:1) j ( − log t ) m − − j . Moreover, the remainder term R n is smaller order in comparison with the dominating term. If he sequence of stochastic processes W n satisfied (cid:107) W n (cid:107) ∞ = O p (1) , then | R n | n − λ (log n ) m − → , almost surely.Proof. First part:
By abuse of notation we shall assume that ϕ corresponds to a product Betadensity (cid:81) dj =1 Beta( ξ j | h j + 1 , Z ( n ) by (cid:81) dj =1 ( h j + 1) we have Z ( n ) d (cid:89) j =1 ( h j + 1) = (cid:90) e − nK ( ξ ) −√ nξ k W n ( ξ ) ϕ ( ξ ) dξ = (cid:90) e − nt (cid:104) (cid:90) e −√ ntW n ( ξ ) ϕ ξ | T = t ( ξ I ) ϕ ( ξ J ) dξ (cid:105) ϕ K ( t ) dt where ϕ K ( t ) is the density of T = K ( ξ ) as in § ϕ K ( t ) = (1 /t ) g Z (cid:8) log(1 /t ) (cid:9) , where g Z ( · ) is the pdf of a Gamma( λ, m ) random variable, we have by another change of variable, nt (cid:55)→ t that, Z ( n ) d (cid:89) j =1 ( h j + 1) = (cid:90) n e − t t g Z (cid:16) log nt (cid:17)(cid:104) (cid:90) e −√ tW n ( ξ ) ϕ ξ | T = t/n ( ξ I ) ϕ ( ξ J ) dξ (cid:105) dt. Noting, g Z (cid:0) log( n/t ) (cid:1) = λ m Γ( m ) n − λ t λ (cid:0) log( n/t ) (cid:1) m − = λ m Γ( m ) n − λ t λ (cid:0) log n (cid:1) m − + r n ( t ) t λ it follows Z ( n ) d (cid:89) j =1 ( h j + 1) = λ m Γ( m ) n − λ (cid:0) log n (cid:1) m − (cid:90) n (cid:90) D n ( t, ξ ) dξdt + R n . Second part:
Since the maximum exponent of all logarithmic terms inside the summation is( m −
2) we have | r n ( t ) | n − λ (log n ) m − ≤ C ( m )(log n ) − m − (cid:88) j =1 | log t | m − − j C ( m ) depending on m . Also D n ( t ) ≤ t λ − e − t + √ t (cid:107) W n (cid:107) ∞ and hence | R n | n − λ (log n ) m − ≤ C ( m )log n m − (cid:88) j =1 (cid:90) ∞ t =0 e − t + √ t (cid:107) W n (cid:107) ∞ t λ − | log t | m − − j dt Since the function e − t + √ t (cid:107) W n (cid:107) ∞ t λ − | log t | m − − j is integrable and (cid:107) W n (cid:107) ∞ = O p (1), the result fol-lows immediately. Watanabe [Watanabe, 2018] arrives at an asymptotic expansion of Z ( n ) using a different technique.He refers to the inverse Laplace transform of the integrand exp( − nK ( ξ )) ϕ ( ξ ) as the state densityfunction given by δ ( t − ξ k ) ϕ ( ξ ), where δ ( · ) is the Dirac delta function in the sense of a generalizedfunction in Schwartz distribution theory. Thereafter,exp( − nK n ( ξ )) ϕ ( ξ ) = (cid:90) t =0 e − nt + √ ntW n ( ξ ) δ ( t − ξ k ) ϕ ( ξ ) dt whence the marginal likelihood Z ( n ) becomes Z ( n ) = (cid:90) (cid:90) t =0 e − nt + √ ntW n ( ξ ) δ ( t − ξ k ) dtϕ ( ξ ) dξ := (cid:90) n v ( t ) dt, with v ( t ) = (cid:82) e − t + √ tW n ( ξ ) δ ( t/n − ξ k ) dξ . Watanabe then distills v ( t ) by passing it through aMellin’s transform (cid:82) v ( t ) t z dt , discards the smaller order terms before transforming back to ˜ v ( t ) (cid:16) n − λ t λ − { log( n/t ) } m − e − t + √ tW n ( ξ ) using the inverse Mellin transform. Substituting back into Z ( n ), Z ( n ) (cid:16) (cid:90) n ˜ v ( t ) dt (cid:16) n − λ (log n ) m − (cid:90) n t λ − e − t + √ tW n ( ξ ) dt. Our proof technique avoids the Mellin’s transform and instead relies on simple probabilistictools. The main differences are as follows: • The use of Dirac delta function as a generalized function is avoided by taking the conditionaldistribution of ξ with respect to ξ k and multiplying with the marginal density of ξ k .16 The conditioning neutralizes the effect of the singular part along with the stochastic compo-nent, while the marginal density provides the overall order. • No approximation is made during the process. In contrast, while taking the inverse Mellin’stransform, smaller order terms are dropped. Hence our representation of Z ( n ) in Theorem2.4 is exact as opposed to Theorem 11 in Watanabe [2018].Now, we show that asymptotically as n → ∞ , Theorem 2.4 recovers Theorem 11 in Watanabe[2018]. An important ingredient of making this connection is to show a weak convergence ofthe sequence of stochastic processes W n . The expected value and the covariance of Z i ( ξ ) are E Z i ( ξ ) = 0 , cov[ Z i ( ξ ) , Z i ( ζ )] := c z ( ξ, ζ ), respectively. For W n ( ξ ), the same quantities are given by E W n ( ξ ) = ξ − k √ n n (cid:88) i =1 E Z i ( ξ ) = 0 , cov[ W n ( ξ ) , W n ( ζ )] = c z ( ξ, ζ ) ξ k ζ k := c w ( ξ, ζ ) . Let W ∗ denote a mean zero Gaussian process on Ω = [0 , d with covariance kernel c w . Underappropriate conditions on the stochastic processes { ˜ Z i ( ξ ) = ξ − k Z i ( ξ ) : ξ ∈ [0 , d } , we show inProposition 2.1 that W n weakly converges to W ∗ . The proof requires sub-Gaussianity [Vershynin,2018] of ˜ Z i ( ξ ). Assumption A1:
Suppose that ˜ Z i ( ξ ) are iid sub-Gaussian. Furthermore suppose that there existsa positive function L : R → R + with E e tL ( X ) ≤ e t / (2 c L ) for every t > c L >
0, such that | ˜ Z i ( ξ ) − ˜ Z i ( ξ T ) | ≤ L ( X i ) (cid:107) ξ − ξ T (cid:107) . Proposition 2.1.
Under Assumption A1 , W n w → W ∗ . Sections 10.4 and 10.5 of Watanabe [2018] provide heuristic arguments to study weak conver-gence of W n ; the assumptions require ξ − k Z i ( ξ ) to be at least d/ Assumption A1 requires ˜ Z i ( ξ ) to be Lipschitz and sub-Gaussian. In many ex-amples, such as the aforementioned one-layered neural network in Example 27 of Watanabe [2018], A real valued random variable X is called sub-Gaussian if there exists a constant c X > P ( | X | > t ) ≤ e − t / (2 c X ) , t ≥ Z i ( θ , θ ) = 12 θ θ F ( X i ; θ ) − θ θ Y i F ( X i ; θ ) + K ( θ , θ ) , where F ( x ; θ ) := tanh( θ x ) /θ . Then,˜ Z i ( θ , θ ) = Z i ( θ , θ ) θ θ = 12 θ θ F ( X i ; θ ) − Y i F ( X i ; θ ) + 12 θ θ K ( θ ) . Applying the change of variables ξ = θ and ξ = θ ( K ( θ ) /θ ) / = g ( θ ) and noting that˜ Z i ( ξ ) := ˜ Z i ( ξ , g − ( ξ )), we have˜ Z i ( ξ ) = 12 ξ g − ( ξ ) F { X i ; g − ( ξ ) } − Y i F { X i ; g − ( ξ ) } + 12 ξ g − ( ξ ) K { g − ( ξ ) } . We show ˜ Z i ( ξ ) above satisfies Assumption A1 in Appendix D.The next ingredient in establishing the connection is to establish a weak limit of D n ( t ) = (cid:82) D n ( t, ξ ) dξ . In Proposition 2.2, we show that for each t > D n ( t ) converges weakly to thefollowing fixed stochastic process D ( t ) = (cid:90) t λ − e − t −√ tW ∗ ( ,ξ J ) ϕ ( ξ J ) dξ J . In addition, we also show in Proposition 2.2 that (cid:82) n D n ( t ) dt converges in distribution to (cid:82) ∞ D ( t ) dt . Proposition 2.2. If W n w → W ∗ , (cid:107) W ∗ (cid:107) ∞ = O p (1) and ξ I (cid:55)→ W ∗ ( ξ I , ξ J ) is almost surely continuous,then for each t > , D n ( t ) w → D ( t ) and (cid:82) n D n ( t ) dt w → (cid:82) ∞ D ( t ) dt . Using Theorem 2.4 and Proposition 2.2, Z ( n ) d (cid:89) j =1 ( h j + 1) ∼ n − λ (log n ) m − λ m Γ( m ) (cid:90) ∞ t λ − e − t + √ tW ∗ ( ,ξ J ) ϕ ( ξ J ) ϕ ( ξ J ) dt Z ( n ) ∼ n − λ (log n ) m − m ( m − (cid:81) mj =1 k j (cid:90) ∞ t λ − e − t + √ tW ∗ ( ,ξ J ) ξ h J J dt, (2.12)18here h J = ( h m +1 , . . . , h d ). Using the properties of the Dirac delta function, we can write D ( t ) = (cid:90) t λ − e − t −√ tW ∗ ( ξ ) δ ( ξ I ) ϕ ( ξ J ) dξ. where δ ( ξ I ) is a Dirac delta measure at , which also appears in Theorem 11 in Watanabe [2009]in the expansion of Z ( n ). Observing that that k j = 0 for j = m + 1 , . . . d , (2.12) exactly matcheswith the equation (5.32) of Theorem 10 or the expression under Theorem 11 in Watanabe [2009]. In this section, our goal is to show that mean-field variational approximation always correctlyrecovers the RLCT for singular models in normal form, which therefore constitute an interestingclass of statistical examples where the mean-field approximation is provably better than the Laplaceapproximation. An important ingredient of our analysis is to study the dynamics of the associatedCAVI algorithm, which might be of independent interest. While mean-field inference is known toproduce meaningful parameter estimates in many statistical models [Pati et al., 2018, Yang et al.,2020], the algorithmic landscape contains both positive and negative results [Ghorbani et al., 2018,Mukherjee et al., 2018, Wang et al., 2006, Zhang and Zhou, 2020].To simplify our analysis, we shall work with the deterministic quantity Z K ( n ). Define a prob-ability density γ ( n ) K ( ξ ) = e − nK ( ξ ) ϕ ( ξ ) Z K ( n ) , ξ ∈ Ω , (3.1)with K ( ξ ) = ξ in normal form as in Theorem 2.2 and ϕ ( ξ ) ∝ b ( ξ ) ξ h is a probability densityon Ω with the analytic function b ( · ) as in Corollary 2.1. 
Clearly, Z K ( n ) is then recognized as thenormalizing constant of γ ( n ) K ( · ), which serves as a deterministic version of the posterior defined inEq. (1.1).For any probability measure ρ on Ω with ρ (cid:28) ϕ , the following well-known identity is easy toestablish, D (cid:0) ρ (cid:107) γ ( n ) K (cid:1) = log Z k ( n ) + (cid:20) (cid:90) Ω nK ( ξ ) ρ ( dξ ) + D ( ρ (cid:107) ϕ ) (cid:21) , (3.2)19here recall D ( µ (cid:107) ν ) = E µ (log dµ/dν ) is the Kullback–Leibler divergence between µ and ν . Animmediate upshot of this is the Gibb’s variational inequality, which states that for any probabilitydensity ρ (cid:28) ϕ on Ω, log Z K ( n ) ≥ Ψ n ( ρ ) := − (cid:20) (cid:90) nK ( ξ ) ρ ( dξ ) + D ( ρ (cid:107) ϕ ) (cid:21) , (3.3)with equality attained if and only if ρ = γ ( n ) K . The Gibb’s variational inequality is central to avariational approximation to the normalizing constant Z K ( n ). The quantity Ψ n ( ρ ) in the righthand side of (3.3) is a lower bound to log Z K ( n ) for any ρ (cid:28) ϕ . A variational lower bound tolog Z K ( n ) is then obtained by optimizing the variational parameter ρ over a family of probabilitydensities F on Ω, log Z K ( n ) ≥ ELBO( F ) : = sup ρ ∈F Ψ n ( ρ ) . (3.4)The notation ELBO here abbreviates evidence lower bound , which is commonly used to designatethe variational lower bound in Bayesian statistics. If the supremum in (3.4) is attained at some ρ (cid:63) ∈ F , the density ρ (cid:63) is called the optimal variational approximation. It follows from equation(3.2) that ρ (cid:63) is a best approximation to γ ( n ) K in terms of KL divergence from the class F , i.e., D (cid:0) ρ (cid:63) (cid:107) γ ( n ) K (cid:1) = inf ρ ∈F D (cid:0) ρ (cid:107) γ ( n ) K (cid:1) . The choice of the family F typically aims to balance computational tractability and expressiveness.A popular example is the mean-field family, F MF : = (cid:8) ρ = ρ ⊗ . . . ⊗ ρ d : ρ (cid:28) ϕ a prob. measure on Ω (cid:9) , (3.5)where ρ is assumed to be a product-measure, with no further restriction on the constituent arms.Mean-field variational approximation has its origins in statistical physics [Parisi, 1988], and hassubsequently found extensive usage in Bayesian statistics for parameter estimation and modelselection [Bishop, 2006].Since γ ( n ) K does not lie in F MF for any n , it follows that the inequality in equation (3.4) is a20trict one if we restrict F to the mean-field family. We, however, show below that the mean-fieldapproximation correctly recovers the leading order term in the asymptotic expansion of log Z K ( n ). Theorem 3.1.
Consider a variational approximation (3.4) to Z K ( n ) in equation (3.1) , where thevariational family F is taken to be the mean-field family F MF defined in equation (3.5) . Then,there exists a constant C independent of n such that ELBO( F MF ) ≥ − λ log n − C , where λ =min j [( h j + 1) / (2 k j )] is the real log canonical threshold. Since log Z K ( n ) (cid:16) − λ log n + ( m −
1) log(log n ), it follows that the mean-field approximationcorrectly recovers the leading order term in the asymptotic expansion of log Z K ( n ). This, inparticular, implies that the relative error R n due to the mean-field approximation R n : = | log Z K ( n ) − ELBO( F MF ) || log Z K ( n ) | → n → ∞ . This is rather interesting, as the density γ ( n ) K clearly does not lie in F MF for any finite n . Proof.
Our strategy is to produce a candidate (cid:101) ρ = ⊗ dj =1 (cid:101) ρ j such that Ψ n ( (cid:101) ρ ) ≥ − λ log n − C . Weintroduce some notation before describing the (cid:101) ρ j s. For k, h, β >
0, define a density f k,h,β on [0 , f k,h,β ( u ) = u h exp( − βu k ) [0 , ( u ) B ( k, h, β ) , (3.6)where B ( k, h, β ) = (cid:82) x h exp( − βx k ) dx . We record two useful facts about f k,h,β in the Lemma be-low. We collect some well-known facts first about the incomplete gamma function; see Abramowitzand Stegun [1964]. Remark 3.1.
For x, a > , denote the lower incomplete gamma function by γ ( a, x ) = 1Γ( a ) (cid:90) x t a − e − t dt. For any fixed a > , γ ( a, · ) takes values in (0 , , with lim x →∞ γ ( a, x ) = 1 . Also, lim x → γ ( a, x ) /x a = 1 / Γ( a + 1) , and γ ( a + 1 , x ) = γ ( a, x ) − x a e − x Γ( a + 1) . emma 3.1. Let the density f k,h,β be as in equation (3.6) . Then,(i) The normalizing constant B ( k, h, β ) is given by B ( k, h, β ) = β − λ Γ( λ ) γ ( λ, β )2 k . (ii) The quantity (cid:82) u k f k,h,β ( du ) depends on k and h only through λ : = ( h + 1) / (2 k ) . Call thisexpectation G ( λ, β ) , and we have G ( λ, β ) : = (cid:90) u k f k,h,β ( u ) du = λβ γ ( λ + 1 , β ) γ ( λ, β ) . (iii) We have, lim β →∞ | log B ( k, h, β ) − ( − λ log β ) | λ log β = 0 , lim β →∞ G ( λ, β ) λ/β = 1 . Thus, for large β , log B ( k, h, β ) (cid:16) − λ log β , and G ( λ, β ) (cid:16) λ/β . We now construct (cid:101) ρ = ⊗ dj =1 (cid:101) ρ j . Let g ∈ { , . . . , d } be such that ( h g + 1) / (2 k g ) = λ ; in case thereare multiple such indices, we arbitrarily break tie. Set (cid:101) ρ j = f k j ,h j ,β j for j = 1 , . . . , d with β g = n and β j = 1 for j (cid:54) = g . With this choice, let us now bound Ψ n ( (cid:101) ρ ). We divide this up into two parts.First, we have (cid:90) Ω nK ( ξ ) (cid:101) ρ ( ξ ) dξ = n d (cid:89) j =1 (cid:90) ξ k j j (cid:101) ρ j ( ξ j ) dξ j = n d (cid:89) j =1 G ( λ j , β j ) = nG ( λ, n ) (cid:89) j (cid:54) = g G ( λ j , , where recall that λ j = ( h j + 1) / (2 k j ). For the first equality, we use the product form of both K ( · ) and (cid:101) ρ . The second inequality uses Lemma 3.1 and that ( λ g , β g ) = ( λ, n ). The quantity (cid:81) j (cid:54) = g G ( λ j ,
1) is clearly a constant free of n , and from part (iii) of Lemma 3.1, nG ( λ, n ) (cid:16) λ . Thisimplies (cid:82) nK ( ξ ) (cid:101) ρ ( ξ ) dξ is overall of a constant order.Next, consider D ( (cid:101) ρ (cid:107) ϕ ). Let (cid:101) ϕ be the probability density on Ω with ¯ ϕ ( ξ ) ∝ ξ h . Write D ( (cid:101) ρ (cid:107) ϕ ) = D ( (cid:101) ρ (cid:107) ¯ ϕ ) + (cid:90) Ω (cid:101) ρ log ¯ ϕϕ . The second term in the above display, up to an additive constant, is − (cid:82) Ω log b ( ξ ) (cid:101) ρ ( dξ ), which22s bounded above by − log b since b ( · ) > b on Ω. Hence, we focus attention on the first term D ( (cid:101) ρ (cid:107) ¯ ϕ ), which is simpler to analyze than D ( (cid:101) ρ (cid:107) ϕ ) since both (cid:101) ρ and ¯ ϕ have a product form as inequation (3.5). In particular, we have D ( (cid:101) ρ (cid:107) ¯ ϕ ) = (cid:80) dj =1 D ( (cid:101) ρ j (cid:107) ¯ ϕ j ), where ¯ ϕ j is the j th marginal of¯ ϕ with density ¯ ϕ j ( u ) ∝ u h j for u ∈ [0 , (cid:101) ρ j = f k j ,h j ,β j , we have, D ( (cid:101) ρ j (cid:107) ¯ ϕ j ) = − β j (cid:90) u k j f k j ,h j ,β j ( u ) du − log B ( k j , h j , β j ) + log(1 + h j )= − β j G ( λ j , β j ) − log B ( k j , h j , β j ) + log(1 + h j ) . If j (cid:54) = g , the above quantity is some constant free of n . For j = g , from part (iii) of Lemma 3.1, weget D ( (cid:101) ρ g (cid:107) ¯ ϕ g ) = C + λ log n .Putting the pieces together, we have proved that Ψ n ( (cid:101) ρ ) ≥ − λ log n − C for some constant C free of n . This completes the proof. In this section, we first provide some insight into our choice of the candidate solution (cid:101) ρ in theproof of Theorem 3.1. The choice of (cid:101) ρ was motivated by empirically analyzing the behavior ofa coordinate ascent (CAVI) algorithm for the optimization problem sup ρ ∈F MF Ψ n ( ρ ) in the d = 2case, which naturally constrains the coordinate updates to lie in the family of densities { f k,h,β } .Secondly, we comment on the “optimality” of the candidate (cid:101) ρ , regarding which Theorem 3.1 isinconclusive. In particular, the theorem only obtains a lower bound to ELBO( F MF ), and it is naturalto question whether there is a scope for improvement – note that there is a log(log n ) gap betweenthe asymptotic order of log Z K ( n ) and the lower bound to the ELBO. Studying the dynamics ofthe CAVI algorithm, we demonstrate a class of examples where ELBO( F MF ) = − λ log n + C forsome constant C , implying no further improvement is possible uniformly.The Coordinate Ascent Variational Inference (CAVI) algorithm is popular in statistics andmachine learning for maximizing an evidence lower bound over a mean-field family; see Chapter10 of Bishop for a book-level treatment. The CAVI can be interpreted as a cyclical coordinateascent algorithm which at any iteration t ≥ n ( ρ ) as a function of ρ j ,keeping { ρ (cid:96) } (cid:96) (cid:54) = j fixed at their current value { ρ ( t ) (cid:96) } (cid:96) (cid:54) = j . For example, in the d = 2 case, the iterates23 ( t ) = ρ ( t )1 ⊗ ρ ( t )2 for t ≥ ρ ( t )1 = arg max ρ Ψ n (cid:0) ρ ⊗ ρ ( t − (cid:1) , ρ ( t )2 = arg max ρ Ψ n (cid:0) ρ ( t )1 ⊗ ρ (cid:1) , with an arbitrary initialization ρ (0) = ρ (0)1 ⊗ ρ (0)2 (cid:28) ϕ , and assuming the first component getsupdated first. The objective function Ψ n ( ρ ⊗ ρ ) is concave in each argument so that the max-imization problems in the update step above have unique solutions. 
Moreover, these maximizersadmit a convenient integral representation, which facilitates tractability of the updates in condi-tionally conjugate models. It is straightforward to see that the successive CAVI iterates increasethe objective function value, since for any t ≥ n (cid:0) ρ ( t )1 ⊗ ρ ( t )2 (cid:1) ≥ Ψ n (cid:0) ρ ( t )1 ⊗ ρ ( t − (cid:1) ≥ Ψ n (cid:0) ρ ( t − ⊗ ρ ( t − (cid:1) , (3.7)although convergence to a global optimum is not guaranteed in general.Returning to the present case, consider the normal form of a singular model with parameterdimension d = 2, γ ( n ) K ( ξ , ξ ) ∝ ξ h ξ h exp( − nξ k ξ k ) , ( ξ , ξ ) ∈ [0 , , (3.8)resulting from setting b ( ξ ) ≡ λ i = ( h i + 1) / k i for i = 1 , λ ≤ λ , implying the real log canonical threshold for thismodel is λ = λ . The mean field family F MF in this case consists of product distributions ρ ⊗ ρ ,where ρ and ρ are absolutely continuous densities on [0 , t th iteration of the CAVI algorithm (3.7) in this case is ρ ( t ) ( ξ ) = ρ ( t )1 ( ξ ) · ρ ( t )2 ( ξ ), with ρ ( t )1 ( ξ ) = f k ,h ,nµ ( t )1 ( ξ ) , ρ ( t )2 ( ξ ) = f k ,h ,nµ ( t )2 ( ξ ) , (3.9)where recall the density f k,h,β is defined in equation (3.6) and for t ≥ µ ( t )1 = G (cid:0) λ , nµ ( t − (cid:1) , µ ( t ) = G (cid:0) λ , nµ ( t )1 (cid:1) . (3.10) although, it is rarely jointly concave
24e also record the value of the ELBO at iteration t ,Ψ n ( ρ ( t ) ) = − n G (cid:0) λ , nµ ( t )1 (cid:1) G (cid:0) λ , nµ ( t )2 (cid:1) + n µ ( t )1 G (cid:0) λ , nµ ( t )1 (cid:1) + n µ ( t )2 G (cid:0) λ , nµ ( t )2 (cid:1) + log B (cid:0) h , k , nµ ( t )1 (cid:1) + log B (cid:0) h , k , nµ ( t )2 (cid:1) . (3.11)Inspecting equation (3.9), it becomes apparent that the changes in ρ ( t ) i across t entirely takes placethrough the univariate parameters µ ( t ) i , for i = 1 ,
2, with the joint evolution of ( µ ( t )1 , µ ( t )2 ) describedthrough the 2d dynamical system in equation (3.10). Instead of analyzing the dynamics of thetwo-dimensional sequential system in (3.10), we achieve a further simplification by decoupling itinto the following one-dimensional systems µ ( t )1 = G (cid:16) λ , n G (cid:0) λ , nµ ( t − (cid:1)(cid:17) , t ≥ ,µ ( t )2 = G (cid:16) λ , n G (cid:0) λ , nµ ( t − (cid:1)(cid:17) , t ≥ . (3.12)These equations, when initialized using µ (1)1 = G (cid:0) λ , nµ (0)2 (cid:1) , will produce the same behavior as theoriginal system (3.10) and are far easier to analyze than the coupled two-dimensional system.We now show that the system (3.12) has a unique globally attracting fixed point ( µ (cid:63) , µ (cid:63) ), i.e.,the updates (cid:0) µ ( t )1 , µ ( t )2 (cid:1) will always converge to ( µ (cid:63) , µ (cid:63) ) as t tends to infinity, irrespective of theinitialization – there is no possibility of cycles or divergence. The asymmetric ( λ < λ ) andsymmetric ( λ = λ ) cases require separate attention, and are treated in two parts. We commentthat the existence of unique fixed points is generally not guaranteed; see, for example, Plummeret al. [2020]. Lemma 3.2.
Let < λ < λ < ∞ . Fix n ∈ N .1. The function x (cid:55)→ G (cid:0) λ , nG ( λ , nx ) (cid:1) on [0 , ∞ ) has a unique fixed point which lies in theinterval Λ : = [0 , λ / ( λ + 1)] .2. The function x (cid:55)→ G (cid:0) λ , nG ( λ , nx ) (cid:1) on [0 , ∞ ) has a unique fixed point which lies in theinterval Λ : = [0 , λ / ( λ + 1)] .Each of these fixed points are globally attracting. et λ = λ = λ with < λ < ∞ . The function x (cid:55)→ G (cid:0) λ, nG ( λ, nx ) (cid:1) on [0 , ∞ ) has a uniqueglobally attracting fixed point which lies in the interval [0 , (cid:112) λ/n ] . Defining ρ (cid:63) = ρ (cid:63) ⊗ ρ (cid:63) , ρ (cid:63)i = f k i ,h i ,nµ (cid:63)i , i = 1 , , (3.13)it follows from the above result in conjunction with Scheffe’s theorem that the CAVI updates inthe density space (3.9) converge to ρ (cid:63) in the total variation norm irrespective of the initialization,i.e., lim t →∞ (cid:107) ρ ( t ) − ρ (cid:63) (cid:107) TV = 0. Importantly, we can now argue that ρ (cid:63) is a global maximizer ofΨ n , and hence ELBO( F MF ) = Ψ n ( ρ (cid:63) ). To see why this is true, assume in the contrary that thereexists ¯ ρ = ¯ ρ ⊗ ¯ ρ with Ψ n ( ¯ ρ ) > Ψ n ( ρ (cid:63) ). Initialize the CAVI updates (3.9) with ¯ ρ and ¯ ρ . Sincethe updates converge to ρ (cid:63) , invoke equation (3.7) to conclude that Ψ n ( ρ (cid:63) ) ≥ Ψ n ( ¯ ρ ), arriving at acontradiction.It is evident that ( µ (cid:63) , µ (cid:63) ) (and hence ρ (cid:63) ) depend on n , although we have not attempted tocharacterize their dependence on n as yet. Since we now know the identity of a global maximizer ofΨ n for any n ∈ N , we proceed to characterize the order of the fixed points µ (cid:63) and µ (cid:63) as a functionof n in Lemma 3.3 below. Lemma 3.3.
Let < λ ≤ λ < ∞ . The fixed points of the system (3.9) satisfy µ (cid:63) µ (cid:63) (cid:16) /n .Moreover,1. if λ < λ , then µ (cid:63) = O (1) and µ (cid:63) (cid:16) /n .2. if λ = λ , then µ (cid:63) = µ (cid:63) and both fixed points are of order / √ n . Lemma 3.3 shows that the product of µ (cid:63) and µ (cid:63) always decays in the order of 1 /n . In thesymmetric case when λ = λ , this is achieved by equally splitting the order between the two,while in the asymmetric case, the other extreme is observed. In the asymmetric case, nµ (cid:63) (cid:16) n and nµ (cid:63) (cid:16)
1, and empirically observing this for various choices of λ and λ originally motivated ourchoice of (cid:101) ρ in the proof of Theorem 3.1. Moreover, as we shall see below, even though the choiceof (cid:101) ρ was not optimal in the symmetric case, it did provide the best possible order of the ELBO.With all ingredients in place, we are now in a position to establish sharpness of our lower boundin Theorem 3.1. 26igure 2: Optimized value of the ELBO for the example in Eq. (3.8) as a function of log n fordifferent combinations of ( λ , λ ). Theorem 3.2.
Consider the normal form of a two parameter singular model with h = ( h , h ) and k = ( k , k ) . The mean field approximation to the evidence lower bound of normal form is ELBO( F MF ) = − λ log n + C for some constant C , where λ is the RLCT.Proof. Recall that ELBO( F MF ) = Ψ n ( ρ (cid:63) ) = lim t →∞ Ψ n ( ρ t ) with the expression for Ψ n ( · ) providedin Eq. (3.11). From Lemma 3.3, we know that ρ (cid:63) corresponds to µ (cid:63) = O (1) & µ (cid:63) (cid:16) /n in theasymmetric case, and µ (cid:63) = µ (cid:63) (cid:16) / √ n in the symmetric case. Substituting these values separatelyin the expression for Ψ n ( · ) and using the bounds from Lemma 3.1 delivers the desired result.Figure 2 empirically confirms the conclusion of Theorem 3.2 by plotting Ψ n ( ρ (cid:63) ) as a functionof log n for different combinations of ( λ , λ ). Specifically, for each of the four pairs of ( λ , λ ),we vary n over a fixed grid of values in the logarithmic scale. For each value of n , we run thedynamical system in Eq. (3.12) to convergence, and plot the corresponding converged value of theELBO from Eq. (3.11) against log n . In each case, the points almost exactly line up with a slope of − min { λ , λ } .In Appendix C, we revisit the one-layered neural network example and obtain the coordinateascent updates both in the original ( θ , θ ) coordinates and in the transformed ( ξ , ξ ) coordinatesthat render the likelihood into a normal form. Refer to Eq. (C.1) and Eq. (C.2) and their derivations27n Appendix C. The plot of the evidence lower bound in the transformed coordinates (Eq. C.3) asa function of sample size n is shown in Figure 3. Recall that λ = λ = 1 / − /
2, thus delivering the correct order of the evidence lower bound andonce again confirming the conclusion of Theorem 3.2. We note here that the CAVI updates in theoriginal untransformed parameterization did not produce the correct order of the ELBO in thisexample.In this example, and in general, the transformation to normal form requires explicit knowledgeof the RLCT, which is only known in a handful of settings as indicated earlier. As many singularmodels have unknown RLCT, it would be worthwhile to develop a general class of transformation-based variational families where we can learn the necessary transformation of the parameters tothe normal form before assuming a mean-field structure in the transformed parameterization.Figure 3: The ELBO as a function of log n in the one-layered neural network example. A Remaining proofs from Section 2
A.1 Proof of Lemma 2.1
For G ∼ Gamma( α, λ ) and any t > P ( G ≤ t ) = λ α Γ( α ) (cid:90) t e − λx x α − dx = 1Γ( α ) (cid:90) λt e − x x α − dx. is an increasing function of λ (for fixed α and t ). Thus, if Z i ∼ Gamma( α, λ i ) for i = 1 , λ > λ , then Z < st Z .The proof then follows from the fact that if Z , Z , Z are non-negative random variables with28 < st Z , then Z + Z < st Z + Z . A.2 Proof of Corollary 2.1
Since b ( · ) is analytic on U containing Ω, we have, for any ξ ∈ Ω that b ( ξ ) = b ( ) + (cid:88) | α |≥ ∂ α b ( ) α ! ξ α , where α = ( α , . . . , α d ) is a multi-index with | α | = (cid:80) dj − α j , ∂ α b = ∂ α . . . ∂ α d b , and α ! = α ! . . . α d !.Now use the dominated convergence theorem to interchange the integral and sum, and observe thatthe constant term provides the dominating order. A.3 Proof of Proposition 2.1
We show that W n ( ξ ) converges weakly to the Gaussian process W ∗ . By Theorem 1.5.7 in van derVaart and Wellner [1996] it suffices to show the marginal weak convergence and asymptotic tightnessof W n ( ξ ). We begin with the convergence of the marginals. For ξ , . . . , ξ L ∈ [0 , d with integer L >
0. Applying the multivariate central limit theorem, as n → ∞ ,( W n ( ξ ) , . . . , W n ( ξ L )) → N (0 , C )where C = ( c w ( ξ i , ξ j )) ≤ i,j ≤ L . Next we show the asymptotic tightness of W n ( ξ ) by proving thefollowing three sufficient conditions. First [0 , d is totally bounded. The second condition is thetightness of W n ( ξ ) for a fixed ξ . Fix ξ ∈ [0 , d , for (cid:15) >
0. We need to show that there exists acompact set K , such that P { W n ( ξ ) ∈ K } > − (cid:15) . We construct K = {| W n ( ξ ) | ≤ t } with t chosenas follows. By Assumption A1 , ˜ Z i ( ξ ) are independent centered sub-Gaussian and by Theorem2.6.2 of Vershynin [2018] P ( | W n ( ξ ) | ≥ t ) ≤ P (cid:32)(cid:12)(cid:12)(cid:12) n (cid:88) i =1 ˜ Z i ( ξ ) (cid:12)(cid:12)(cid:12) ≥ √ nt (cid:33) ≤ − ct )for some constant c >
0. Choosing t = (cid:112) /(cid:15) ) completes the proof of tightness.The third condition is that W n ( ξ ) is asymptotically uniformly d -equicontinuous, where d ( ξ, ζ ) =29 ξ − ζ (cid:107) is the metric generated by the norm in Assumption A1 . W n ( ξ ) is said to be asymptoticallyuniformly d -equicontinuous if for any η, (cid:15) >
0, there exists a δ > P (cid:40) sup d ( ξ,ζ ) <δ | W n ( ξ ) − W n ( ζ ) | > (cid:15) (cid:41) < η. To that end, sup d ( ξ,ζ ) <δ | W n ( ξ ) − W n ( ζ ) | ≤ √ n n (cid:88) i =1 | ˜ Z i ( ξ ) − ˜ Z i ( ζ ) | ≤ δ √ n n (cid:88) i =1 L ( X i )and P (cid:110) sup d ( ξ,ζ ) <δ | W n ( ξ ) − W n ( ζ ) | > (cid:15) (cid:111) ≤ P (cid:110) n (cid:88) i =1 L ( X i ) > √ n(cid:15)δ (cid:111) = P (cid:104) exp (cid:110) t n (cid:88) i =1 L ( X i ) (cid:111) > exp (cid:16) t √ n(cid:15)δ (cid:17)(cid:105) ≤ exp (cid:8) − t √ n(cid:15)/δ + ntc L / (cid:9) for any t >
0, where the final inequality follows from Markov’s and
Assumption A1 . Setting t = (cid:15)/ ( δ √ nc L ), we obtain P (cid:110) sup d ( ξ,ζ ) <δ | W n ( ξ ) − W n ( ζ ) | > (cid:15) (cid:111) ≤ e − (cid:15) / (2 c L δ ) . Choosing δ = (cid:15)/ ( c L (cid:112) /η )) completes the proof of asymptotically uniformly d -equicontinuous.Therefore the conditions of Theorem 1.5.7 in van der Vaart and Wellner [1996] are met and W n ( ξ )converges weakly to a Gaussian process. A.4 Proof of Proposition 2.2
A.4 Proof of Proposition 2.2

We shall prove only the second part; the proof of the first part is very similar and is omitted. We use the notation $G(t, W) := t^{\lambda - 1} e^{-t - \sqrt{t}\, W}$. Then $\int_0^n |D_n(t) - D(t)|\, dt$ is bounded above by
$$\int_0^n \int |G(t, W^*(0, \xi_J)) - G(t, W^*(\xi_I, \xi_J))|\, \varphi_{\xi | Z = -\log(t/n)}(\xi_I)\, \varphi(\xi_J)\, d\xi\, dt + \int_0^n \int |G(t, W_n(\xi_I, \xi_J)) - G(t, W^*(\xi_I, \xi_J))|\, \varphi_{\xi | Z = -\log(t/n)}(\xi_I)\, \varphi(\xi_J)\, d\xi\, dt. \qquad (A.1)$$
To control the first term, for any given $\epsilon > 0$ there exists $\delta > 0$ such that
$$\sup_{\{\|\xi_I\| < \delta\}} \big|e^{-\sqrt{t}\, W^*(0, \xi_J)} - e^{-\sqrt{t}\, W^*(\xi_I, \xi_J)}\big| < \epsilon.$$
Then the first term is less than
$$\epsilon + 2 \int_0^n t^{\lambda-1} e^{-t + \sqrt{t}\, \|W^*\|_\infty} \Big[\int_{\{\|\xi_I\| \ge \delta\}} \varphi_{\xi | Z = -\log(t/n)}(\xi_I)\, d\xi_I\Big]\, dt. \qquad (A.2)$$
Observe that the inner integral in (A.2) is bounded above by $(m-1)\, P\{\xi_1 > \delta/\sqrt{m-1} \mid Z = -\log(t/n)\}$. The one-dimensional marginal $\xi_1 \mid Z$ of the conditional density (2.11) is given by
$$\varphi_{\xi_1 | Z}(\xi_1) = \frac{2k\, \xi_1^{2k-1}}{Z}, \qquad e^{-Z} \le \xi_1^{2k} \le 1.$$
Note that the sequence of random variables $f_n(\xi_1) = 1\{\xi_1 > \delta/\sqrt{m-1}\}\, \varphi_{\xi_1 | Z = -\log(t/n)}(\xi_1)$ converges to zero and is bounded above by the integrable function $\varphi_{\xi_1 | Z = -\log(t/n)}(\xi_1)$; hence an application of the dominated convergence theorem shows that $\int f_n(\xi_1)\, d\xi_1$ converges to 0. Another application of the DCT shows that the second term in (A.2) converges to 0.

The second term of (A.1) can be bounded above by
$$\int_0^\infty t^{\lambda-1} e^{-t} \int \big|e^{-\sqrt{t}\, W_n(\xi_I, \xi_J)} - e^{-\sqrt{t}\, W^*(\xi_I, \xi_J)}\big|\, \varphi_{\xi|Z=-\log(t/n)}(\xi_I)\, \varphi(\xi_J)\, d\xi\, dt.$$
Since $W_n \stackrel{w}{\to} W^*$, by the continuous mapping theorem the above converges weakly to 0. Finally,
$$\int_n^\infty D(t)\, dt = \int_{t=n}^\infty \int t^{\lambda-1} e^{-t - \sqrt{t}\, W^*(0, \xi_J)}\, \varphi_{\xi|Z=-\log(t/n)}(\xi_I)\, \varphi(\xi_J)\, d\xi\, dt \le \int_{t=n}^\infty t^{\lambda-1} e^{-t + \sqrt{t}\, \|W^*\|_\infty}\, dt,$$
which converges to 0 as $n \to \infty$.

B Remaining proofs from Section 3
B.1 Proof of Lemma 3.1
Part (i) follows from a change of variable $v = \beta u^{2k}$. For part (ii), we have, using the definition of $B(k, h, \beta)$,
$$G(\lambda, \beta) = \frac{B(k, 2k + h, \beta)}{B(k, h, \beta)} = \frac{(2k)^{-1}\, \beta^{-(\lambda+1)}\, \Gamma(\lambda+1)\, \gamma(\lambda+1, \beta)}{(2k)^{-1}\, \beta^{-\lambda}\, \Gamma(\lambda)\, \gamma(\lambda, \beta)} = \frac{\lambda}{\beta}\, \frac{\gamma(\lambda+1, \beta)}{\gamma(\lambda, \beta)}.$$
For the first part of part (iii), we have $\log B(k, h, \beta) = -\lambda \log \beta + \log \gamma(\lambda, \beta) +$ terms free of $\beta$; the conclusion follows since $\lim_{\beta\to\infty} \gamma(\lambda, \beta) = 1$. For the second part of part (iii), use Remark 3.1 to write
$$G(\lambda, \beta) = \frac{\lambda}{\beta}\Big(1 - \frac{\beta^{\lambda} e^{-\beta}}{\Gamma(\lambda+1)\, \gamma(\lambda, \beta)}\Big).$$
From the above, the conclusion follows since $\lim_{\beta\to\infty} \beta^{\lambda} e^{-\beta} = 0$ and $\lim_{\beta\to\infty} \gamma(\lambda, \beta) = 1$.
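Both the representation in part (ii) and the limits in part (iii) are easy to check numerically. In the sketch below (ours), $\gamma(\lambda, \beta)$ is the regularized lower incomplete gamma function, exposed by scipy as gammainc; as $\beta \to 0$ one sees $G(\lambda, \beta) \to \lambda/(\lambda+1)$ (the value $G(\lambda, 0)$ used in the proof of Lemma B.3 below), while for large $\beta$, $G(\lambda, \beta) \approx \lambda/\beta$:

    import numpy as np
    from scipy.special import gammainc

    def G(lam, beta):
        # Lemma 3.1(ii): G(lam, beta) = (lam/beta) * gamma(lam+1, beta)/gamma(lam, beta)
        return (lam / beta) * gammainc(lam + 1.0, beta) / gammainc(lam, beta)

    lam = 0.75
    for beta in [1e-6, 1e-3, 1.0, 10.0, 100.0]:
        print(f"beta={beta:8.1e}  G={G(lam, beta):.6f}  "
              f"lam/(lam+1)={lam / (lam + 1):.6f}  lam/beta={lam / beta:.6f}")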
B.2 Proof of Lemma 3.2

Asymmetric case ($\lambda_1 < \lambda_2$). We prove the first statement; the proof of the second is similar. Consider the function $F(x) = G(\lambda_1, nG(\lambda_2, nx)) - x$. We have $F(0) = G(\lambda_1, nG(\lambda_2, 0)) > 0$ and $F(\lambda_1/(\lambda_1+1)) < 0$, since $0 < G(\lambda_1, nG(\lambda_2, nx)) < \lambda_1/(\lambda_1+1)$ for all $x \ge 0$. Applying the intermediate value theorem to the function $F$ on the interval $\Lambda_1 = [0, \lambda_1/(\lambda_1+1)]$ gives the result. Notice that for $x > \lambda_1/(\lambda_1+1)$ the function $G(\lambda_1, nG(\lambda_2, nx)) - x < 0$, so there are no fixed points in $[0, \infty) \setminus \Lambda_1$.

Symmetric case ($\lambda_1 = \lambda_2$). We first state some preparatory results.
Lemma B.1. Suppose $f : (0, \infty) \to (0, \infty)$ is strictly monotone on $(0, \infty)$. Then $f$ has no periodic points of period $p > 2$.

Lemma B.2. The discrete dynamical system defined by
$$\mu_1(k) = f^{2k-1}(\mu), \qquad \mu_2(k) = f^{2k}(\mu), \qquad (B.1)$$
with initial point $\mu$, can only converge to a fixed point $x_0$ of $f$ or a two-cycle $C = \{x_1, x_2\}$ of $f$. Here $f^k(x)$ denotes the $k$-fold composition of $f$ evaluated at $x$.

Lemma B.3.
For $\lambda > 0$, the function $G(\lambda, nx)$ has a single fixed point $x^*$ in the interval $[0, \sqrt{\lambda/n}]$ and no other fixed points in $[0, \infty)$.

Proof. Suppose $\lambda > 0$. The function $G(\lambda, nx)$ is monotone decreasing on $[0, \infty)$. Its maximum is $\lambda/(\lambda+1) = G(\lambda, 0) > 0$. There must be a root of $G(\lambda, nx) - x$ on the interval $[0, \lambda/(\lambda+1)]$ by the intermediate value theorem. Similarly, $G(\lambda, nx) - x < 0$ on $(\lambda/(\lambda+1), \infty)$, so it has no roots on this interval.

We now provide a more detailed bound on the fixed point. It follows from Lemma 3.1 that $G(\lambda, nx) - x < \lambda/(nx) - x$. Letting $x^*$ denote the fixed point of $G(\lambda, nx)$ in $[0, \lambda/(\lambda+1)]$, we have
$$0 = G(\lambda, nx^*) - x^* < \frac{\lambda}{nx^*} - x^*, \qquad (B.2)$$
from which it follows that $(x^*)^2 < \lambda/n$. Therefore $0 < x^* < \sqrt{\lambda/n}$.

We now complete the proof of the symmetric case. To simplify notation, let $\lambda = \lambda_1 = \lambda_2$ and $F(x) = G(\lambda, nx)$, and let $F^k(x)$ denote the $k$-fold composition of $F$. The system in the symmetric case is
$$\mu_2(0) = \mu, \qquad \mu_1(k) = F^{2k-1}(\mu), \qquad \mu_2(k) = F^{2k}(\mu). \qquad (B.3)$$
The structure of the symmetric system heavily limits its possible convergence behavior. Lemma B.2 shows that this type of system can only converge to a fixed point of $F$ or to a 2-cycle of $F$. Furthermore, Lemma B.3 shows that $F(x) = G(\lambda, nx)$ has only a single fixed point $x^*$, which asymptotically approaches $\sqrt{\lambda/n}$, and no periodic points of order $p > 1$. This completes the proof.
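Lemma B.3 is easy to visualize numerically. The sketch below (ours; the values of $\lambda$ and $n$ are arbitrary) locates the fixed point of $F(x) = G(\lambda, nx)$ by root-finding on $[0, \lambda/(\lambda+1)]$, where the proof guarantees a sign change, and confirms the bound $0 < x^* < \sqrt{\lambda/n}$:

    import numpy as np
    from scipy.optimize import brentq
    from scipy.special import gammainc

    def G(lam, beta):
        # Lemma 3.1(ii) representation via the regularized incomplete gamma
        return (lam / beta) * gammainc(lam + 1.0, beta) / gammainc(lam, beta)

    lam, n = 0.5, 100
    # F(x) - x changes sign on (0, lam/(lam+1)] (proof of Lemma B.3)
    x_star = brentq(lambda x: G(lam, n * x) - x, 1e-12, lam / (lam + 1.0))
    print(f"x* = {x_star:.6e}, sqrt(lam/n) = {np.sqrt(lam / n):.6e}")
    assert 0.0 < x_star < np.sqrt(lam / n)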
B.3 Proof of Lemma 3.3
The proof for the symmetric case has already been established in the previous subsection, so we focus on the asymmetric case. We begin by recording a useful result: we will need to know whether there exist solutions to certain equations which arise when studying the order of the fixed points of the system (3.12). More specifically, these equations are $zG(\alpha, z) = \beta$ and $1 = xG(\alpha, \beta x)$; they are equivalent under the change of variables $\beta x = z$. The existence of solutions is settled by the following lemma.

Lemma B.4. Suppose $\alpha, \beta \in (0, \infty)$, and recall $G$ defined as in Lemma 3.1.
1. For $\alpha > \beta$, $zG(\alpha, z) = \beta$ has a solution in $(0, \infty)$.
2. For $\alpha \le \beta$, $zG(\alpha, z) = \beta$ has no solution in $(0, \infty)$.

Proof. 1. If $\alpha > \beta$, apply the intermediate value theorem to $zG(\alpha, z) - \beta$, which starts at $-\beta$ for $z = 0$ and converges to $\alpha - \beta > 0$ as $z \to \infty$. So the equation has a root.
2. If $\alpha \le \beta$, note from Remark 3.1 that $zG(\alpha, z) = \alpha\big(1 - z^{\alpha} e^{-z}/(\Gamma(\alpha+1)\, \gamma(\alpha, z))\big) < \alpha \le \beta$ for all $z \in (0, \infty)$, so the equation has no root in $(0, \infty)$.

Back to the main proof, write $\mu_1^* = c_1/f_1(n)$ and $\mu_2^* = c_2/f_2(n)$, where $c_1, c_2$ are constants that do not depend on $n$ and $f_1, f_2 : \mathbb{N} \to (0, \infty)$. We will analyze the behavior of the order of the fixed point $\mu_1^*$; the analysis of $\mu_2^*$ is similar. Writing $f = f_1$ for brevity, the fixed point equation becomes
$$\mu_1^* = G(\lambda_1, nG(\lambda_2, n\mu_1^*)), \qquad \text{i.e.,} \qquad \frac{c_1}{f(n)} = G\Big(\lambda_1, nG\Big(\lambda_2, \frac{c_1 n}{f(n)}\Big)\Big). \qquad (B.4)$$
Case 1. Assume that $\lim_{n\to\infty} f(n)/n = \infty$. Then $\lim_{n\to\infty} n/f(n) = 0$, and for sufficiently large $n$ we have
$$\frac{c_1}{f(n)} = G\Big(\lambda_1, nG\Big(\lambda_2, \frac{c_1 n}{f(n)}\Big)\Big) \approx G\Big(\lambda_1, \frac{n\lambda_2}{\lambda_2+1}\Big) \approx \frac{\lambda_1(\lambda_2+1)}{\lambda_2\, n}. \qquad (B.5)$$
This implies $f(n) = O(n)$, a contradiction of our assumption.
Case 2. Without loss of generality we can assume that $\lim_{n\to\infty} f(n)/n = 1$. Then $\lim_{n\to\infty} n/f(n) = 1$, and for sufficiently large $n$ we have
$$\frac{c_1}{f(n)} = G\Big(\lambda_1, nG\Big(\lambda_2, \frac{c_1 n}{f(n)}\Big)\Big) \approx G(\lambda_1, nG(\lambda_2, c_1)) \approx \frac{\lambda_1}{nG(\lambda_2, c_1)}. \qquad (B.6)$$
The constant $c_1$ is therefore a solution to the additional constraining equation
$$zG(\lambda_2, z) = \lambda_1. \qquad (B.7)$$
Applying the analogous argument to $\mu_2^*$, we get that $c_2$ must solve the constraining equation
$$zG(\lambda_1, z) = \lambda_2. \qquad (B.8)$$
Since $\lambda_1 < \lambda_2$, by Lemma B.4 there is a solution to equation (B.7) for $c_1$, but no solution to equation (B.8) for $c_2$.
Case 3. Assume that $\lim_{n\to\infty} f(n)/n = 0$ and $\lim_{n\to\infty} f(n) = \infty$. Then $\lim_{n\to\infty} n/f(n) = \infty$, and for sufficiently large $n$ we have
$$\frac{c_1}{f(n)} = G\Big(\lambda_1, nG\Big(\lambda_2, \frac{c_1 n}{f(n)}\Big)\Big) \approx G\Big(\lambda_1, \frac{\lambda_2 f(n)}{c_1}\Big) \approx \frac{\lambda_1 c_1}{\lambda_2 f(n)}. \qquad (B.9)$$
Cancelling $c_1/f(n)$ from both sides forces $\lambda_1 = \lambda_2$; under the assumption $\lambda_1 < \lambda_2$ we have reached a contradiction.
Case 4. Assume that $\lim_{n\to\infty} f(n)/n = 0$ and $f(n)$ is bounded. This implies that $\mu_1^* \approx 1/c_1$, where $c_1$ is a constant independent of $n$, and
$$\frac{1}{c_1} = G\Big(\lambda_1, nG\Big(\lambda_2, \frac{n}{c_1}\Big)\Big) \approx G(\lambda_1, \lambda_2 c_1), \qquad (B.10)$$
so that $c_1$ must solve $1 = zG(\lambda_1, \lambda_2 z)$. By Lemma B.4 (after the change of variables $w = \lambda_2 z$), the equation $1 = zG(\lambda_2, \lambda_1 z)$, which determines $c_2$ in the analogous argument for $\mu_2^*$, has a solution, but the equation $1 = zG(\lambda_1, \lambda_2 z)$, which determines $c_1$, has no solution.

These four cases exhaust all the possible behaviors of a generic $f : \mathbb{N} \to (0, \infty)$. Therefore we have exhausted all possible bounded fixed points of the functions $G(\lambda_1, nG(\lambda_2, nx))$ and $G(\lambda_2, nG(\lambda_1, nx))$. For $\lambda_1 < \lambda_2$, the only possible solutions to the additional constraining equations yield $\mu_2^* = 1/c_2$, where $c_2$ is the solution to $1 = zG(\lambda_2, \lambda_1 z)$, and $\mu_1^* = c_1/n$, where $c_1$ is the solution to $\lambda_1 = zG(\lambda_2, z)$. This completes the proof of the order in the asymmetric case.
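The predicted order can be checked numerically. The sketch below (ours; the values $\lambda_1 = 0.5 < \lambda_2 = 1.5$ are arbitrary) solves the constraining equation (B.7) for $c_1$ and compares $c_1$ with $n\mu_1^*$, where $\mu_1^*$ is the exact fixed point of $x \mapsto G(\lambda_1, nG(\lambda_2, nx))$ obtained by root-finding:

    import numpy as np
    from scipy.optimize import brentq
    from scipy.special import gammainc

    def G(lam, beta):
        return (lam / beta) * gammainc(lam + 1.0, beta) / gammainc(lam, beta)

    lam1, lam2 = 0.5, 1.5                                      # lam1 < lam2
    c1 = brentq(lambda z: z * G(lam2, z) - lam1, 1e-8, 1e8)    # equation (B.7)

    for n in [10**3, 10**4, 10**5]:
        mu1 = brentq(lambda x: G(lam1, n * G(lam2, n * x)) - x,
                     1e-14, lam1 / (lam1 + 1.0))
        print(f"n={n:>7d}  n*mu1 = {n * mu1:.6f}  c1 = {c1:.6f}")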
C Derivation of CAVI and ELBO for Example 27 in Watanabe's book

The CAVI updates in the original $(\theta_1, \theta_2)$ coordinates are given by
$$\log q_1(\theta_1) = -\frac{nK_2}{2}\Big(\theta_1 - \frac{K_1}{K_2}\Big)^2 - \frac{n}{2} S_{yy} + \frac{nK_1^2}{2K_2} + \mathrm{Const},$$
$$\log q_2(\theta_2) = -\frac{nm_2}{2}\, T_2(\theta_2, X^n) + n m_1\, T_1(\theta_2, X^n, Y^n) - \frac{n}{2} S_{yy} + \mathrm{Const}, \qquad (C.1)$$
where $m_1 = \int \theta_1\, q_1(\theta_1)\, d\theta_1$, $m_2 = \int \theta_1^2\, q_1(\theta_1)\, d\theta_1$, $S_{yy} = n^{-1} \sum_{i=1}^n y_i^2$, and
$$T_1(X^n, Y^n, \theta_2) = \frac{1}{n} \sum_{i=1}^n y_i \tanh(\theta_2 x_i), \qquad T_2(X^n, \theta_2) = \frac{1}{n} \sum_{i=1}^n \tanh^2(\theta_2 x_i),$$
$$K_1(X^n, Y^n) = \int T_1(X^n, Y^n, \theta_2)\, q_2(\theta_2)\, d\theta_2, \qquad K_2(X^n) = \int T_2(X^n, \theta_2)\, q_2(\theta_2)\, d\theta_2.$$
It follows that the ELBO is given by $I_p + I_1 + I_2$, where $I_p = E_q[\log p(Y^n \mid X^n, \theta_1, \theta_2)]$ and $I_1, I_2$ are the entropies of $q_1$ and $q_2$:
$$I_p = -\frac{n}{2}\log(2\pi) - \frac{n}{2} S_{yy} + n m_1 K_1(X^n, Y^n) - \frac{n}{2} m_2 K_2(X^n),$$
$$I_1 = \log Z_1 + \frac{nK_2}{2}\Big(m_2 - 2\frac{K_1}{K_2}\, m_1 + \frac{K_1^2}{K_2^2}\Big), \qquad I_2 = \log Z_2 + \frac{n m_2}{2} K_2(X^n) - n m_1 K_1(X^n, Y^n),$$
with $Z_1, Z_2$ the normalizing constants of $q_1, q_2$.

In the following, we transform coordinates to reduce the likelihood to a normal form. To aid this, we rewrite the likelihood function as
$$p(Y^n \mid X^n, \theta_1, \theta_2) = \frac{1}{(2\pi)^{n/2}} \exp\Big\{-\frac{n}{2} S_{yy} - \frac{n}{2}\, \theta_1^2 \theta_2^2\, K_{n,2}(\theta_2, X^n) + n\, \theta_1 \theta_2\, K_{n,1}(\theta_2, X^n, Y^n)\Big\},$$
where
$$K_{n,2}(\theta_2, X^n) = \frac{1}{n}\sum_{i=1}^n \Big\{\frac{\tanh(\theta_2 x_i)}{\theta_2}\Big\}^2, \qquad K_{n,1}(\theta_2, X^n, Y^n) = \frac{1}{n}\sum_{i=1}^n \frac{y_i \tanh(\theta_2 x_i)}{\theta_2}.$$
We use the following asymptotic limits of $K_{n,1}$ and $K_{n,2}$ to approximate and thus simplify the likelihood further:
$$K_{n,1} \to \frac{1}{\theta_2}\, E\big(E[Y \tanh(\theta_2 X) \mid X]\big) = 0, \qquad K_{n,2} \to K(\theta_2) = \int_0^1 \frac{\tanh^2(\theta_2 x)}{\theta_2^2}\, dx.$$
Therefore the posterior $p(\theta_1, \theta_2 \mid X^n, Y^n)$ concentrates around
$$\gamma^{(n)}_K(\theta_1, \theta_2) \propto \exp\{-n K(\theta_1, \theta_2)\}\, \varphi(\theta_1, \theta_2), \qquad K(\theta_1, \theta_2) = \frac{1}{2}\, \theta_1^2 \theta_2^2\, K(\theta_2).$$
Conversion to normal form is done by the change of variables
$$\xi_1 = \theta_1, \qquad \xi_2 = \theta_2 \big(K(\theta_2)\big)^{1/2} := g(\theta_2).$$
The inverse transform is $\theta_1 = \xi_1$ and $\theta_2 = g^{-1}(\xi_2)$, with Jacobian
$$J(\xi_2) = \Big|\frac{1}{g'(g^{-1}(\xi_2))}\Big|.$$
In the $\xi$-coordinate system we have
$$K(\theta_1, \theta_2) = K(\xi_1, g^{-1}(\xi_2)) = \frac{\xi_1^2 \xi_2^2}{2}.$$
The concentrated normal form of the posterior is
$$\gamma^{(n)}_K(\xi_1, \xi_2) = \exp\Big(-\frac{n}{2}\, \xi_1^2 \xi_2^2 + \log J(\xi_2) - \log C_L\Big),$$
where $C_L$ is the normalizing constant. The CAVI updates in these coordinates are
$$q_1(\xi_1) = \exp\Big\{-\frac{n}{2} F_2\, \xi_1^2 - \log C_1\Big\}, \qquad q_2(\xi_2) = \exp\Big\{-\frac{n}{2} F_1\, \xi_2^2 + \log J(\xi_2) - \log C_2\Big\}, \qquad (C.2)$$
where
$$F_1 = \int \xi_1^2\, q_1(\xi_1)\, d\xi_1, \qquad F_2 = \int_{g(0)}^{g(1)} \xi_2^2\, q_2(\xi_2)\, d\xi_2.$$
The negative KL divergence between the variational distribution $q(\xi) = q_1(\xi_1) \otimes q_2(\xi_2)$ and the normal form of the posterior $\gamma^{(n)}_K(\xi)$ is $-D(q(\xi)\, \|\, \gamma^{(n)}_K(\xi)) = I_1 + I_2 + I_\gamma$, where
$$I_\gamma = -\frac{n}{2} F_1 F_2 - \log(C_L), \qquad I_1 = \frac{n}{2} F_1 F_2 + \log(C_1), \qquad I_2 = \frac{n}{2} F_1 F_2 + \log(C_2).$$
Since the evidence lower bound satisfies $\Psi_n(q) = -D(q\, \|\, \gamma^{(n)}_K) + \log C_L$, and $n F_1 F_2 = 1$ whenever $q_1$ is at its conditional optimum (a mean-zero Gaussian with variance $1/(nF_2)$), the optimal evidence lower bound is, up to an additive constant free of $n$,
$$\Psi_n(q^*) = -\frac{n}{2} F_1^* F_2^* + \log(C_1^*) + \log(C_2^*). \qquad (C.3)$$
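The normal-form CAVI (C.2) can be run directly. The following sketch (ours; it uses plain Riemann sums on fixed grids and relies on the Jacobian identity $\int h(\xi_2) J(\xi_2)\, d\xi_2 = \int_0^1 h(g(\theta_2))\, d\theta_2$ to integrate in the $\theta_2$ scale) alternates the two moment updates: the Gaussian component gives $F_1 = 1/(nF_2)$ in closed form, and $F_2 = E_{q_2}[\xi_2^2]$ is computed by numerical integration over $\theta_2$:

    import numpy as np

    n = 10_000
    th = np.linspace(1e-6, 1.0, 4000)             # theta_2 grid on (0, 1]
    x = np.linspace(0.0, 1.0, 2000)               # inner grid for K(theta_2)

    # K(theta_2) = int_0^1 tanh^2(theta_2 x)/theta_2^2 dx and g = theta_2 sqrt(K)
    K = np.array([np.mean(np.tanh(t * x) ** 2) / t**2 for t in th])
    g = th * np.sqrt(K)

    F2 = g[-1] ** 2 / 3.0                         # arbitrary initialization
    for _ in range(100):
        F1 = 1.0 / (n * F2)                       # q1 update: N(0, 1/(n F2))
        w = np.exp(-0.5 * n * F1 * g**2)          # q2(xi_2) pulled back to theta_2
        F2 = np.mean(g**2 * w) / np.mean(w)       # F2 = E_{q2}[xi_2^2]
    print(f"F1 = {F1:.4e}, F2 = {F2:.4e}, n*F1*F2 = {n * F1 * F2:.4f}")

At convergence the printed product $nF_1F_2$ is approximately 1, the identity used above in arriving at (C.3).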
D Verifying Assumption A1 for Example 27 in Watanabe's book

We bound the difference $|\tilde{Z}_i(\xi) - \tilde{Z}_i(\xi')|$ by repeated application of the triangle inequality,
$$|\tilde{Z}_i(\xi) - \tilde{Z}_i(\xi')| \le A + B + C,$$
where
$$A = \frac{1}{2}\, \big|\xi_1^2\, (g^{-1}(\xi_2))^2\, F_2(X_i; g^{-1}(\xi_2)) - (\xi_1')^2\, (g^{-1}(\xi_2'))^2\, F_2(X_i; g^{-1}(\xi_2'))\big|,$$
$$B = \big|Y_i\, F_1(X_i; g^{-1}(\xi_2)) - Y_i\, F_1(X_i; g^{-1}(\xi_2'))\big|,$$
$$C = \frac{1}{2}\, \big|\xi_1^2\, (g^{-1}(\xi_2))^2\, K(g^{-1}(\xi_2)) - (\xi_1')^2\, (g^{-1}(\xi_2'))^2\, K(g^{-1}(\xi_2'))\big|.$$
Denoting $\|x\|_1 = \sum_{j=1}^d |x_j|$ for $x \in \mathbb{R}^d$, using Lemmata D.1 and D.2 yields
$$A \le \frac{1}{2}\big(L^{0,2}_\infty(X_i) + C_{g^{-1}}\, L^{1,2}_\infty(X_i) + C_{g^{-1}}\, \|g^{-1}\|_\infty\, L^2(X_i)\big)\, \|\xi - \xi'\|_1 := L_a(X_i)\, \|\xi - \xi'\|_1,$$
$$B \le C_{g^{-1}}\, L^1(X_i)\, |Y_i|\, \|\xi - \xi'\|_1 := L_b(X_i, Y_i)\, \|\xi - \xi'\|_1,$$
$$C \le \frac{1}{2}\big(\|g^{-1}\|_\infty\, \|K\|_\infty + C_{g^{-1}}\, \|K\|_\infty + C_K\, C_{g^{-1}}\, \|g^{-1}\|_\infty\big)\, \|\xi - \xi'\|_1 := L_c(X_i)\, \|\xi - \xi'\|_1.$$
Combining the bounds we have $|\tilde{Z}_i(\xi) - \tilde{Z}_i(\xi')| \le L(X_i)\, \|\xi - \xi'\|_1$ with $L(X_i) = L_a(X_i) + L_b(Y_i, X_i) + L_c(X_i)$. It remains to show that $L(X_1)$ satisfies the tail bound $E\, e^{t L(X_1)} \le e^{t^2 c_L^2/2}$. We will show that $L_a(X_1)$, $L_b(Y_1, X_1)$ and $L_c(X_1)$ are all sub-Gaussian, which implies the above bound on $L(X_1)$ holds. $L_c(X_1)$ is a constant random variable, so it is trivially sub-Gaussian. $L_b(Y_1, X_1)$ is the product of a bounded random variable $L^1(X_1)$ with a sub-Gaussian random variable $|Y_1|$, hence sub-Gaussian. $L_a(X_1)$ is a sum of bounded random variables and hence sub-Gaussian.
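The Lipschitz constants in Lemma D.1 below can be spot-checked numerically; the sketch here (ours) verifies via finite differences on a grid that $F_1(X; \omega) = \tanh(\omega X)/\omega$ is $|X|$-Lipschitz in $\omega$:

    import numpy as np

    rng = np.random.default_rng(1)
    w = np.linspace(1e-4, 1.0, 500)
    for X in rng.uniform(0.0, 1.0, size=5):
        F = np.tanh(w * X) / w
        slopes = np.abs(np.diff(F)) / np.diff(w)   # finite-difference slopes
        assert slopes.max() <= abs(X) + 1e-8
        print(f"X={X:.3f}: max |slope| = {slopes.max():.4f} <= |X| = {X:.4f}")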
Lemma D.1. Suppose $X \sim U[0,1]$ and, for $s = 1, 2$, let $F_s(X; \omega) := \tanh^s(\omega X)/\omega^s$ for $\omega \in [0,1]$. Then there exist bounded random variables $L^{0,s}_\infty(X)$, $L^{1,s}_\infty(X)$, $L^s(X)$ such that
$$\sup_\omega F_s(X; \omega) \le L^{0,s}_\infty(X), \qquad \sup_\omega\, \omega F_s(X; \omega) \le L^{1,s}_\infty(X), \qquad |F_s(X; \omega) - F_s(X; \omega')| \le L^s(X)\, |\omega - \omega'|,$$
where $L^{k,s}_\infty(X) = L^s(X) = |X|$ for $k = 0, 1$ and $s = 1, 2$.

Proof. For $s = 1$, we have $F_1(X; \omega) = \tanh(\omega X)/\omega$ and $\omega F_1(X; \omega) = \tanh(\omega X)$. Both are globally Lipschitz in the $\omega$-variable, since both have bounded, continuous derivatives:
$$\frac{d}{d\omega} F_1(X; \omega) = \frac{d}{d\omega}\Big[\frac{\tanh(\omega X)}{\omega}\Big] = \frac{X\omega\, \mathrm{sech}^2(X\omega) - \tanh(X\omega)}{\omega^2}.$$
It follows that
$$\sup_{\omega \in [0,1]} |F_1(X; \omega)| \le |X| := L^{0,1}_\infty(X), \qquad \sup_{\omega \in [0,1]} |\omega F_1(X; \omega)| = |\tanh(X)| \le |X| := L^{1,1}_\infty(X),$$
$$\sup_{\omega \in [0,1]} \Big|\frac{d}{d\omega}\Big[\frac{\tanh(\omega X)}{\omega}\Big]\Big| = \sup_{\omega \in [0,1]} \Big|\frac{X}{\omega}\Big(\mathrm{sech}^2(X\omega) - \frac{\tanh(X\omega)}{X\omega}\Big)\Big| \le |X| := L^1(X).$$
Similarly, for $s = 2$ we have $F_2(X; \omega) = \tanh^2(\omega X)/\omega^2$ and $\omega F_2(X; \omega) = \tanh^2(\omega X)/\omega$. Both are again globally Lipschitz in $\omega$, since both have bounded, continuous derivatives:
$$\frac{d}{d\omega} F_2(X; \omega) = \frac{d}{d\omega}\Big[\frac{\tanh^2(\omega X)}{\omega^2}\Big] = \frac{2\tanh(X\omega)\big(X\omega\, \mathrm{sech}^2(X\omega) - \tanh(X\omega)\big)}{\omega^3},$$
and it follows as above that
$$\sup_{\omega \in [0,1]} |F_2(X; \omega)| \le |X| := L^{0,2}_\infty(X), \qquad \sup_{\omega \in [0,1]} |\omega F_2(X; \omega)| \le |X| := L^{1,2}_\infty(X), \qquad \sup_{\omega \in [0,1]} \Big|\frac{d}{d\omega} F_2(X; \omega)\Big| \le |X| := L^2(X).$$
Lemma D.2. The function
$$K(\omega) = \int_0^1 \frac{\tanh^2(\omega x)}{\omega^2}\, dx, \qquad \omega \in [0,1],$$
and the inverse of the function $g(\theta_2) = \theta_2 (K(\theta_2))^{1/2}$, $\theta_2 \in [0,1]$, are globally Lipschitz with Lipschitz constants $C_K$ and $C_{g^{-1}}$, respectively:
$$|K(\omega) - K(\omega')| \le C_K\, |\omega - \omega'|, \qquad |g^{-1}(\omega) - g^{-1}(\omega')| \le C_{g^{-1}}\, |\omega - \omega'|.$$

Proof. The function $K$ has a bounded derivative for $\omega \in [0,1]$, hence $K$ is globally Lipschitz on $[0,1]$ with Lipschitz constant $C_K = \sup_\omega |K'(\omega)|$. Also, since $g'$ is bounded away from zero on $[0,1]$ and, writing $\omega = g(\theta_2)$, $(d/d\omega)\, g^{-1}(\omega) = 1/g'(g^{-1}(\omega))$, the function $g^{-1}$ has a bounded derivative; hence $g^{-1}$ is globally Lipschitz with Lipschitz constant $C_{g^{-1}} = \sup_\omega |\frac{d}{d\omega} g^{-1}(\omega)|$.

References
Abramowitz, M. and Stegun, I. A. (1964). Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables. Number 55. Courier Corporation.
Aoyagi, M. (2010). Stochastic complexity and generalization error of a restricted Boltzmann machine in Bayesian estimation. Journal of Machine Learning Research, 11(Apr):1243-1272.
Aoyagi, M. (2019). Learning coefficient of Vandermonde matrix-type singularities in model selection. Entropy, 21(6):561.
Aoyagi, M. and Watanabe, S. (2005). Stochastic complexities of reduced rank regression in Bayesian estimation. Neural Networks, 18(7):924-933.
Austin, T. (2019). The structure of low-complexity Gibbs measures on product spaces. The Annals of Probability, 47(6):4002-4023.
Basak, A. and Mukherjee, S. (2017). Universality of the mean-field for the Potts model. Probability Theory and Related Fields, 168(3):557-600.
Bibinger, M. (2013). Notes on the sum and maximum of independent exponentially distributed random variables with different scale parameters. arXiv preprint arXiv:1307.3945.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Information Science and Statistics. Springer.
Chatterjee, S. and Dembo, A. (2016). Nonlinear large deviations. Advances in Mathematics, 299:396-450.
Drton, M., Lin, S., Weihs, L., and Zwiernik, P. (2017). Marginal likelihood and model selection for Gaussian latent tree and forest models. Bernoulli, 23(2):1202-1232.
Drton, M. and Plummer, M. (2017). A Bayesian information criterion for singular models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(2):323-380.
Friedlander, F. G. and Joshi, M. C. (1998). Introduction to the Theory of Distributions. Cambridge University Press.
Ghorbani, B., Javadi, H., and Montanari, A. (2018). An instability in variational inference for topic models. In International Conference on Machine Learning.
Hayashi, N. and Watanabe, S. (2017). Tighter upper bound of real log canonical threshold of non-negative matrix factorization and its application to Bayesian inference. Pages 1-8. IEEE.
Hironaka, H. (1964). Resolution of singularities of an algebraic variety over a field of characteristic zero: II. Annals of Mathematics, pages 205-326.
Hosino, T., Watanabe, K., and Watanabe, S. (2005). Stochastic complexity of variational Bayesian hidden Markov models. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pages 1114-1119.
Hosino, T., Watanabe, K., and Watanabe, S. (2006). Free energy of stochastic context free grammar on variational Bayes. In ICONIP.
Kass, R., Tierney, L., and Kadane, J. (1990). The validity of posterior expansions based on Laplace's method.
Lin, S. (2011). Algebraic Methods for Evaluating Integrals in Bayesian Statistics. PhD thesis, UC Berkeley.
MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.
Mathai, A. (1982). Storage capacity of a dam with gamma type inputs. Annals of the Institute of Statistical Mathematics, 34(3):591-597.
Mukherjee, S. S., Sarkar, P., Wang, Y. R., and Yan, B. (2018). Mean field for the stochastic blockmodel: optimization landscape and convergence issues. In Advances in Neural Information Processing Systems, pages 10694-10704.
Parisi, G. (1988). Statistical Field Theory. Frontiers in Physics. Addison-Wesley.
Pati, D., Bhattacharya, A., and Yang, Y. (2018). On statistical optimality of variational Bayes. In International Conference on Artificial Intelligence and Statistics, pages 1579-1588.
Plummer, S., Pati, D., and Bhattacharya, A. (2020). Dynamics of coordinate ascent variational inference: A case study in 2D Ising models. arXiv preprint arXiv:2007.06715.
Robert, C. (2007). The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer Science & Business Media.
Rusakov, D. and Geiger, D. (2005). Asymptotic model selection for naive Bayesian networks. Journal of Machine Learning Research, 6(Jan):1-35.
Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6(2):461-464.
Tierney, L. and Kadane, J. (1986). Accurate approximations for posterior moments and marginal densities. Journal of the American Statistical Association, 81(393):82-86.
van der Vaart, A. W. and Wellner, J. A. (1996). Weak convergence. In Weak Convergence and Empirical Processes, pages 16-28. Springer.
Vershynin, R. (2018). High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press.
Wainwright, M. J. and Jordan, M. I. (2008). Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc.
Wang, B., Titterington, D., et al. (2006). Convergence properties of a general algorithm for calculating variational Bayesian estimates for a normal mixture model. Bayesian Analysis, 1(3):625-650.
Watanabe, K., Shiga, M., and Watanabe, S. (2009). Upper bound for variational free energy of Bayesian networks. Machine Learning, 75(2):199-215.
Watanabe, K. and Watanabe, S. (2004). Lower bounds of stochastic complexities in variational Bayes learning of Gaussian mixture models. In IEEE Conference on Cybernetics and Intelligent Systems, 2004, volume 1, pages 99-104.
Watanabe, K. and Watanabe, S. (2005). Stochastic complexity for mixture of exponential families in variational Bayes. In Jain, S., Simon, H. U., and Tomita, E., editors, Algorithmic Learning Theory, pages 107-121, Berlin, Heidelberg. Springer.
Watanabe, K. and Watanabe, S. (2006). Stochastic complexities of Gaussian mixtures in variational Bayesian approximation. Journal of Machine Learning Research, 7(Apr):625-644.
Watanabe, K. and Watanabe, S. (2007a). Stochastic complexities of general mixture models in variational Bayesian learning. Neural Networks, 20(2):210-219.
Watanabe, K. and Watanabe, S. (2007b). Stochastic complexity for mixture of exponential families in generalized variational Bayes. Theoretical Computer Science, 387(1):4-17.
Watanabe, S. (1999). Algebraic analysis for singular statistical estimation. In International Conference on Algorithmic Learning Theory, pages 39-50. Springer.
Watanabe, S. (2001a). Algebraic analysis for nonidentifiable learning machines. Neural Computation, 13(4):899-933.
Watanabe, S. (2001b). Algebraic geometrical methods for hierarchical learning machines. Neural Networks, 14(8):1049-1060.
Watanabe, S. (2009). Algebraic Geometry and Statistical Learning Theory, volume 25. Cambridge University Press.
Watanabe, S. (2013). A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14(Mar):867-897.
Watanabe, S. (2018). Mathematical Theory of Bayesian Statistics. CRC Press.
Yamazaki, K. and Watanabe, S. (2003). Singularities in mixture models and upper bounds of stochastic complexity. Neural Networks, 16(7):1029-1038.
Yan, J. (2020). Nonlinear large deviations: Beyond the hypercube. Annals of Applied Probability, 30(2):812-846.
Yang, Y., Pati, D., Bhattacharya, A., et al. (2020). α-variational inference with statistical guarantees. Annals of Statistics, 48(2):886-905.
Zhang, A. Y. and Zhou, H. H. (2020). Theoretical and computational guarantees of mean field variational inference for community detection.