Gaussian variational approximation with sparse precision matrices
Linda S. L. Tan · David J. Nott
Abstract
We consider the problem of learning a Gaussian variational approximation to the posterior distribution for a high-dimensional parameter, where we impose sparsity in the precision matrix to reflect appropriate conditional independence structure in the model. Incorporating sparsity in the precision matrix allows the Gaussian variational distribution to be both flexible and parsimonious, and the sparsity is achieved through parameterization in terms of the Cholesky factor. Efficient stochastic gradient methods which make appropriate use of gradient information for the target distribution are developed for the optimization. We consider alternative estimators of the stochastic gradients which have lower variation and are more stable. Our approach is illustrated using generalized linear mixed models and state space models for time series.
Keywords
Gaussian variational approximation · stochastic gradient algorithms · sparse precision matrix · variational Bayes

Linda S. L. Tan
Department of Statistics and Applied Probability, National University of Singapore, 6 Science Drive 2, Singapore 117546
Tel.: +65-6516-4416, Fax: +65-6872-3919, E-mail: [email protected]

David J. Nott
Department of Statistics and Applied Probability, National University of Singapore, 6 Science Drive 2, Singapore 117546
Operations Research and Analytics Cluster, National University of Singapore, 21 Lower Kent Ridge Road, Singapore 119077
Tel.: +65-6516-2744, Fax: +65-6872-3919, E-mail: [email protected]
1 Introduction

Bayesian inference provides a principled way of combining data with prior beliefs through the application of Bayes' rule. The posterior distribution is, however, often intractable and simulation-based Markov chain Monte Carlo (MCMC) methods have become a central tool in Bayesian computation. In recent years, variational methods (Jordan et al., 1999) have also emerged as an important alternative to MCMC, providing fast approximate inference for complex, high-dimensional models. Unlike MCMC, which can be made arbitrarily accurate, variational methods make certain simplifying assumptions about the posterior density (e.g. a tractable form q(θ), where θ denotes the vector of variables) and seek to optimize the Kullback-Leibler divergence D_KL(q||p) between q(θ) and the true posterior p(θ|y) subject to these assumed restrictions. While earlier research on variational methods concentrated on conjugate models with analytically tractable expectations, under which the variational Bayes approach (Attias, 1999) yields efficient closed-form updates (Winn and Bishop, 2005), recent focus considers stochastic gradient approximation methods (Robbins and Monro, 1951) for non-conjugate models (e.g. Paisley et al., 2012; Salimans and Knowles, 2013). Further discussion of the literature is deferred to Section 2. Rohde and Wand (2015) give a nice recent summary of alternatives to stochastic gradient approaches for handling non-conjugacy in the variational Bayes framework.

Titsias and Lázaro-Gredilla (2014) propose a simple yet effective variational method known as "doubly stochastic variational inference", where the approximating density is parameterized in terms of its mean µ and a lower triangular scale matrix L. An efficient stochastic gradient algorithm is then developed for optimizing µ and L by (1) parameterizing the vector of variables θ as Lz + µ, where z is a random variable that can be sampled easily from a base distribution that does not depend on the variational parameters (see also Kingma and Welling, 2014; Rezende et al., 2014), and (2) sub-sampling from the data. The stochastic gradients constructed in this manner are "doubly stochastic" as they are built upon two sources of stochasticity, coming from sampling from the variational distribution and from the full data set. This approach is very general in that it can be applied to any model where the joint density is differentiable. Unlike variational Bayes, it does not assume independence relationships among blocks of an appropriate partition of θ. Such independence assumptions have been shown to result in underestimation of the posterior variance (Wang and Titterington, 2005; Bishop, 2006). The quality of the resulting approximation is thus limited only by how well the form of q(θ) matches the true posterior. Using this approach, Kucukelbir et al. (2016) develop an automatic differentiation variational inference (ADVI) algorithm in Stan, where q(θ) is assumed to be either a diagonal (mean-field) or unrestricted Gaussian variational approximation. Constrained variables are transformed to the real line via Stan's library of transformations and the gradients are computed using Monte Carlo integration.
They note that while unrestricted ADVI is able to capture posterior correlations and hence produces more accurate marginal variance estimates than mean-field ADVI, it can be prohibitively slow for large data since the number of variational parameters scales as the square of the length of θ.

In this article, we consider variational approximations which take the form of a multivariate Gaussian distribution N(µ, Σ) for models with high-dimensional parameters (µ denotes the mean and Σ the covariance matrix). However, instead of expressing Σ as LL^T and optimizing L (the Cholesky factor of Σ) as in Titsias and Lázaro-Gredilla (2014) and Kucukelbir et al. (2016), we parameterize the optimization problem in terms of the Cholesky factor of the precision matrix. This parameterization is important as it provides an avenue to impose a sparsity structure in the precision matrix that reflects conditional independence relationships in the posterior. This sparsity structure reduces computational complexity greatly and enables fast inference for models with a large number of variables without having to assume independence relationships in the posterior. We demonstrate how our approach can be applied to generalized linear mixed models (GLMMs) and state space models (SSMs) for time series data. Assuming that the number of global variables is small compared to the number of local variables, our approach reduces the number of variational parameters to be updated in each iteration from O(n²) to O(n), where n denotes the number of subjects in GLMMs and the length of the time series in SSMs. In this way, the accuracy of using an unrestricted lower triangular matrix L can be achieved at a computational cost of the same order of magnitude as using a diagonal matrix L.

Recently, several classes of richer variational approximations which go beyond factorized (mean-field) approximating densities and which are able to reflect the posterior dependence structure to varying degrees have been proposed (e.g. Gershman et al., 2012; Salimans and Knowles, 2013). Rezende and Mohamed (2015) propose the specification of the approximate posterior using normalizing flows. Starting with, say, a simple factorized distribution, highly flexible and complex approximate posteriors are constructed by transforming the initial density through a sequence of invertible mappings which perform expansions or contractions of the probability mass in targeted regions. The resulting chain of transformed densities is known as a normalizing flow. The authors show that the true posterior can be recovered asymptotically under the Langevin flow, thus overcoming an important limitation of variational inference. Ranganath et al. (2016) propose hierarchical variational models which are built by placing a prior distribution on the parameters of a mean-field variational approximation and then proceeding to integrate out the mean-field parameters. They specify the prior using normalizing flows and demonstrate that the hierarchical variational model achieves better performance in terms of perplexity and held-out likelihood for deep exponential families (Ranganath et al., 2015). Structured stochastic variational inference (Hoffman and Blei, 2015) is a generalization of stochastic variational inference that allows dependencies between global and local variables. In the approximating density, independence is assumed only among elements of the global variables and among the local variables conditional on the global variables.
Dependence between a local variable and the global variables is captured via a local parameter defined implicitly as the point at which the local evidence lower bound is maximized.

Archer et al. (2016) develop "black-box" variational inference (Ranganath et al., 2014) for SSMs, where a Gaussian variational approximation is considered for the latent states. To capture the temporal correlation structure, the precision matrix is assumed to be block tri-diagonal. While related, our approach differs from Archer et al. (2016) in several aspects. First, for the SSM application, we consider a joint Gaussian variational approximation for the model parameters and latent states, while Archer et al. (2016) assume that the model parameters are known and consider a Gaussian approximate posterior for the latent states only. Second, we optimize the Cholesky factor of the precision matrix directly, while Archer et al. (2016) consider other parameterizations such as defining the approximate posterior through a product of Gaussian factors and parameterizing the mean and the blocks in the tri-diagonal inverse covariance using neural networks. Third, we consider a more general sparsity structure in the precision matrix, which reflects the conditional independence structure in the posterior distribution and is not limited to band matrices. We also consider an alternative estimator of the stochastic gradient which differs from the "black-box" approach of Archer et al. (2016) as well as that used by Titsias and Lázaro-Gredilla (2014) and Kucukelbir et al. (2016). We demonstrate empirically that this estimator has lower variance at the mode and is helpful in improving the stability and precision of the proposed algorithm. This estimator is inspired by Han et al. (2016), who propose using Gaussian copulas to accommodate models whose posteriors are, for instance, skewed, heavy-tailed or multi-modal and hence unsuited to a Gaussian variational approximation. Our idea of introducing sparsity via the Cholesky factor of the precision matrix may prove useful in this context as well. The relationship between the Laplace and the Gaussian variational approximation is discussed in Opper and Archambeau (2009), while Challis and Barber (2013) consider some different parameterizations in terms of the Cholesky factor. We do not consider Laplace approximations (Rue et al., 2009) in this paper since an important advantage of stochastic gradient methods is that they are generally amenable to sub-sampling, although this is not always straightforward in complex latent variable models where the local parameters are dependent.

In Section 2, we review doubly stochastic variational inference, the approach of Titsias and Lázaro-Gredilla (2014). Section 3 describes how the optimization problem can be framed in terms of the precision matrix, develops the algorithm using alternative gradient estimators and discusses the importance of imposing sparsity structure in the precision matrix. The setting of the learning rate in the stochastic gradient algorithm is discussed in Section 4. In Section 5, we illustrate how our approach can be applied to GLMMs and state space models. The performance of our algorithm is investigated using several real data sets. We conclude with a discussion of our major results and findings in Section 6.
2 Doubly stochastic variational inference

In this section, we provide some general background on variational methods and give a brief review of doubly stochastic variational inference (Titsias and Lázaro-Gredilla, 2014), as we will be considering a modification of their approach.

For a Bayesian inference problem, let θ denote the vector of variables, p(θ) be the prior and p(y|θ) the likelihood for observed data y. In variational approximation (see, e.g. Bishop, 2006; Ormerod and Wand, 2010), an attempt is made to approximate an intractable posterior distribution p(θ|y) ∝ p(θ)p(y|θ) using a member of some approximating family. Here we will consider a parametric family with typical element q_λ(θ), where λ denotes variational parameters to be chosen. Minimization of the Kullback-Leibler divergence between q_λ(θ) and p(θ|y) with respect to λ can be shown to be equivalent to maximizing a lower bound on the log marginal likelihood log p(y) (where p(y) = ∫ p(θ)p(y|θ) dθ), taking the form

L(λ) = ∫ q_λ(θ) log { p(θ)p(y|θ) / q_λ(θ) } dθ.

In non-conjugate models, L(λ) will generally not have a closed form. There has been much recent research concerned with stochastic gradient methods (Robbins and Monro, 1951) able to optimize L(λ) efficiently in this situation (Ji et al., 2010; Nott et al., 2012; Paisley et al., 2012; Kingma and Welling, 2014; Hoffman et al., 2013; Ranganath et al., 2014; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014, 2015).

The method of Titsias and Lázaro-Gredilla (2014) (hereafter TL) is one state of the art method which optimizes L(λ) using gradient information from the target distribution. Write h(θ) = p(θ)p(y|θ). In the TL method, an approximating distribution of the form

q(θ | µ, L) = |L|^{-1} f(L^{-1}(θ − µ))

is assumed (so that λ = (µ, L)), where f is a fixed density. Here µ is a vector of parameters of dimension d, where d is the dimension of θ, and L is a d × d lower triangular matrix with positive diagonal elements. If f is the density of a vector of independent standard normal random variables, then q(θ|µ, L) is normal, N(µ, LL^T), and the covariance matrix is being parameterized with L as the Cholesky factor. We will only be considering the case of a multivariate normal approximation in this paper.

The lower bound L(λ) = L(µ, L) is an expectation with respect to q(θ|µ, L), but can be written as an expectation with respect to the density f. Writing the integral in this way (for the purpose of the stochastic gradient optimization) results in an approach which is able to effectively use gradient information from the target log posterior. More precisely, writing E_q(·) for the expectation with respect to q(θ|µ, L) and E_f(·) for the expectation with respect to f, we have

L(µ, L) = E_q(log h(θ) − log q(θ|µ, L))
        = E_f(log h(µ + Ls) − log q(µ + Ls | µ, L))
        = E_f(log h(µ + Ls)) + log |L| + K,    (1)

where s = L^{-1}(θ − µ) is distributed according to the density f and K denotes a term not depending on µ, L. This approach of applying a transformation θ = µ + Ls so that the lower bound can be rewritten as an expectation with respect to a fixed density f that does not depend on the variational parameters is sometimes referred to as the "reparameterization trick" (Kingma and Welling, 2014; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014).
The advantage of this approach is that efficient gradient estimators of L(µ, L) can now be constructed by sampling s from f instead of θ from q(θ|µ, L), which has been found to result in estimators with very high variance (see, e.g. Paisley et al., 2012).

Next, we give expressions for the gradients of L(µ, L) with respect to µ and L. To explain their derivation we need some notation first. For a scalar valued function g(x) of a vector valued argument x, ∇_x g(x) denotes the gradient vector for the function, written as a column vector. Also, for a scalar valued function g(A) of a matrix A, ∇_A g(A) means vec^{-1}(∇_{vec(A)} g(A)), where, for a d × d square matrix A, vec(A) is the vector of length d² obtained by stacking the columns of A underneath each other, and vec^{-1} is the inverse operation that takes a vector of length d² and creates a d × d square matrix by filling up the columns from left to right from the elements of the vector. In addition, we use the following well known result. For conformably dimensioned matrices A, B and C, vec(ABC) = (C^T ⊗ A)vec(B). This implies that we can write Ls = vec(ILs) = (s^T ⊗ I)vec(L). Then

∇_µ E_f(log h(µ + Ls)) = E_f(∇_θ log h(µ + Ls))

and

∇_{vec(L)} E_f(log h(µ + Ls)) = ∇_{vec(L)} E_f(log h(µ + (s^T ⊗ I)vec(L)))
                             = E_f((s ⊗ I) ∇_θ log h(µ + Ls))
                             = E_f(vec(∇_θ log h(µ + Ls) s^T)).    (2)

The last line of (2) just says that

∇_L E_f(log h(µ + Ls)) = E_f(∇_θ log h(µ + Ls) s^T).    (3)

Note that entries above the diagonal should be set to zero on the right-hand side of (3) because L is lower triangular. For the term log |L| in L(µ, L), we have ∇_µ log |L| = 0 and ∇_L log |L| = diag(1/L_{11}, ..., 1/L_{dd}).

Once we have expressions for the derivatives of the lower bound as expectations with respect to f, we can estimate these gradients unbiasedly using simulations from this distribution (typically based on just a single draw). When the log-likelihood is a sum of a large number of terms, such as in the case of a very large dataset, we can subsample the terms and still construct appropriate unbiased gradient estimates if we desire (hence the name "doubly stochastic variational inference"). Algorithm 1 shows the basic stochastic gradient method of Titsias and Lázaro-Gredilla (2014). The sequence ρ_t, t ≥ 1, in the algorithm is a sequence of learning rates satisfying the Robbins-Monro conditions Σ_t ρ_t = ∞ and Σ_t ρ_t² < ∞.

Algorithm 1: Doubly stochastic variational inference algorithm of Titsias and Lázaro-Gredilla (2014).

Initialize µ^{(0)} = 0 and L^{(0)} = I_d. For t = 1, ..., N:
1. Generate s ∼ N(0, I_d).
2. Compute θ^{(t)} = µ^{(t−1)} + L^{(t−1)} s.
3. Update µ^{(t)} = µ^{(t−1)} + ρ_t ∇_θ log h(θ^{(t)}).
4. Update L^{(t)} = L^{(t−1)} + ρ_t { ∇_θ log h(θ^{(t)}) s^T + diag(1/L^{(t−1)}_{11}, ..., 1/L^{(t−1)}_{dd}) }. (Entries above the diagonal are fixed at zero.)
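As a concrete illustration, the following Julia fragment is a minimal sketch of Algorithm 1 for a generic target. It is not the authors' code: the function grad_logh (returning ∇_θ log h(θ)) and the learning-rate sequence ρ_t = a/t are assumptions made here for illustration, the latter being one simple choice satisfying the Robbins-Monro conditions.

```julia
using LinearAlgebra, Random

# Minimal sketch of Algorithm 1 (doubly stochastic variational inference).
# `grad_logh(θ)` must return ∇_θ log h(θ); ρ_t = a/t is an illustrative choice.
function dsvi(grad_logh, d; N = 10_000, a = 0.01)
    μ = zeros(d)
    L = Matrix{Float64}(I, d, d)          # Cholesky factor of the covariance Σ
    for t in 1:N
        ρ = a / t
        s = randn(d)
        θ = μ + L * s                     # reparameterization θ = μ + Ls
        g = grad_logh(θ)
        μ += ρ * g
        # gradient for L: ∇_θ log h(θ) sᵀ + diag(1/L₁₁, ..., 1/L_dd),
        # keeping only the lower triangle
        L += ρ * (LowerTriangular(g * s') + Diagonal(1 ./ diag(L)))
    end
    return μ, L
end

# Example with a Gaussian target, log h(θ) = -θᵀAθ/2 + bᵀθ up to a constant
A = [2.0 0.5; 0.5 1.0]
b = [1.0, -1.0]
μ_est, L_est = dsvi(θ -> b - A * θ, 2)
```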
3 Gaussian variational approximation with a sparse precision matrix

When the vector of variables θ is high-dimensional, allowing L to be a dense matrix is computationally impractical. An alternative is to assume that L is diagonal, but that loses any ability to capture the dependence structure of the posterior. Here we consider an alternative approach where we follow a similar strategy to that of Titsias and Lázaro-Gredilla (2014), but instead parameterize the inverse covariance (precision) matrix in terms of its Cholesky factor and then impose sparsity on it that reflects conditional independence structure in the model.

3.1 Model Specification

Consider a model with observations y = (y_1, ..., y_n), latent variables b_1, ..., b_n and model parameters η. Let

θ = [b_1^T, ..., b_n^T, η^T]^T    (4)

denote the vector of all variables. We assume that the joint distribution can be written in the form

p(y, θ) = p(η) { ∏_{i=1}^n p(y_i | b_i, η) } p(b_1, ..., b_k | η) × { ∏_{i>k} p(b_i | b_{i−1}, ..., b_{i−k}, η) }    (5)

for some 1 ≤ k ≤ n. In this model, b_i is conditionally independent of the other latent variables in the posterior distribution p(θ|y) given η and the neighboring k latent variables.

3.2 Gaussian variational approximation with sparse precision matrix

We consider the variational approximation q of the posterior to be a multivariate Gaussian distribution N(µ, T^{-T}T^{-1}), where T is a lower triangular matrix with positive diagonal entries. With f being the joint density of a d-vector of independent standard Gaussian variables as before, we can write q(θ|µ, T) = |T| f(T^T(θ − µ)).

Let Ω denote the precision matrix of the Gaussian distribution. Then Ω = TT^T and hence T is just the Cholesky factor of Ω. The statistical motivation for imposing sparsity on the Cholesky factor of the precision matrix is as follows. It is well known that for a Gaussian distribution, Ω_ij = 0 corresponds to variables i and j being conditionally independent given the rest. Also, if Ω = TT^T where T is lower triangular, Proposition 1 of Rothman et al. (2010) states that if T is row banded then Ω possesses the same row banded structure. This means that imposing sparsity in T can be useful for reflecting conditional independence relationships in Ω. For our model in (5), let us partition Ω into blocks Ω_ij, 1 ≤ i, j ≤ n + 1, according to (4). For the Gaussian variational approximation to reflect the conditional independence structure in the posterior, we would like to have Ω_ij = 0 for {1 ≤ i, j ≤ n | j < i − k or j > i + k}, with no constraints on the remaining blocks. Write T for the Cholesky factor partitioned in the same way as Ω with blocks T_ij, 1 ≤ i, j ≤ n + 1. Since T is lower triangular, T_ij = 0 if i < j, and T_ii, 1 ≤ i ≤ n + 1, are lower triangular matrices. If T_ij = 0 for {1 ≤ i, j ≤ n | j < i − k}, then Ω = TT^T has the desired banded structure, so the conditional independence relationships in the posterior are reflected in the Gaussian variational approximation.
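A quick numerical check of this banding property is easy to write down. The Julia fragment below is an illustration only (the band width k, dimensions and entries are arbitrary choices, not taken from the paper): it builds a sparse lower triangular T whose first n rows are banded, with dense final rows for η, and confirms that Ω = TTᵀ vanishes outside the band among the first n rows and columns.

```julia
using LinearAlgebra, SparseArrays

# Banded lower triangular T (band width k) plus dense rows for η,
# then check that Ω = TTᵀ inherits the band structure among rows 1..n.
n, k, p_eta = 50, 2, 3
d = n + p_eta
T = spzeros(d, d)
for i in 1:n, j in max(1, i - k):i
    T[i, j] = randn()                 # banded part for b₁,...,b_n
end
for i in (n + 1):d, j in 1:i
    T[i, j] = randn()                 # dense last rows for η
end
T = T + d * sparse(I, d, d)           # strengthen the diagonal

Ω = T * T'
# Largest |Ω[i,j]| outside the band: should be exactly zero
println(maximum(abs(Ω[i, j]) for i in 1:n for j in 1:n if abs(i - j) > k))
```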
However, ĝ_{µ,1} and ĝ_{T,1} are still subject to the randomness in s around the mode. Thus we prefer to use the estimators ĝ_{µ,2} and ĝ_{T,2}, which do not incur any additional computation except for the term Ts; here ĝ_{µ,2} = ∇_θ log h(θ) + Ts is the mean-gradient estimator used in step 3 of Algorithm 2 below, and ĝ_{T,2} is the corresponding estimator for T used in step 5.

Fig. 1: Toenail data: estimates of ∇_µ L given by ĝ_{µ,1} (black) and ĝ_{µ,2} (red) at the mode for a subset of the variables.

3.5 Uniqueness of the Cholesky factor

We note that in Algorithm 1, the updates of L do not ensure that the diagonal entries are positive. While this does not result in any computational issues, we prefer to add the following step to ensure that the diagonal entries of T are positive. This helps to ensure uniqueness of the Cholesky factor and reduces the possibility of multiple local modes, which is an important issue especially in the high-dimensional problems considered here. To achieve this aim, we introduce the lower triangular matrix T′ where T′_ij = T_ij if i ≠ j and T′_ii = log(T_ii). We compute the stochastic gradient updates for T′, whose entries are unconstrained. The gradient ĝ_{T′,2} = ĝ_{T,2} for all non-diagonal entries. Diagonal entries of ĝ_{T′,2} can be computed by multiplying the diagonal entries of ĝ_{T,2} by the diagonal entries of T.

The modification of the doubly stochastic variational inference algorithm, in terms of the Cholesky factor of the precision matrix, is summarized in Algorithm 2. Now let us consider sparsity in the Cholesky factor T. Suppose some elements of T are fixed at zero. Then Algorithm 2 remains the same, except that only the subset of elements of T which are not fixed at zero are stored and updated. Note that in step 2, if T is a sparse matrix, we can compute T^{-T}s by solving the linear system T^T x = s for x. This can be done very efficiently because T^T is upper triangular and sparse. Similarly, in computing the update at step 5, we need to compute the vector T^{-1}g_µ. This can also be computed by solving the linear system Tx = g_µ, which is again easy because T is a sparse lower triangular matrix. So even in very high dimensions, if T is appropriately sparse, Algorithm 2 can be implemented in a way that is efficient in terms of both memory storage requirements and CPU time.

Algorithm 2: Modified doubly stochastic variational inference algorithm parameterized in terms of the Cholesky factor of the precision matrix.

Initialize µ^{(0)} = 0, T^{(0)} = I_d and T′^{(0)} = 0. For t = 1, ..., N:
1. Generate s ∼ N(0, I_d).
2. Compute θ^{(t)} = µ^{(t−1)} + (T^{(t−1)})^{-T} s.
3. Compute g^{(t)}_µ = ∇_θ log h(θ^{(t)}) + T^{(t−1)} s.
4. Update µ^{(t)} = µ^{(t−1)} + ρ_t g^{(t)}_µ.
5. Set g^{(t)}_{T′} = −(T^{(t−1)})^{-T} s ((T^{(t−1)})^{-1} g^{(t)}_µ)^T.
6. Set diag(g^{(t)}_{T′}) = diag(g^{(t)}_{T′}) ⊙ diag(T^{(t−1)}).
7. Update T′^{(t)} = T′^{(t−1)} + ρ_t g^{(t)}_{T′}.
8. Update T^{(t)} = T′^{(t)}.
9. Update diag(T^{(t)}) = exp(diag(T′^{(t)})).

4 Learning rate and stopping criterion

4.1 Adaptive learning rates

Stochastic gradient algorithms commonly use learning rates of the form A_1/(t + A_2)^{A_3}, where t denotes the iteration, A_1 ≥ 0, A_2 ≥ 0 and 0.5 < A_3 ≤ 1 (see, e.g. Powell, 2011). It can be difficult to find a single such sequence that works well for all the variational parameters, as the elements of {µ, T} converge at different rates. For instance, Titsias and Lázaro-Gredilla (2014) scaled down the learning rate of µ by 0.1 when using it for L in Algorithm 1. It is also likely that the parameters have different "scale", especially when some of the constrained parameters have to be transformed to the real line. These considerations increase the need for learning rates that are adaptive and parameter-specific.
Several adaptive step-size sequences (e.g. Duchi et al., 2011; Ranganath et al., 2013; Kucukelbir et al., 2016) have been proposed. We find that the ADADELTA method (Zeiler, 2012), in particular, worked very well with Algorithm 2 and we use it for all the examples. For consistency, we also used ADADELTA to compute the step-size for Algorithm 1. While ADADELTA has worked well in our experiments, we have only worked on a limited number of datasets and it is likely that other learning rates may yield better performance. From our observations, the performance of learning rates tends to be problem-dependent.

The ADADELTA method takes into consideration the scale of the parameters by incorporating second order information through a Hessian approximation. Suppose at iteration t, a parameter x is updated as x^{(t)} = x^{(t−1)} + Δx^{(t)}, where Δx^{(t)} = ω g^{(t)}_x and g^{(t)}_x is the gradient. The step-size ω is computed as

ω = √(E[Δ²x]^{(t−1)} + ε) / √(E[g²_x]^{(t)} + ε),    (9)

where E[Δ²x]^{(t−1)} and E[g²_x]^{(t)} are exponentially decaying averages of Δx^{(t)2} and g^{(t)2}_x, and ε is a small positive constant added to ensure that the denominator is positive and the initial step-size is nonzero. The terms E[Δ²x]^{(t)} and E[g²_x]^{(t)} are updated as

E[Δ²x]^{(t)} = ρ E[Δ²x]^{(t−1)} + (1 − ρ) Δx^{(t)2}  and  E[g²_x]^{(t)} = ρ E[g²_x]^{(t−1)} + (1 − ρ) g^{(t)2}_x

at each iteration, where ρ is a decaying constant. The motivation of this approach comes from Newton-Raphson algorithms, where it is well known that the inverse of the Hessian matrix provides an optimal or near-optimal step-size sequence (see, e.g. Spall, 2003). ADADELTA approximates the Hessian by taking 1/f″(x) ≈ Δx/f′(x), hence the form of ω in (9). To apply ADADELTA, we modify Algorithm 2 as outlined below. Note that the step-sizes for µ and T′ are different.

Algorithm 2 with ADADELTA.

Initialize µ^{(0)} = 0, T^{(0)} = I_d, T′^{(0)} = 0, E[g²_µ]^{(0)} = E[Δ²µ]^{(0)} = 0 and E[g²_{T′}]^{(0)} = E[Δ²_{T′}]^{(0)} = 0. For t = 1, ..., N:
1. Generate s ∼ N(0, I_d).
2. θ^{(t)} = µ^{(t−1)} + (T^{(t−1)})^{-T} s.
3. g^{(t)}_µ = ∇_θ log h(θ^{(t)}) + T^{(t−1)} s.
4. Accumulate gradient E[g²_µ]^{(t)} = ρ E[g²_µ]^{(t−1)} + (1 − ρ) g^{(t)2}_µ.
5. Compute change Δ^{(t)}_µ = { √(E[Δ²µ]^{(t−1)} + ε) / √(E[g²_µ]^{(t)} + ε) } g^{(t)}_µ.
6. Accumulate change E[Δ²µ]^{(t)} = ρ E[Δ²µ]^{(t−1)} + (1 − ρ) Δ^{(t)2}_µ.
7. µ^{(t)} = µ^{(t−1)} + Δ^{(t)}_µ.
8. g^{(t)}_{T′} = −(T^{(t−1)})^{-T} s ((T^{(t−1)})^{-1} g^{(t)}_µ)^T.
9. Set diag(g^{(t)}_{T′}) = diag(g^{(t)}_{T′}) ⊙ diag(T^{(t−1)}).
10. Accumulate gradient E[g²_{T′}]^{(t)} = ρ E[g²_{T′}]^{(t−1)} + (1 − ρ) g^{(t)2}_{T′}.
11. Compute change Δ^{(t)}_{T′} = { √(E[Δ²_{T′}]^{(t−1)} + ε) / √(E[g²_{T′}]^{(t)} + ε) } g^{(t)}_{T′}.
12. Accumulate change E[Δ²_{T′}]^{(t)} = ρ E[Δ²_{T′}]^{(t−1)} + (1 − ρ) Δ^{(t)2}_{T′}.
13. T′^{(t)} = T′^{(t−1)} + Δ^{(t)}_{T′}.
14. Update T^{(t)} = T′^{(t)}.
15. Update diag(T^{(t)}) = exp(diag(T′^{(t)})).

As T′ is a sparse matrix, we find that it is more efficient to perform steps 8–12 in vector form and to store only the non-zero elements of g^{(t)}_{T′}, Δ^{(t)}_{T′}, E[g²_{T′}]^{(t)} and E[Δ²_{T′}]^{(t)}. We let ε = 10⁻⁶ and ρ = 0.95, the setting recommended by Zeiler (2012).
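The ADADELTA update in (9) is an elementwise rule, so it is convenient to code it once and apply it to the vectorized parameters. The sketch below is an illustration only; the gradient value is a placeholder and the function name is not taken from the paper.

```julia
# Minimal sketch of one elementwise ADADELTA update, cf. (9).
function adadelta_step!(x, g, Eg2, Edx2; ρ = 0.95, ε = 1e-6)
    @. Eg2 = ρ * Eg2 + (1 - ρ) * g^2              # accumulate squared gradient
    Δx = @. sqrt(Edx2 + ε) / sqrt(Eg2 + ε) * g    # step-size ω times gradient
    @. Edx2 = ρ * Edx2 + (1 - ρ) * Δx^2           # accumulate squared change
    @. x += Δx                                    # parameter update
    return x
end

# Usage: keep running accumulators alongside each parameter block.
μ   = zeros(5)
Eg2 = zeros(5); Edx2 = zeros(5)
g   = randn(5)                                    # placeholder for g_μ
adadelta_step!(μ, g, Eg2, Edx2)
```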
We note that Algorithm 2 is more sensitive to ρ when the estimators ĝ_{µ,1} and ĝ_{T,1} are used as compared to the alternative estimators ĝ_{µ,2} and ĝ_{T,2}.

4.2 Stopping Criterion

In variational algorithms, the lower bound is commonly used as an objective function to check for convergence. When the updates are deterministic, the lower bound is guaranteed to increase after each cycle and the algorithm can be terminated when the increase in the lower bound is negligible. In Algorithms 1 and 2, the updates are stochastic and so the lower bound is not guaranteed to increase at each iteration. Computing the lower bounds in (1) and (6) also requires evaluating the expectations with respect to the variational approximation q. It is straightforward, however, to obtain an unbiased estimate of the lower bound at each iteration. Let s be a random variate generated from N(0, I_d). From (1), an unbiased estimate of the lower bound for Algorithm 1 is given by

L̂ = log h(µ + Ls) − log q(µ + Ls | µ, L) = log h(µ + Ls) + (d/2) log(2π) + log |L| + (1/2) s^T s.

Similarly, an unbiased estimate of the lower bound for Algorithm 2 is given by

L̂ = log h(µ + T^{-T}s) + (d/2) log(2π) − log |T| + (1/2) s^T s.

Here we do not evaluate the expectation of the last term s^T s analytically, so that the randomness in s will cancel out between the first and the last term when we are close to the mode (a similar argument is given in Section 3.4).

As the estimate L̂ is stochastic, we consider instead the average of L̂ over the past F iterations, say L̄, to minimize variability. We compute L̄ after every F iterations and keep a record of the maximum value of L̄ attained thus far, say L̄_max. The algorithm is terminated when L̄ falls below L̄_max more than M consecutive times. This may imply either that the algorithm has converged, and hence the lower bound estimates are just bouncing around, or that the algorithm is diverging and the estimates of the lower bound are deteriorating. We say that the algorithm is "diverging" if L̄ is tending towards −∞. In Section 5, we adopt rather conservative values of F = 2500 and M = 3 to avoid the dangers of stopping prematurely (see, e.g. Booth and Hobert, 1999). Alternative stopping criteria can also be constructed by examining the relative change in the parameter updates from successive iterations or the magnitude of the gradients of the parameters (see, e.g. Spall, 2003).
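A sketch of this stopping rule in Julia is given below. It is an illustration only: lowerbound_estimate and update! are placeholders standing in for computing L̂ at the current iterate and performing one stochastic gradient update; they are not part of the authors' code.

```julia
# Average L̂ over blocks of F iterations and stop once the block average L̄
# has fallen below its running maximum more than M consecutive times.
function run_until_converged(lowerbound_estimate, update!; F = 2500, M = 3, maxiter = 10^6)
    Lbar_max = -Inf
    below = 0
    block_sum = 0.0
    for t in 1:maxiter
        update!()                              # one stochastic gradient update
        block_sum += lowerbound_estimate()     # accumulate L̂ for this block
        if t % F == 0
            Lbar = block_sum / F               # averaged lower bound L̄
            block_sum = 0.0
            below = Lbar < Lbar_max ? below + 1 : 0
            Lbar_max = max(Lbar_max, Lbar)
            below > M && return t              # terminate
        end
    end
    return maxiter
end
```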
5 Examples

In this section, we illustrate how we can impose sparsity in T via Algorithm 2 using appropriate posterior conditional independence relationships for generalized linear mixed models (GLMMs) and state space models (SSMs). We code Algorithms 1 and 2 in Julia version 0.5.0 (http://julialang.org/) and make use of its functions for sparse matrix representations to store and perform operations on high-dimensional sparse matrices efficiently.

We compare the variational approximations with posteriors obtained through long runs of MCMC (regarded as ground truth). In all examples, fitting via MCMC was performed in RStan (http://mc-stan.org/interfaces/rstan) and the same priors are used in MCMC and the variational approximations. For MCMC, we use 50,000 iterations in each example and the first half is discarded as burn-in. A thinning factor of 5 was applied and the remaining 5000 samples are used to estimate the posterior density.

We note that Algorithm 1 can also be readily implemented in Stan using automatic differentiation variational inference (ADVI, Kucukelbir et al., 2016). Hence we have also included the results from ADVI for comparison. However, there are some differences between our implementation of Algorithm 1 in Julia and that in Stan; namely, the learning rate and stopping criterion are different and we impose the additional restriction that the diagonal elements in L must be positive.

Table 1 shows the runtimes for ADVI and Algorithms 1 and 2 for the datasets considered in this section. We use the terms "mean-field" to refer to the case where L is a diagonal matrix and "unrestricted" when L is a full lower triangular matrix. All experiments are run on an Intel Core i5 CPU @ 3.20GHz with 8.0GB RAM.

Table 1: Runtime (in seconds) of ADVI (Stan) and Algorithms 1 and 2 (Julia) for datasets of different sizes. The number of iterations (in hundreds) used in Algorithms 1 and 2 is given in brackets.

                         Size   ADVI (Stan)                Algorithm 1                  Algorithm 2
                         (n)    mean-field  unrestricted   mean-field   unrestricted
Epilepsy (Model I)        59        2            1          7 (350)      18 (400)        20 (725)
Epilepsy (Model II)       59        4            4          9 (350)      45 (550)        32 (850)
Toenail                  294        4           15         18 (150)     135 (325)        88 (650)
Polypharmacy             500        7           74         30 (150)     262 (250)        56 (225)
GBP/USD exchange rate    945       10      diverge         10 (600)      diverge         43 (750)
DEM/USD exchange rate   1866       18      diverge         23 (700)      diverge         77 (700)

5.1 Generalized linear mixed models

Here we consider GLMMs where y_i = (y_{i1}, ..., y_{in_i})^T is the set of responses for the ith subject, X_ij and Z_ij are vectors of predictors for y_ij, µ_ij = E(y_ij), and g(·) is a smooth invertible link function. Let

g(µ_ij) = X_ij^T β + Z_ij^T b_i  for i = 1, ..., n, j = 1, ..., n_i,
b_i ∼ N(0, G)  for i = 1, ..., n,
β ∼ N(0, σ_β² I_k),

where β is a vector of fixed effects parameters and b_i is a random effect for the ith subject. Here we consider binary responses, where y_ij ∼ Bernoulli(µ_ij) with the logit link function g(µ_ij) = log{µ_ij/(1 − µ_ij)}, and count responses, where y_ij ∼ Poisson(µ_ij) with the log link function g(µ_ij) = log(µ_ij). Variational methods have been considered for efficient computation in GLMMs by Ormerod and Wand (2012), Tan and Nott (2013, 2014) and Lee and Wand (2016a,b), among others.

We parameterize the elements of the random effects covariance matrix G so that they are unconstrained and so that a normal variational posterior approximation is reasonable. Let G = WW^T, where W, the Cholesky factor of G, is a p × p lower triangular matrix with positive diagonal entries. Let W* denote the matrix for which W*_ii = log(W_ii) and W*_ij = W_ij if i ≠ j. Write ζ = vech(W*), where vech is the operation that transforms the lower triangular part of a square matrix into a vector by stacking elements below the diagonal column by column. We assume a normal prior, ζ ∼ N(0, σ_ζ² I_{p(p+1)/2}).
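To make the transformation between ζ and W concrete, the following Julia fragment (an illustrative sketch, not the authors' code; the function names and the example dimension p = 3 are ours) maps an unconstrained vector ζ to the Cholesky factor W with positive diagonal and back.

```julia
using LinearAlgebra

# Map ζ = vech(W*) to the Cholesky factor W of G, exponentiating the
# diagonal so that W has positive diagonal entries, and back.
function zeta_to_W(ζ, p)
    W = zeros(p, p)
    idx = 1
    for j in 1:p, i in j:p                 # stack the lower triangle column by column
        W[i, j] = (i == j) ? exp(ζ[idx]) : ζ[idx]
        idx += 1
    end
    return LowerTriangular(W)
end

W_to_zeta(W) = [(i == j ? log(W[i, j]) : W[i, j]) for j in 1:size(W, 1) for i in j:size(W, 1)]

ζ = randn(6)              # p = 3, so ζ has length p(p+1)/2 = 6
W = zeta_to_W(ζ, 3)
G = W * W'                # random effects covariance matrix
ζ ≈ W_to_zeta(W)          # true
```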
The vector of variables in the model is given by θ as defined in (4), where η = (β^T, ζ^T)^T, and the length of θ is d = pn + k + p(p + 1)/2. The joint distribution can be written as

p(y, θ) = h(θ) = p(β) p(ζ) ∏_{i=1}^n { p(b_i | ζ) ∏_{j=1}^{n_i} p(y_ij | β, b_i) }.

For this model, note that b_i and b_j are conditionally independent in the posterior distribution for i ≠ j given η. For the GLMM, the sparsity structure imposed on T and hence Ω is illustrated in (10). Our Algorithm 2 can efficiently learn a T with such a structure.

T = [ T_11                                                  ]
    [        T_22                                           ]
    [               ⋱                                       ]
    [                      T_nn                             ]
    [ T_{n+1,1}  T_{n+1,2}  ⋯  T_{n+1,n}  T_{n+1,n+1}       ],

Ω = [ Ω_11     0     ⋯     0      Ω_{1,n+1}                 ]
    [  0     Ω_22    ⋯     0      Ω_{2,n+1}                 ]
    [  ⋮       ⋮     ⋱     ⋮         ⋮                      ]
    [  0       0     ⋯    Ω_nn    Ω_{n,n+1}                 ]
    [ Ω_{n+1,1}  Ω_{n+1,2}  ⋯  Ω_{n+1,n}  Ω_{n+1,n+1}       ],    (10)

where blank entries are zero. For the GLMM, using a full rank lower triangular matrix L in Algorithm 1 requires updates of O(n²p²) elements at each iteration, while Algorithm 2 only requires O(np²) updates (assuming k and p are small). Hence the efficiency of Algorithm 2 as compared to Algorithm 1 (unrestricted) increases rapidly with the number of subjects in the dataset, as can be seen from Table 1. There is only a slight computational overhead in using Algorithm 2 as compared to a diagonal matrix L in Algorithm 1, which requires O(np) updates. However, Algorithm 2 reflects the posterior dependency structure and hence has the potential to provide better estimates. Next, we investigate the performance of Algorithm 2 on several data sets. We set σ_β = σ_ζ = 100 throughout. The gradient of log h(θ) is derived in Appendix A.
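The block structure in (10) is straightforward to set up with Julia's SparseArrays. The sketch below is an illustration under arbitrary example dimensions (not taken from the paper): it allocates only the block-diagonal entries and the final rows, so storage grows roughly as O(np²) rather than the O(n²p²) of a dense lower triangular factor.

```julia
using SparseArrays, LinearAlgebra

# Sparsity pattern of T in (10): p×p lower triangular diagonal blocks for
# b₁,...,b_n and dense final rows for the global parameters η = (βᵀ, ζᵀ)ᵀ.
n, p, k = 500, 2, 8                    # subjects, dim(b_i), dim(β)
q = k + div(p * (p + 1), 2)            # dim(η)
d = n * p + q

I_idx, J_idx = Int[], Int[]
for blk in 1:n, i in 1:p, j in 1:i     # lower triangle of each T_ii block
    push!(I_idx, (blk - 1) * p + i)
    push!(J_idx, (blk - 1) * p + j)
end
for i in (n * p + 1):d, j in 1:i       # dense last rows T_{n+1,1},...,T_{n+1,n+1}
    push!(I_idx, i); push!(J_idx, j)
end
T = sparse(I_idx, J_idx, ones(length(I_idx)), d, d)

nnz(T)      # roughly n p(p+1)/2 + q d nonzeros, versus d(d+1)/2 for a dense L
istril(T)   # true
```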
5.1.1 Epilepsy data

The epilepsy data of Thall and Vail (1990) includes n = 59 epileptics who were randomized to a new drug, progabide (Trt = 1), or a placebo (Trt = 0) in a clinical trial. The response is given by the number of seizures patients have during four follow-up periods. Other covariates include the logarithm of age (Age), the logarithm of the number of baseline seizures (Base), Visit (coded as Visit_1 = −0.3, Visit_2 = −0.1, Visit_3 = 0.1 and Visit_4 = 0.3) and V4, an indicator for the fourth visit. We center Age by replacing Age_i with Age_i − mean(Age).

Consider the following two models from Breslow and Clayton (1993). Model I is a Poisson random intercept model where

log µ_ij = β_0 + β_Base Base_i + β_Trt Trt_i + β_Age Age_i + β_Base×Trt Base_i × Trt_i + β_V4 V4_ij + b_i

for i = 1, ..., n, j = 1, ..., 4. Model II is a Poisson random intercept and slope model where

log µ_ij = β_0 + β_Base Base_i + β_Trt Trt_i + β_Age Age_i + β_Base×Trt Base_i × Trt_i + β_Visit Visit_ij + b_{i1} + b_{i2} Visit_ij

for i = 1, ..., n, j = 1, ..., 4.

The estimated posterior distributions of β and ζ are shown in Figure 2. Algorithm 1 (mean-field) converged quickly for both models, while the runtime of Algorithm 1 (unrestricted) doubled with the inclusion of a second random effect. For this dataset, Algorithm 2 performed better than the mean-field and unrestricted approximations. It produces very good approximations of the marginal posteriors of β, but is overconfident in estimating the marginal posteriors of ζ in Model II. The variational posteriors from Algorithm 1 (mean-field) are accurate in the mean but the variance is underestimated, quite severely in some cases.

Fig. 2: Epilepsy data: posterior distributions of β and ζ estimated using ADVI (dotted, blue for mean-field and green for unrestricted), Algorithm 1 (blue for mean-field and green for unrestricted), Algorithm 2 (red) and MCMC (black) for Model I (first two rows) and Model II (last two rows).

Figure 3 shows the iterates of the mean parameter µ corresponding to β and ζ and the averaged lower bound (L̄) for Model II. For Algorithm 1 (unrestricted), it appears that some of the parameters have yet to stabilize even though the lower bound has reached stationarity.

Fig. 3: Epilepsy data: mean (µ) iterates corresponding to β and ζ and the averaged lower bound (L̄) from Algorithm 1 with unrestricted lower triangular matrix L (green) and Algorithm 2 (red) for Model II.

5.1.2 Toenail data

This dataset compares two oral anti-fungal treatments for toenail infection (De Backer et al., 1998) and contains information for 294 patients who are evaluated at seven visits. Some patients did not attend all planned visits and there were a total of 1908 measurements. The patients were randomized into two treatment groups, one receiving 250 mg of terbinafine per day (Trt = 1) and the other 200 mg of itraconazole per day (Trt = 0). The time in months (t) at which they arrived for visits was recorded and the binary response variable (onycholysis) indicates the degree of separation of the nail plate from the nail bed (0 if none or mild, 1 if moderate or severe). We consider the logistic random intercept model

logit(µ_ij) = β_0 + β_Trt Trt_i + β_t t_ij + β_Trt×t Trt_i × t_ij + u_i

for i = 1, ..., 294, 1 ≤ j ≤ 7. The resulting posterior approximations are shown in Figure 4.

Fig. 4: Toenail data: posterior distributions of β and ζ estimated using ADVI (dotted, blue for mean-field and green for unrestricted), Algorithm 1 (blue for mean-field and green for unrestricted), Algorithm 2 (red) and MCMC (black).

5.1.3 Polypharmacy data

The polypharmacy data set (Hosmer et al., 2013) is available online and contains data on 500 subjects studied over seven years. The response is whether the subject is taking drugs from 3 or more different groups. We consider the covariates Gender = 1 if male and 0 if female, Race = 0 if the subject is white and 1 otherwise, Age, and the following binary indicators for the number of outpatient mental health visits (MHV): MHV1 = 1 if 1 ≤ MHV ≤ 5, MHV2 = 1 if 6 ≤ MHV ≤ 14 and MHV3 = 1 if MHV ≥ 15. Let INPTMHV = 0 if there were no inpatient mental health visits and 1 otherwise. We consider a logistic random intercept model (see Hosmer et al., 2013) of the form

logit(µ_ij) = β_0 + β_Gender Gender_i + β_Race Race_i + β_Age Age_ij + β_MHV1 MHV1_ij + β_MHV2 MHV2_ij + β_MHV3 MHV3_ij + β_INPTMHV INPTMHV_ij + u_i

for i = 1, ..., 500, j = 1, ..., 7.

We apply ADVI and Algorithms 1 and 2 to this model. From Table 1, the increase in runtime of Algorithm 1 (unrestricted) due to the larger number of subjects as compared to the toenail dataset is evident. The runtime of Algorithm 1 (unrestricted) is about 4.7 times that of Algorithm 2, while the runtime of Algorithm 2 is only slightly longer than that of Algorithm 1 (mean-field). From Figure 5, Algorithm 2 provided a very good approximation of the marginal posteriors of β but there is some underestimation of the mean and standard deviation of ζ.

Fig. 5: Polypharmacy data: posterior distributions of β and ζ estimated using ADVI (dotted, blue for mean-field and green for unrestricted), Algorithm 1 (blue for mean-field and green for unrestricted), Algorithm 2 (red) and MCMC (black).
5.2 State space models

Here we consider the stochastic volatility model widely used in modeling financial time series (see, e.g. Harvey et al., 1994; Kim et al., 1998), which is an example of a non-linear state space model. The observations y_t are generated from a zero-mean Gaussian distribution with a variance stochastically evolving over time. The unobserved log-volatility b_t is modeled as an AR(1) process with Gaussian disturbances. Let

y_t ∼ N(0, exp(λ + σ b_t))  for t = 1, ..., n,
b_1 ∼ N(0, 1/(1 − φ²)),
b_{t+1} ∼ N(φ b_t, 1)  for t = 1, ..., n − 1,    (11)

where λ ∈ R, σ > 0 and 0 < φ < 1. In (11), y_t is the mean-corrected return at time t and b_t is assumed to follow a stationary process with b_1 drawn from the stationary distribution. We transform the constrained parameters to the real line by letting σ = exp(α) and φ = exp(ψ)/{exp(ψ) + 1}, where α, ψ ∈ R. Assume normal priors α ∼ N(0, σ_α²), λ ∼ N(0, σ_λ²) and ψ ∼ N(0, σ_ψ²). The set of variables is given by θ = (b_1, ..., b_n, η^T)^T, where η = (α, λ, ψ), and the joint distribution is given by

p(y, θ) = h(θ) = p(α) p(λ) p(ψ) p(b_1 | ψ) × { ∏_{t=1}^n p(y_t | b_t, λ, α) } { ∏_{t=1}^{n−1} p(b_{t+1} | b_t, ψ) }.    (12)

From (12), b_t is conditionally independent of all other states in the posterior distribution given η and the neighboring states. Hence, we can take advantage of this conditional independence in the variational approximation q(θ) = N(µ, Ω^{-1}). By setting T_ij = 0 for 1 ≤ j < i − 1, i ≤ n, TT^T has the sparsity we desire for Ω. This sparse structure is illustrated in (13):

T = [ T_11                                                            ]
    [ T_21   T_22                                                     ]
    [        T_32   T_33                                              ]
    [                ⋱      ⋱                                         ]
    [                    T_{n,n−1}   T_nn                             ]
    [ T_{n+1,1}  T_{n+1,2}  T_{n+1,3}  ⋯  T_{n+1,n−1}  T_{n+1,n}  T_{n+1,n+1} ],    (13)

where blank entries are zero. For the SSM, the number of parameters to update in each iteration of Algorithm 1 (unrestricted) is O(n²), while Algorithm 2 only requires O(n) updates (similar to Algorithm 1 mean-field). This is an important factor to consider in SSMs as the number of observations in a time series over a long period may be large.

Next, we illustrate Algorithm 2 using two sets of exchange rate data, available from the dataset "Garch" in the R package Ecdat. We compute the mean-corrected response {y_t} from the exchange rates {r_t} as

y_t = 100 × { log(r_t / r_{t−1}) − (1/n) Σ_{i=1}^n log(r_i / r_{i−1}) }.

The gradient of log h(θ) is derived in Appendix B.

5.2.1 GBP/USD exchange rate data

Here we consider daily observations of the weekday exchange rates of the U.S. Dollar against the British Pound from 1st October 1981 to 28th June 1985. This dataset has been considered by Harvey et al. (1994), Kim et al. (1998) and Durbin and Koopman (2012). The number of responses is n = 945. Applying ADVI and Algorithms 1 and 2 to this dataset, the resulting variational posteriors are shown in Figure 6. We note that Algorithm 1 (unrestricted) diverges, as the averaged lower bound L̄ is deteriorating and tending towards −∞. ADVI (unrestricted) also fails to converge. The mean-field approximations of ADVI and Algorithm 1 have difficulty in capturing the means of α and ψ and only manage to capture the mean of λ. Algorithm 2 was able to capture the means with reasonable accuracy but there is underestimation in the variance of α and ψ.

Fig. 6: GBP/USD exchange rate data: posterior distributions of {α, λ, ψ} estimated using ADVI (dotted, blue for mean-field), Algorithm 1 (blue for mean-field), Algorithm 2 (red) and MCMC (black).

Figure 7 shows the mean (solid lines) and 1 standard deviation intervals (dotted lines) of the log volatility b_t at each time point estimated using Algorithm 2 and MCMC. Algorithm 2 was able to capture the means very accurately but there is some underestimation of the standard deviation.

Fig. 7: GBP/USD exchange rate data: mean (solid line) and 1 standard deviation intervals (dotted lines) of log volatility b_t estimated using Algorithm 2 (red) and MCMC (black).

5.2.2 DEM/USD exchange rate data

Next, we consider the entire series of weekday exchange rates of the U.S. Dollar against the German Deutschemark from 2nd January 1980 to 21st June 1987, also available in "Garch". This is a much larger dataset with n = 1866 responses. We apply ADVI and Algorithms 1 and 2 to this dataset, and the resulting posterior approximations are shown in Figure 8. The unrestricted approximations of ADVI and Algorithm 1 fail to converge again. The approximations of Algorithm 2 improved relative to the previous dataset and the underestimation of the standard deviation was less severe. As in the previous case, the mean-field approximations of ADVI and Algorithm 1 had difficulty in capturing the means of α and ψ. Figure 9 shows the mean and 1 standard deviation intervals of the log volatility b_t. For this dataset, Algorithm 2 captured both the mean and standard deviation of the log volatility accurately.

Fig. 8: DEM/USD exchange rate data: posterior distributions of {α, λ, ψ} estimated using ADVI (dotted, blue for mean-field), Algorithm 1 (blue for mean-field), Algorithm 2 (red) and MCMC (black).

Fig. 9: DEM/USD exchange rate data: mean (solid line) and 1 standard deviation intervals (dotted lines) of log volatility b_t estimated using Algorithm 2 (red) and MCMC (black).

6 Conclusion

In this article, we propose parameterizing the precision matrix in a Gaussian variational approximation in terms of its Cholesky factor and performing the optimization using stochastic gradient methods. This approach enables us to impose sparsity in the precision matrix so as to reflect conditional independence structure in the posterior distribution appropriately. We also propose improved estimators of the stochastic gradient, which have lower variance and which are helpful in increasing the stability and precision of the algorithm. We illustrate how our approach can be applied to generalized linear mixed models and state space models. Our experimental results indicate that imposing sparsity in the precision matrix reduces the computational complexity of the problem greatly without having to assume independence among the model parameters. This also helps to improve the accuracy of the posterior approximations. We note that Algorithm 2 performs better than the unrestricted approximations of ADVI and Algorithm 1 on several occasions even though there are fewer constraints in the unrestricted approximations. This may be due to the difficulties associated with such high-dimensional optimization. In using a Gaussian variational approximation, all constrained parameters have to be transformed to the real line, and we observed sensitivity towards the transformations being used as well as the way the model is parameterized. These are important areas for future work.
Acknowledgements
Linda Tan was supported by the National University of Singapore Overseas Postdoctoral Fellowship. David Nott's research was supported by a Singapore Ministry of Education Academic Research Fund Tier 2 grant (R-155-000-143-112). We thank the reviewers and the editors for their time and helpful comments which have improved the manuscript.

References
Archer, E., Park, I. M., Buesing, L., Cunningham, J., and Paninski, L. (2016). Black box variational inference for state space models. arXiv:1511.07367.
Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes. In K. Laskey and H. Prade (Eds.), Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, San Francisco, CA, pp. 21–30. Morgan Kaufmann.
Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.
Booth, J. G. and Hobert, J. P. (1999). Maximizing generalized linear mixed model likelihoods with an automated Monte Carlo EM algorithm. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 265–285.
Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models. Journal of the American Statistical Association 88, 9–25.
Challis, E. and Barber, D. (2013). Gaussian Kullback-Leibler approximate inference. Journal of Machine Learning Research 14, 2239–2286.
De Backer, M., De Vroey, C., Lesaffre, E., Scheys, I., and De Keyser, P. (1998). Twelve weeks of continuous oral therapy for toenail onychomycosis caused by dermatophytes: A double-blind comparative trial of terbinafine 250 mg/day versus itraconazole 200 mg/day. Journal of the American Academy of Dermatology 38, 57–63.
Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12, 2121–2159.
Durbin, J. and Koopman, S. J. (2012). Time series analysis by state space methods (2nd ed.). Oxford: Oxford University Press.
Gershman, S., Hoffman, M., and Blei, D. (2012). Nonparametric variational inference. In J. Langford and J. Pineau (Eds.), Proceedings of the 29th International Conference on Machine Learning, pp. 663–670.
Han, S., Liao, X., Dunson, D. B., and Carin, L. C. (2016). Variational Gaussian copula inference. In A. Gretton and C. C. Robert (Eds.), Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, Volume 51, pp. 829–838. JMLR Workshop and Conference Proceedings.
Harvey, A., Ruiz, E., and Shephard, N. (1994). Multivariate stochastic variance models. The Review of Economic Studies 61, 247–264.
Hoffman, M. and Blei, D. (2015). Stochastic structured variational inference. In G. Lebanon and S. Vishwanathan (Eds.), Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, Volume 38, pp. 361–369. JMLR Workshop and Conference Proceedings.
Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research 14, 1303–1347.
Hosmer, D. W., Lemeshow, S., and Sturdivant, R. X. (2013). Applied Logistic Regression (3rd ed.). Hoboken, NJ: John Wiley & Sons, Inc.
Ji, C., Shen, H., and West, M. (2010). Bounded approximations for marginal likelihoods. Technical Report 10-05, Institute of Decision Sciences, Duke University.
Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning 37, 183–233.
Kim, S., Shephard, N., and Chib, S. (1998). Stochastic volatility: likelihood inference and comparison with ARCH models. Review of Economic Studies 65, 361–393.
Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In Proceedings of the 2nd International Conference on Learning Representations (ICLR).
Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2016). Automatic differentiation variational inference. arXiv:1603.00788.
Lee, C. Y. Y. and Wand, M. P. (2016a). Streamlined mean field variational Bayes for longitudinal and multilevel data analysis. https://works.bepress.com/matt_wand/13/.
Lee, C. Y. Y. and Wand, M. P. (2016b). Variational methods for fitting complex Bayesian mixed effects models to health data. Statistics in Medicine 35, 165–188.
Nott, D. J., Tan, S. L., Villani, M., and Kohn, R. (2012). Regression density estimation with variational methods and stochastic approximation. Journal of Computational and Graphical Statistics 21, 797–820.
Opper, M. and Archambeau, C. (2009). The variational Gaussian approximation revisited. Neural Computation 21, 786–792.
Ormerod, J. T. and Wand, M. P. (2010). Explaining variational approximations. The American Statistician 64, 140–153.
Ormerod, J. T. and Wand, M. P. (2012). Gaussian variational approximate inference for generalized linear mixed models. Journal of Computational and Graphical Statistics 21, 2–17.
Paisley, J. W., Blei, D. M., and Jordan, M. I. (2012). Variational Bayesian inference with stochastic search. In Proceedings of the 29th International Conference on Machine Learning (ICML-12).
Powell, W. B. (2011). Approximate dynamic programming: solving the curses of dimensionality. Hoboken, NJ: John Wiley & Sons, Inc.
Ranganath, R., Gerrish, S., and Blei, D. M. (2014). Black box variational inference. In S. Kaski and J. Corander (Eds.), Proceedings of the 17th International Conference on Artificial Intelligence and Statistics, Volume 33, pp. 814–822. JMLR Workshop and Conference Proceedings.
Ranganath, R., Tang, L., Charlin, L., and Blei, D. (2015). Deep exponential families. In G. Lebanon and S. Vishwanathan (Eds.), Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, Volume 38, pp. 762–771. JMLR Workshop and Conference Proceedings.
Ranganath, R., Tran, D., and Blei, D. M. (2016). Hierarchical variational models. In M. F. Balcan and K. Q. Weinberger (Eds.), Proceedings of the 33rd International Conference on Machine Learning, Volume 37, pp. 324–333. JMLR Workshop and Conference Proceedings.
Ranganath, R., Wang, C., Blei, D. M., and Xing, E. P. (2013). An adaptive learning rate for stochastic variational inference. In Proceedings of the 30th International Conference on Machine Learning, pp. 298–306.
Rezende, D. J. and Mohamed, S. (2015). Variational inference with normalizing flows. In F. Bach and D. Blei (Eds.), Proceedings of the 32nd International Conference on Machine Learning, pp. 1530–1538. JMLR Workshop and Conference Proceedings.
Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In E. P. Xing and T. Jebara (Eds.), Proceedings of the 31st International Conference on Machine Learning, pp. 1278–1286. JMLR Workshop and Conference Proceedings.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics 22, 400–407.
Rohde, D. and Wand, M. P. (2015). Mean field variational Bayes: general principles and numerical issues. https://works.bepress.com/matt_wand/15/.
Rothman, A. J., Levina, E., and Zhu, J. (2010). A new approach to Cholesky-based covariance regularization in high dimensions. Biometrika 97(3), 539–550.
Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 319–392.
Salimans, T. and Knowles, D. A. (2013). Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis 8, 837–882.
Spall, J. C. (2003). Introduction to stochastic search and optimization: estimation, simulation and control. New Jersey: Wiley.
Tan, L. S. L. and Nott, D. J. (2013). Variational inference for generalized linear mixed models using partially non-centered parametrizations. Statistical Science 28, 168–188.
Tan, L. S. L. and Nott, D. J. (2014). A stochastic variational framework for fitting and diagnosing generalized linear mixed models. Bayesian Analysis 9, 963–1004.
Thall, P. F. and Vail, S. C. (1990). Some covariance models for longitudinal count data with overdispersion. Biometrics 46, 657–671.
Titsias, M. and Lázaro-Gredilla, M. (2014). Doubly stochastic variational Bayes for non-conjugate inference. In Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1971–1979.
Titsias, M. and Lázaro-Gredilla, M. (2015). Local expectation gradients for black box variational inference. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (NIPS 2015).
Wang, B. and Titterington, D. M. (2005). Inadequacy of interval estimates corresponding to variational Bayesian approximations. In R. G. Cowell and Z. Ghahramani (Eds.), Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, pp. 373–380. Society for Artificial Intelligence and Statistics.
Winn, J. and Bishop, C. M. (2005). Variational message passing. Journal of Machine Learning Research 6, 661–694.
Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701.
Appendix A Gradients for generalized linear mixed models
For the GLMM described in Section 5.1, we have

log h(θ) = Σ_{i=1}^n Σ_{j=1}^{n_i} log p(y_ij | β, b_i) + Σ_{i=1}^n log p(b_i | ζ) + log p(β) + log p(ζ)
         = Σ_{i,j} { y_ij (X_ij^T β + Z_ij^T b_i) − h(X_ij^T β + Z_ij^T b_i) } − n log |W| − (1/2) Σ_{i=1}^n b_i^T W^{-T} W^{-1} b_i − (1/(2σ_β²)) β^T β − (1/(2σ_ζ²)) ζ^T ζ + C,

where C is a constant not dependent on θ. For the logistic GLMM, h(x) = log{1 + exp(x)}, while for the Poisson GLMM, h(x) = exp(x). The gradient of log h(θ) is given by [∇_{b_1} log h(θ), ..., ∇_{b_n} log h(θ), ∇_β log h(θ), ∇_ζ log h(θ)], where

∇_{b_i} log h(θ) = Σ_{j=1}^{n_i} { y_ij − h′(X_ij^T β + Z_ij^T b_i) } Z_ij − W^{-T} W^{-1} b_i  for i = 1, ..., n,

∇_β log h(θ) = Σ_{i=1}^n Σ_{j=1}^{n_i} { y_ij − h′(X_ij^T β + Z_ij^T b_i) } X_ij − β/σ_β²,

∇_ζ log h(θ) = −n 1_{diag(W)} + 1_ζ ⊙ vech(A) − ζ/σ_ζ².    (14)

In (14), A = Σ_{i=1}^n W^{-T} W^{-1} b_i b_i^T W^{-T} with all entries above the diagonal set to zero, and 1_{diag(W)} and 1_ζ are vectors of length p(p + 1)/2. We define the ith element of 1_{diag(W)} as 1 if the ith element of ζ corresponds to a diagonal element of W and 0 otherwise. On the other hand, the ith element of 1_ζ is exp(ζ_i) if ζ_i corresponds to a diagonal element of W and 1 otherwise. For the logistic GLMM, h′(x) = {1 + exp(−x)}^{-1} and for the Poisson GLMM, h′(x) = exp(x). More details on the derivation of the gradients are given below.

As log |W| = Σ_{i=1}^p W*_ii, we have ∇_ζ log |W| = 1_{diag(W)}. For the term −(1/2) Σ_{i=1}^n b_i^T W^{-T} W^{-1} b_i,

d(b_i^T W^{-T} W^{-1} b_i) = −{ b_i^T W^{-T} (dW^T) W^{-T} W^{-1} b_i + b_i^T W^{-T} W^{-1} (dW) W^{-1} b_i }
 = −{ (b_i^T W^{-T} W^{-1} ⊗ b_i^T W^{-T}) vec(dW^T) + (b_i^T W^{-T} ⊗ b_i^T W^{-T} W^{-1}) vec(dW) }
 = −{ (b_i^T W^{-T} W^{-1} ⊗ b_i^T W^{-T}) K_p + (b_i^T W^{-T} ⊗ b_i^T W^{-T} W^{-1}) } vec(dW)
 = −2 (b_i^T W^{-T} W^{-1} ⊗ b_i^T W^{-T}) K_p dvec(W)
 = −2 vec(W^{-1} b_i b_i^T W^{-T} W^{-1})^T K_p dvec(W)
 = −2 vec(W^{-T} W^{-1} b_i b_i^T W^{-T})^T dvec(W),

where K_p denotes the p² × p² commutation matrix. Let A = Σ_{i=1}^n W^{-T} W^{-1} b_i b_i^T W^{-T} with all entries above the diagonal set to zero. As W is a lower triangular matrix,

∇_{vec(W)} ( −(1/2) Σ_{i=1}^n b_i^T W^{-T} W^{-1} b_i ) = vec(A).

Moreover,

dvec(W) = D_p dvech(W) = D_p diag(1_ζ) dζ,

where D_p is the p² × p(p+1)/2 duplication matrix, so that

∇_ζ ( −(1/2) Σ_{i=1}^n b_i^T W^{-T} W^{-1} b_i ) = diag(1_ζ) D_p^T vec(A) = 1_ζ ⊙ vech(A + A^T − dg A) = 1_ζ ⊙ vech(A),

where dg A is a diagonal matrix with diagonal equal to the diagonal of A. The last line follows because A is a lower triangular matrix.
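A direct transcription of the b- and β-gradients above for the logistic GLMM is sketched below. This is an illustration only, not the authors' code: the data containers X, Z and y are assumed to be per-subject arrays, the ζ-gradient is omitted for brevity, and the prior scale default is a placeholder.

```julia
using LinearAlgebra

# Gradients ∇_{b_i} log h(θ) and ∇_β log h(θ) for a logistic GLMM, following (14);
# X[i] is an n_i × k matrix, Z[i] an n_i × p matrix and y[i] an n_i vector of 0/1 responses.
hprime(x) = 1 / (1 + exp(-x))                 # h′(x) for the logistic GLMM

function grad_logh_glmm(β, b, W, X, Z, y; σ_β = 100.0)
    n = length(y)
    Winv = inv(LowerTriangular(W))
    Ginv = Winv' * Winv                       # W^{-T} W^{-1} = G^{-1}
    grad_β = -β / σ_β^2
    grad_b = Vector{Vector{Float64}}(undef, n)
    for i in 1:n
        r = y[i] .- hprime.(X[i] * β .+ Z[i] * b[i])   # y_ij − h′(X_ijᵀβ + Z_ijᵀb_i)
        grad_b[i] = Z[i]' * r .- Ginv * b[i]
        grad_β .+= X[i]' * r
    end
    return grad_b, grad_β
end
```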
For the stochastic volatility model in (12),

log h(θ) = −(n/2) λ − (e^α/2) Σ_{t=1}^n b_t − (1/2) Σ_{t=1}^n y_t² exp(−λ − e^α b_t) − (1/2) Σ_{t=1}^{n−1} (b_{t+1} − φ b_t)² + (1/2) log(1 − φ²) − (1/2)(1 − φ²) b_1² − α²/(2σ_α²) − λ²/(2σ_λ²) − ψ²/(2σ_ψ²) + C,

where C is a constant independent of θ. The gradient ∇_θ log h(θ) can be computed from the following components:

∇_{b_1} log h(θ) = −(1 − φ²) b_1 + φ (b_2 − φ b_1) − e^α/2 + (e^α/2) y_1² exp(−λ − e^α b_1),

∇_{b_t} log h(θ) = φ (b_{t+1} − φ b_t) − (b_t − φ b_{t−1}) − e^α/2 + (e^α/2) y_t² exp(−λ − e^α b_t)  for 1 < t < n,

∇_{b_n} log h(θ) = −(b_n − φ b_{n−1}) − e^α/2 + (e^α/2) y_n² exp(−λ − e^α b_n),

∇_α log h(θ) = (1/2) Σ_{t=1}^n y_t² b_t exp(α − λ − e^α b_t) − (e^α/2) Σ_{t=1}^n b_t − α/σ_α²,

∇_λ log h(θ) = −n/2 + (1/2) Σ_{t=1}^n y_t² exp(−λ − e^α b_t) − λ/σ_λ²,

∇_ψ log h(θ) = { φ b_1² − φ/(1 − φ²) + Σ_{t=1}^{n−1} (b_{t+1} − φ b_t) b_t } × e^ψ/(e^ψ + 1)² − ψ/σ_ψ².
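The sketch below transcribes these formulas into Julia. It is an illustration only; the prior standard deviations σ_α, σ_λ, σ_ψ are placeholder defaults to be supplied by the user.

```julia
# Gradient of log h(θ) for the stochastic volatility model, following Appendix B.
# θ = (b₁,...,b_n, α, λ, ψ); y is the vector of mean-corrected returns.
function grad_logh_sv(θ, y; σ_α = 10.0, σ_λ = 10.0, σ_ψ = 10.0)
    n = length(y)
    b = θ[1:n]; α, λ, ψ = θ[n+1], θ[n+2], θ[n+3]
    φ = exp(ψ) / (exp(ψ) + 1)
    eα = exp(α)
    w = y .^ 2 .* exp.(-λ .- eα .* b)          # y_t² exp(−λ − e^α b_t)
    g = similar(θ)
    # latent states
    for t in 1:n
        g[t] = -eα / 2 + eα / 2 * w[t]
        t > 1 && (g[t] -= b[t] - φ * b[t-1])
        t < n && (g[t] += φ * (b[t+1] - φ * b[t]))
    end
    g[1] -= (1 - φ^2) * b[1]
    # model parameters α, λ, ψ
    g[n+1] = sum(w .* b) * eα / 2 - eα / 2 * sum(b) - α / σ_α^2
    g[n+2] = -n / 2 + sum(w) / 2 - λ / σ_λ^2
    g[n+3] = (φ * b[1]^2 - φ / (1 - φ^2) +
              sum((b[2:n] .- φ .* b[1:n-1]) .* b[1:n-1])) * exp(ψ) / (exp(ψ) + 1)^2 - ψ / σ_ψ^2
    return g
end
```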