Use of model reparametrization to improve variational Bayes
Linda S. L. Tan†, National University of Singapore, Singapore
E-mail: [email protected]
Summary. We propose using model reparametrization to improve variational Bayes inference for hierarchical models whose variables can be classified as global (shared across observations) or local (observation specific). Posterior dependence between local and global variables is minimized by applying an invertible affine transformation on the local variables. The functional form of this transformation is deduced by approximating the posterior distribution of each local variable conditional on the global variables by a Gaussian density via a second order Taylor expansion. Variational Bayes inference for the reparametrized model is then obtained using stochastic approximation. Our approach can be readily extended to large datasets via a divide and recombine strategy. Using generalized linear mixed models, we demonstrate that reparametrized variational Bayes (RVB) provides improvements in both accuracy and convergence rate compared to state of the art Gaussian variational approximation methods.
Keywords: Model reparametrization; Variational approximation; Stochastic variational inference; Generalized linear mixed models; Centering
† Linda Tan was supported by the start-up grant R-155-000-190-133.
1. Introduction
Hierarchical models account for the statistical dependence of observations within clusters, by organizing them in a multilevel structure and introducing parameters which are cluster-specific or shared across clusters. Such models are immensely flexible and have found applications in many fields, ranging from social science (Gelman and Hill, 2006) and medicine (McCormick et al., 2012) to ecology (Cressie et al., 2009). The linear mixed model (LMM) and generalized linear mixed model (GLMM), for instance, are highly popular in epidemiological and biomedical studies where observations are often correlated and hierarchical in nature (Liu, 2016). However, hierarchical modeling poses computational challenges for practitioners, whether a Bayesian or maximum likelihood approach is used for inference. Here we consider a Bayesian framework, which accounts for uncertainty and allows for the incorporation of prior information from previous studies. A variety of hierarchical models can be fitted in the Bayesian framework using Markov chain Monte Carlo (MCMC) methods via standard software such as BUGS (Lunn et al., 2009) or Stan (Carpenter et al., 2017). Alternatively, fast approximate Bayesian inference can
be obtained for latent Gaussian models by using integrated nested Laplace approximation (INLA, Rue et al., 2009), and for a broader class of models by using variational approximation (Jacobs et al., 1991; Blei et al., 2017).

The parametrization of a hierarchical model has a huge impact on the performance of the statistical method used for fitting it, and two popular approaches are centering and noncentering. As illustration, the normal hierarchical model, y | x ∼ N(x, σ_y²) and x | θ ∼ N(θ, σ_x²), where y is observed and θ is the unknown parameter, is centered as the latent variable x is centered about θ. The equivalent model, written as y | x̃, θ ∼ N(x̃ + θ, σ_y²) and x̃ ∼ N(0, σ_x²), is noncentered as x̃ is independent of θ a priori. Gelfand et al. (1995, 1996) observe that MCMC samplers for fitting LMMs and GLMMs converge slowly when there is weak identifiability or high correlation among variables, and reparametrizations in the form of hierarchical centering improve the rate of convergence greatly. However, Papaspiliopoulos et al. (2003, 2007) show that centering performs well only if the data are highly informative about the latent process, and noncentering is preferred if the latent process is only weakly identified. For the normal hierarchical model, y is informative about x if σ_y² ≪ σ_x² (centering is preferred), and x is weakly identified if σ_y² ≫ σ_x² (noncentering performs better). In fact, centering and noncentering are complementary as the Gibbs sampler will converge faster under one parametrization if it converges slowly under the other. For the best of both worlds, Yu and Meng (2011) introduce an ancillarity-sufficiency strategy that interweaves the centered and noncentered parametrizations to improve the efficiency of MCMC algorithms.

While centering and noncentering are motivated by improving the prior geometry of latent variables to minimize "missing" information, reparametrizations that target the posterior geometry may yield even better results. In the MCMC context, Papaspiliopoulos et al. (2003, 2007) introduce a partially noncentered parametrization for local variables in LMMs so that they are a posteriori independent of each other and the global variables. This parametrization lies on the continuum between centering and noncentering and selects automatically the best parametrization. Christensen et al. (2006) propose data-based reparametrizations for spatial GLMMs that reduce posterior correlation among variables so that mixing and convergence of MCMC algorithms are more robust. INLA, which focuses on accurate marginal posterior approximations of latent Gaussian models, also uses a reparametrization of the global parameters (based on their posterior mode and covariance) to correct for scale and rotation, thus aiding exploration of the posterior marginal and simplifying numerical integration. In this article, we propose a reparametrization of the local variables that improves variational Bayes (VB, Attias, 1999) inference for hierarchical models, by applying an invertible affine transformation to minimize posterior correlation among local and global variables.

Variational approximation has become an increasingly popular alternative to MCMC methods for estimating posterior densities due to its ability to scale up to high-dimensional data. Suppose y is the observed data and θ is the set of variables in a model. In variational approximation, some restriction is placed on the density q(θ) approximating the true posterior so that it is more tractable.
The Kullback-Leibler (KL) divergence between q(θ) and p(θ | y), D_KL(q ‖ p) = ∫ q(θ) log{q(θ)/p(θ | y)} dθ, is then minimized subject to these restrictions (Ormerod and Wand, 2010). Common restrictions assume a parametric density such as a Gaussian, or a product density, where q(θ) = ∏_{i=1}^n q_i(θ_i) for some partition θ = {θ_1, ..., θ_n}. The latter, known as VB, has many advantages such as being low-dimensional, quick to converge, having closed form updates (for conditionally conjugate models) and scalability to large data (Hoffman et al., 2013). However, the resulting approximation can be poor if strong posterior dependencies exist among {θ_i}.

Tan and Nott (2013) demonstrate that partial noncentering can improve VB inference for GLMMs in terms of convergence rate as well as posterior approximation accuracy. Suppose β, b_i and Ω denote the fixed effects, random effects and random effects precision matrix for subject i = 1, ..., n in a GLMM, and there are random effects corresponding to each fixed effect. In VB, we may consider q(θ) = q(Ω) q(β) ∏_{i=1}^n q(b_i), where θ = {Ω, β, b} and b = [b_1^T, ..., b_n^T]^T. However, this approximation fails to capture the usually strong posterior dependence among {β, b}. If we transform b_i such that b̃_i = b_i − w_i β, then w_i can be tuned so that posterior dependence between b̃_i and β is minimized. An approximation q(θ̃) = q(Ω) q(β) ∏_{i=1}^n q(b̃_i), where θ̃ = {Ω, β, b̃_1, ..., b̃_n}, will then reflect the dependence structure in p(θ̃ | y) more accurately. However, this transformation does not account for posterior dependence between b_i and Ω, and difficulties remain in estimating the posteriors of Ω and regression coefficients in β which cannot be centered accurately. Lee and Wand (2016) and Rohde and Wand (2016) consider an alternative VB approximation for GLMMs which groups β and b together in a single factor q(β, b), and makes use of inherent block-diagonal matrix structures to derive efficient updates.

We consider VB inference for hierarchical models where variables are classified as global (shared across observations) or local (observation specific). Our goal is to obtain a low-dimensional posterior approximation scalable to large data, by reparametrizing the local variables so that posterior dependence between local and global variables is minimized. We apply an invertible affine transformation that is a function of the global variables, and the functional form is deduced by considering a second order Taylor approximation to the posterior distribution of the local variables conditional on the global ones. This transformation is shown to be a generalization of the partially noncentered parametrization when location and scale parameters are unknown. We then consider independent Gaussian approximations to the posteriors of global and local variables. As closed form updates are not feasible, variational parameters are optimized using stochastic gradient ascent (Titsias and Lázaro-Gredilla, 2014). The assumption of posterior independence among transformed local and global variables allows our approach to be extended to large data easily using a divide and recombine strategy (Broderick et al., 2013; Tran et al., 2016). Application of this methodology is illustrated using GLMMs. However, our methods can be extended to a wider class of models including discrete choice models and spatial GLMMs.
This approach of minimizing posterior dependence between global and local variables using model reparametrization prior to applying VB is referred to as reparametrized variational Bayes (RVB).

Other approaches to relax the independence assumption in VB exist in the literature (e.g. Gershman et al., 2012; Salimans and Knowles, 2013; Rezende and Mohamed, 2015). Hoffman and Blei (2015) develop structured stochastic variational inference for conditionally conjugate models, which allows local variables to depend explicitly on global variables in the variational posterior through some function γ(·) that is optimized numerically by maximizing a local lower bound. Titsias (2017) proposes model reparametrization for improving variational inference, where parameters of the affine transformation are optimized by minimizing the KL divergence between the variational approximation and a density obtained after a variable number of MCMC steps. Our approach is different as the functional form of our affine transformation is deduced from a Taylor approximation to the conditional posteriors of the local variables and remains fixed during optimization of the variational parameters. Kucukelbir et al. (2016) develop automatic differentiation variational inference in Stan, where the variational density can be a full Gaussian approximation. However, there may be difficulties in inferring such high-dimensional approximations. Tan and Nott (2018) consider a Gaussian variational approximation (GVA) where posterior dependence is captured using sparse precision matrices.

Our article is organized as follows. Section 2 introduces the notation, and Section 3 specifies the model and defines the affine transformation. Application of the affine transformation to GLMMs is illustrated in Section 4. The stochastic variational algorithm is described in Section 5 and Section 6 explains how stochastic gradients are computed. Extension of the variational algorithm to large datasets is discussed in Section 7 and results for GLMMs are presented in Section 8. Section 9 concludes.
2. Notation
For any r × r matrix A, let diag(A) denote the diagonal of A, dg(A) be the diagonal matrix derived from A by setting non-diagonal elements to zero, and Ā be the lower triangular matrix derived from A by setting all superdiagonal elements to zero. Let vec(A) denote the vector of length r² obtained by stacking the columns of A under each other from left to right, and v(A) be the vector of length r(r + 1)/2 obtained from vec(A) by eliminating all superdiagonal elements of A. Let E_r denote the r(r + 1)/2 × r² elimination matrix such that E_r vec(A) = v(A), and K_r be the r² × r² commutation matrix such that K_r vec(A) = vec(A^T). Let N_r = (K_r + I_{r²})/2. If A is lower triangular, then E_r^T v(A) = vec(A). If A is symmetric, then D_r v(A) = vec(A), where D_r is the r² × r(r + 1)/2 duplication matrix, and ⊗ denotes the Kronecker product. Scalar functions applied to vector arguments are evaluated elementwise.
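The matrix operators above are straightforward to compute; a minimal sketch in Julia (the language used for our code), with hypothetical helper names, is shown below for illustration.

```julia
using LinearAlgebra

# dg(A): diagonal matrix obtained from A by zeroing all non-diagonal elements
dg(A) = Diagonal(diag(A))

# Ā: lower triangular matrix obtained from A by zeroing all superdiagonal elements
lower(A) = LowerTriangular(A)

# v(A): vector of length r(r+1)/2 obtained from vec(A) by dropping superdiagonal elements
v(A) = [A[i, j] for j in 1:size(A, 2) for i in j:size(A, 1)]

A = [1.0 2.0; 3.0 4.0]
dg(A)      # diagonal matrix with entries 1.0 and 4.0
lower(A)   # zeroes the (1,2) entry
v(A)       # [1.0, 3.0, 4.0]
```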
3. Reparametrization using affine transformations
Suppose y_i = [y_{i1}, ..., y_{in_i}]^T is the response vector and b_i is the r × 1 vector of local variables associated with y_i, for i = 1, ..., n. Let y = [y_1^T, ..., y_n^T]^T, b = [b_1^T, ..., b_n^T]^T, θ_G be the g × 1 vector of global variables, θ = [θ_G^T, b^T]^T and d = g + nr be the length of θ. We focus on models whose joint density can be written as

p(y, θ) = p(θ_G) ∏_{i=1}^n p(b_i | θ_G) p(y_i | b_i, θ_G),   (1)

where p(θ_G) is the prior of θ_G, p(b_i | θ_G) is the density of b_i conditional on θ_G and p(y_i | b_i, θ_G) is the likelihood of observing y_i given b_i and θ_G. The posterior of θ has the structure p(θ | y) = p(θ_G | y) ∏_{i=1}^n p(b_i | θ_G, y_i), where the posteriors of the local variables are independent conditional on the global variables. Consider a variational approximation to p(θ | y) of the form

q(θ) = q(θ_G) ∏_{i=1}^n q(b_i).   (2)

As q(θ) assumes posterior independence among {θ_G, b_1, ..., b_n}, this can be a poor choice if strong dependencies exist. To improve the variational approximation, we reparametrize the model by applying an invertible affine transformation to each local variable b_i,

b̃_i = f(b_i | θ_G) = L_i^{-1}(b_i − λ_i),   (3)

where λ_i (vector of length r) and L_i (r × r lower triangular matrix with positive diagonal entries) are functions of θ_G. The inverse transformation is b_i = f^{-1}(b̃_i | θ_G) = L_i b̃_i + λ_i, and the functional forms of {λ_i, L_i} will be deduced from p(b_i | θ_G, y_i). The motivation is as follows. Suppose p(b_i | θ_G, y_i) can be approximated by N(λ_i, Λ_i), and L_i L_i^T is the unique Cholesky decomposition of Λ_i. Then b̃_i | θ_G, y ∼ N(0, I_r) approximately, which implies that b̃_i is independent of θ_G in the posterior. Suppose we estimate the posterior of θ̃ = [θ_G^T, b̃^T]^T using

q(θ̃) = q(θ_G) ∏_{i=1}^n q(b̃_i),   (4)

where b̃ = [b̃_1^T, ..., b̃_n^T]^T. The product density assumption in (4) is less restrictive than (2) as the posterior dependence of each b̃_i on θ_G has been minimized. It reflects more accurately the dependency structure in p(θ̃ | y) and hence is expected to be more accurate.

We use the LMM to illustrate how the functional forms of {λ_i, L_i} are deduced and show that the affine transformation is a generalization of the partially noncentered parametrization (Tan and Nott, 2013) when variance components are unknown. For i = 1, ..., n, j = 1, ..., n_i, let y_ij ∼ N(μ_ij, σ²), where σ² is assumed known for simplicity, μ_ij = X_ij^T β + Z_ij^T b_i, and b_i ∼ N(0, Ω^{-1}). Here β is a p × 1 vector of fixed effects, b_i is an r × 1 vector of random effects, and X_ij and Z_ij are covariates of length p and r respectively. Let X_i = [X_{i1}, ..., X_{in_i}]^T and Z_i = [Z_{i1}, ..., Z_{in_i}]^T. As p(b_i | β, Ω, y_i) ∝ p(y_i | b_i, β, Ω) p(b_i | Ω), b_i | β, Ω, y_i ∼ N(λ_i, Λ_i), where

Λ_i^{-1} = Ω + Z_i^T Z_i/σ²,   λ_i = Λ_i Z_i^T (y_i − X_i β)/σ².   (5)

We seek a transformation b̃_i = f(b_i | β, Ω) to minimize posterior dependence of b̃_i on {β, Ω}. Suppose Ω is known and X_i = Z_i. Then b_i depends on β only through its mean and it suffices to consider

b̃_i = b_i + Λ_i X_i^T X_i β/σ²,   (6)

where L_i = I_r and λ_i = −Λ_i X_i^T X_i β/σ² in (3). Then b̃_i is independent of β a posteriori. Even if we assume q(θ̃) = q(β) ∏_{i=1}^n q(b̃_i), the optimal q(θ̃) which minimizes D_KL(q(θ̃) ‖ p(θ̃ | y)) is equal to p(θ̃ | y) as this product density structure is obeyed in p(θ̃ | y).
Table 1.
Poisson and binomial GLMMs with canonical links. Poisson(μ_ij) denotes the Poisson distribution with mean μ_ij and Binomial(m_ij, p_ij) denotes the binomial distribution with m_ij trials and success probability p_ij.

Model for y_ij          E(y_ij)      Canonical link g(μ_ij)    h_ij(η_ij)
Poisson(μ_ij)           μ_ij         log(μ_ij)                 exp(η_ij)
Binomial(m_ij, p_ij)    m_ij p_ij    logit(μ_ij/m_ij)          m_ij log{1 + exp(η_ij)}

Tan and Nott (2013) introduce a partially noncentered parametrization, b̃_i = b_i + (I_r − W_i)β, where W_i is an r × r tuning parameter. When W_i = 0, the parametrization is centered as b̃_i has a normal prior centered at β. When W_i = I_r, b̃_i is independent of β a priori and the parametrization is noncentered. The optimal value of W_i is Λ_i Ω, which leads to instant convergence and recovery of the true posterior in VB. This reparametrization is equivalent to (6) as I_r − W_i = Λ_i(Λ_i^{-1} − Ω) = Λ_i X_i^T X_i/σ². The centered and noncentered parametrizations correspond to L_i = I_r with λ_i = −β and λ_i = 0 respectively. Thus the optimal partially noncentered parametrization which minimizes D_KL(q(θ̃) ‖ p(θ̃ | y)) actually transforms b_i so that it is independent of β in the posterior. Extending this notion of minimizing posterior correlation to the case where Ω is unknown and X_i = Z_i, we can apply the transformation in (3) with λ_i and Λ_i given in (5). For LMMs, it is easy to identify λ_i and Λ_i as p(b_i | β, Ω, y_i) is Gaussian. More generally, we use Taylor expansions to obtain a Gaussian approximation to p(b_i | θ_G, y_i).
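For the LMM case just described, the quantities in (5) and the transformation in (3) can be computed in a few lines of Julia. The sketch below is for illustration only; the function names are ours, and X_i, Z_i, y_i, β, Ω and σ² are assumed available.

```julia
using LinearAlgebra

# Conditional posterior b_i | β, Ω, y_i ~ N(λ_i, Λ_i) for the LMM, as in (5)
function lmm_conditional(Xi, Zi, yi, β, Ω, σ2)
    Λi = inv(Symmetric(Ω + Zi' * Zi / σ2))
    λi = Λi * Zi' * (yi - Xi * β) / σ2
    return λi, Λi
end

# Affine transformation (3): b̃_i = L_i^{-1}(b_i − λ_i), where Λ_i = L_i L_i'
function to_btilde(bi, λi, Λi)
    Li = cholesky(Symmetric(Λi)).L
    return Li \ (bi - λi)
end
```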
4. Application to generalized linear mixed models
For i = 1, ..., n, j = 1, ..., n_i, let y_ij be generated from a distribution in the exponential family and let μ_ij = E(y_ij) depend on the linear predictor, η_ij = X_ij^T β + Z_ij^T b_i, through the link function g(·) such that g(μ_ij) = η_ij. Here β, b_i, X_ij, Z_ij, X_i and Z_i are defined as for LMMs. We consider a diffuse prior N(0, σ_β² I_p) for β where σ_β² is large, and the Wishart prior W(ν, S) for the precision matrix Ω, where ν > r − 1 and S is an r × r positive definite scale matrix. The hyperparameters ν and S are determined based on the default conjugate prior proposed by Kass and Natarajan (2006), and details are given in the supplementary material Section S1. Let η_i = X_i β + Z_i b_i and let W W^T be the unique Cholesky decomposition of Ω, where W is lower triangular with positive diagonal entries. Define the r × r matrix W* such that W*_ii = log(W_ii), W*_ij = W_ij if i ≠ j, and let ω = v(W*). The vector of global variables is thus θ_G = [β^T, ω^T]^T.

We focus on GLMMs with canonical links, and the one-parameter exponential family whose natural parameter is η_ij ∈ ℝ. Some examples are given in Table 1. The joint density is p(y, θ) = p(β) p(ω) ∏_{i=1}^n p(b_i | ω) ∏_{j=1}^{n_i} p(y_ij | η_ij) and

log p(y, θ) = Σ_{i=1}^n [ Σ_{j=1}^{n_i} {y_ij η_ij − h_ij(η_ij)} − ½ b_i^T Ω b_i ] + (n/2) log|Ω| − β^T β/(2σ_β²) + log p(ω) + C_1,

where h_ij(·) is the log partition function and C_1 is a constant independent of θ. The induced prior is p(ω) = p(Ω) |∂v(Ω)/∂ω|, and Section S2 of the supplementary material shows that

log p(ω) = log p(Ω) + r log 2 + Σ_{i=1}^r (r − i + 2) log(W_ii),

where log p(Ω) = ((ν − r − 1)/2) log|Ω| − ½ tr(S^{-1}Ω) + C_2 and C_2 is a constant independent of Ω. Let h'_ij(·), h''_ij(·) and h'''_ij(·) denote the first, second and third derivatives of h_ij(·) respectively. From properties of the exponential family, E(y_ij) = h'_ij(η_ij) and var(y_ij) = h''_ij(η_ij) ≥ 0. Define

g_i(η_i) = [h'_{i1}(η_{i1}), ..., h'_{in_i}(η_{in_i})]^T,   H_i(η_i) = diag([h''_{i1}(η_{i1}), ..., h''_{in_i}(η_{in_i})]^T).

Gaussian approximation of p(b_i | θ_G, y_i): first approach

In the first approach, we first find a Gaussian approximation to the likelihood p(y_i | η_i) by considering a second-order Taylor expansion about some estimate η̂_i. This is then combined with p(b_i | ω) to obtain a Gaussian approximation to p(b_i | θ_G, y_i). We have ∇_{η_i} log p(y_i | η_i) = y_i − g_i(η_i) and ∇²_{η_i} log p(y_i | η_i) = −H_i(η_i). Thus

log p(y_i | η_i) ≈ log p(y_i | η̂_i) + (η_i − η̂_i)^T (y_i − g_i(η̂_i)) − ½ (η_i − η̂_i)^T H_i(η̂_i)(η_i − η̂_i).

As p(b_i | θ_G, y_i) ∝ exp{log p(y_i | η_i) + log p(b_i | ω)} ∝ exp{b_i^T Z_i^T (y_i − g_i(η̂_i)) − ½ (X_i β + Z_i b_i − η̂_i)^T H_i(η̂_i)(X_i β + Z_i b_i − η̂_i) − ½ b_i^T Ω b_i}, it follows that b_i | θ_G, y_i ∼ N(λ_i, Λ_i) approximately, where

Λ_i = (Ω + Z_i^T H_i(η̂_i) Z_i)^{-1},   λ_i = Λ_i Z_i^T {y_i − g_i(η̂_i) + H_i(η̂_i)(η̂_i − X_i β)}.   (7)

Next, we select the η̂_i at which the Taylor expansion is evaluated. One possibility is to let η̂_i be the mode of log p(y_i | η_i), the maximum likelihood estimate (MLE). Then ∇_{η_i} log p(y_i | η_i)|_{η_i = η̂_i} = y_i − g_i(η̂_i) = 0, so that the expected value of y_i is equal to its observed value and λ_i simplifies to Λ_i Z_i^T H_i(η̂_i)(η̂_i − X_i β). As the n_i observations in y_i are independent, we can evaluate each element independently, η̂^{ML}_ij = argmax_{η_ij ∈ ℝ} p(y_ij | η_ij). The estimate η̂^{ML}_ij is data-based (it depends only on y_ij). However, η̂^{ML}_ij may not exist for boundary values of y_ij. For instance, η̂^{ML}_ij = logit(y_ij) for the Bernoulli model, which does not exist when y_ij = 0 or 1. In such cases, we may consider the "MLE" at infinity. Lemma 1 states the limits of some important quantities in λ_i as η_ij → ±∞, which leads to the definition of c_i as the "MLE" on the extended real line. From Theorem 1, the limit of λ_i always exists as η̂_i → c_i and is zero under certain conditions. Proofs of Lemma 1 and Theorem 1 are given in the supplementary material.

Lemma 1.
For i = 1, ..., n, j = 1, ..., n_i, if η̂^{ML}_ij does not exist for an observation y_ij, then lim_{η_ij → c} h'_ij(η_ij) = y_ij, lim_{η_ij → c} h''_ij(η_ij) = 0 and lim_{η_ij → c} h''_ij(η_ij) η_ij = 0, for either c = −∞ or c = ∞.
Definition 1.
For i = 1, ..., n, c_i = [c_{i1}, ..., c_{in_i}]^T, where c_ij = η̂^{ML}_ij if η̂^{ML}_ij exists and c_ij = lim_{h'_ij(η_ij) → y_ij} η_ij if η̂^{ML}_ij does not exist, for j = 1, ..., n_i.

Theorem 1.
For i = 1, ..., n, the limit of λ_i exists as η̂_i → c_i, and is zero if η̂^{ML}_ij does not exist for each j = 1, ..., n_i.

Corollary 1.
For i = 1, ..., n, if η̂_i → c_i, then λ_i → 0 (i) for the Poisson GLMM with log link if y_i = 0, and (ii) for the binomial GLMM with logit link if the elements in y_i are either 0 or m_ij.

Proof.
For the Poisson GLMM with log link, η̂^{ML}_ij = log(y_ij), which is undefined if y_ij = 0. For the binomial GLMM with logit link, η̂^{ML}_ij = logit(y_ij/m_ij), which is undefined if y_ij is 0 or m_ij. From Theorem 1, λ_i → 0 if η̂^{ML}_ij does not exist for each j = 1, ..., n_i.

From Corollary 1, the (location) noncentered parametrization is preferred for Poisson and binomial models when y_ij lies on the support boundary, and is always preferred for Bernoulli models. If we set η̂_i → c_i, then both Λ_i and the limit of λ_i can be evaluated in principle from Theorem 1. We have tried this approach but found that it results in unstable stochastic variational algorithms if parameter initialization is far from convergence. For instance, if y_i = 0 for the Poisson GLMM, then H_i(η̂_i) = diag(y_i) = 0, which implies that λ_i → 0, Λ_i = Ω^{-1} and b_i = L_i b̃_i. If the estimate of ω is too small, then estimates for Λ_i = Ω^{-1} = (W W^T)^{-1} and b_i may experience overflow. This issue is alleviated if H_i(η̂_i) is positive definite. While this problem can likely be resolved with better parameter initialization, we prefer non-problem-specific initialization.

To avoid undefined estimates of η_ij at the support boundary of y_ij, we "regularize" the MLE using a Bayesian approach. Adopting a non-informative Jeffreys prior for μ_ij, p(μ_ij | y_ij) ∝ p(y_ij | μ_ij) p(μ_ij) ∝ p(y_ij | μ_ij) √|I(μ_ij)|, and we use the posterior mean E(η_ij | y_ij) = E(g(μ_ij) | y_ij) to estimate η_ij. Firth (1993) showed that for an exponential family model, the posterior mode corresponding to Jeffreys prior is equivalent to a penalized MLE, where the penalty of ½ log|I(μ_ij)| helps to remove the asymptotic bias in the MLE. Consider y_ij ∼ Poisson(μ_ij) with log μ_ij = η_ij. The Jeffreys prior for μ_ij is p(μ_ij) ∝ μ_ij^{-1/2}, p(μ_ij | y_ij) is Gamma(y_ij + 0.5, 1) and

η̂_ij = E(η_ij | y_ij) = E(log(μ_ij) | y_ij) = ψ(y_ij + 0.5),

where ψ(·) is the digamma function. If y_ij ∼ Binomial(m_ij, p_ij), then the Jeffreys prior for p_ij is Beta(0.5, 0.5), p(p_ij | y_ij) is Beta(y_ij + 0.5, m_ij − y_ij + 0.5) and

η̂_ij = E(η_ij | y_ij) = E(logit(p_ij) | y_ij) = E(log(p_ij) | y_ij) − E(log(1 − p_ij) | y_ij) = ψ(y_ij + 0.5) − ψ(m_ij − y_ij + 0.5).

If y_ij ∼ Bernoulli(p_ij), then m_ij = 1 and η̂_ij = ψ(y_ij + 0.5) − ψ(1.5 − y_ij). Figure 1 shows that the regularized estimates η̂_ij are indistinguishable from the MLEs for values of y_ij that do not lie on the support boundary. However, by using these regularized estimates, λ_i will be close to but not exactly zero even if y_i lies on the support boundary for the Poisson and binomial GLMMs (see Table 2).
Fig. 1.
Plot of maximum likelihood and regularized estimates of η̂_ij against y_ij for Poisson and binomial (m_ij = 10) models.
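The regularized estimates of η̂_ij are simple functions of the digamma function ψ; a minimal Julia sketch (function names are ours, for illustration) reproduces the values reported in Table 2.

```julia
using SpecialFunctions  # provides digamma (ψ)

# Regularized estimates of η_ij under a Jeffreys prior for μ_ij
etahat_poisson(y)     = digamma(y + 0.5)
etahat_binomial(y, m) = digamma(y + 0.5) - digamma(m - y + 0.5)
etahat_bernoulli(y)   = digamma(y + 0.5) - digamma(1.5 - y)

etahat_poisson(0)      # ≈ -1.96, cf. Table 2
etahat_binomial(0, 10) # ≈ -4.27
etahat_bernoulli(1)    # = 2
```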
Table 2. Values of some quantities in Lemma 1 evaluated using regularized estimates of η̂_ij, compared to the "MLEs" at infinity (in brackets).

                        Poisson        Binomial(10, p_ij)           Bernoulli(p_ij)
                        y_ij = 0       y_ij = 0      y_ij = m_ij    y_ij = 0     y_ij = 1
η̂_ij                   -1.96 (−∞)     -4.27 (−∞)    4.27 (∞)       -2 (−∞)      2 (∞)
h'_ij(η̂_ij)            0.14 (0)       0.14 (0)      9.86 (10)      0.12 (0)     0.88 (1)
h''_ij(η̂_ij)           0.14 (0)       0.14 (0)      0.14 (0)       0.10 (0)     0.10 (0)
h''_ij(η̂_ij) η̂_ij     -0.28 (0)      -0.58 (0)     0.58 (0)       -0.21 (0)    0.21 (0)

From our experiments, a Gaussian approximation of p(b_i | y_i, θ_G) based on a Taylor expansion of p(y_i | η_i) about the regularized estimate η̂_i generally works better for Poisson and binomial models than for Bernoulli models. To understand this phenomenon, consider the Fisher information I(η_ij), which measures the amount of information y_ij carries about η_ij. For the one-parameter exponential family p(y_ij | η_ij), I(η_ij) = h''(η_ij) = var(y_ij). We have I(η_ij) = μ_ij for Poisson(μ_ij) and I(η_ij) = m_ij p_ij (1 − p_ij) for Binomial(m_ij, p_ij). Therefore, y_ij is more informative about η_ij when μ_ij is large for Poisson models, and when m_ij is large and p_ij is far from 0 and 1 for binomial models. Hence, we expect the first approach to perform well when y_ij is large for Poisson models, and when m_ij is large and y_ij/m_ij is far from 0 and 1 for binomial models. For the Bernoulli model, m_ij is at its lowest value of 1. Hence y_ij carries less information about η_ij for Bernoulli models and it is difficult to obtain an accurate approximation of p(b_i | y_i, θ_G) using the first approach. The second approach below is computationally more intensive, but it works better for Bernoulli models, binomial models with small m_ij or y_ij/m_ij close to 0 or 1, and Poisson models with many zero counts.

Gaussian approximation of p(b_i | θ_G, y_i): second approach

Consider a Taylor expansion of log p(b_i | θ_G, y_i) about some estimate b̂_i of b_i. We have ∇_{b_i} log p(b_i | θ_G, y_i) = Z_i^T (y_i − g_i(η_i)) − Ω b_i, ∇²_{b_i} log p(b_i | θ_G, y_i) = −Z_i^T H_i(η_i) Z_i − Ω, and

p(b_i | θ_G, y_i) ∝ exp[ b_i^T {Z_i^T (y_i − g_i(X_i β + Z_i b̂_i)) − Ω b̂_i} − ½ (b_i − b̂_i)^T {Z_i^T H_i(X_i β + Z_i b̂_i) Z_i + Ω} (b_i − b̂_i) ].
This implies a Gaussian approximation with Λ_i^{-1} = Z_i^T H_i(X_i β + Z_i b̂_i) Z_i + Ω and λ_i = b̂_i + Λ_i [Z_i^T {y_i − g_i(X_i β + Z_i b̂_i)} − Ω b̂_i]. If we choose b̂_i to be the mode of p(b_i | θ_G, y_i), then b̂_i satisfies Z_i^T (y_i − g_i(X_i β + Z_i b̂_i)) = Ω b̂_i and

λ_i = b̂_i,   Λ_i = {Z_i^T H_i(X_i β + Z_i b̂_i) Z_i + Ω}^{-1}.   (8)

This approach yields more accurate estimates of λ_i and Λ_i, which translates into better variational approximations, but additional computation is required to find the mode b̂_i. Using the Newton-Raphson method, we start with an initial estimate b_i^{(0)} and iterate

b_i^{(t+1)} = b_i^{(t)} + (Z_i^T H_i(X_i β + Z_i b_i^{(t)}) Z_i + Ω)^{-1} {Z_i^T (y_i − g_i(X_i β + Z_i b_i^{(t)})) − Ω b_i^{(t)}}

until convergence. If n_i < r, we set b_i^{(0)} = 0. Otherwise, we obtain b_i^{(0)} by using the regularized estimate η̂_i described in the first approach. As η̂_i ≈ X_i β + Z_i b_i, an estimate of b_i is b_i^{(0)} = (Z_i^T Z_i)^{-1} Z_i^T (η̂_i − X_i β). We find this initialization to be highly useful in avoiding convergence issues. Monitoring log p(b_i | θ_G, y_i) at each iteration, updates are terminated once the rate of increase of log p(b_i | θ_G, y_i) falls below a small fixed tolerance.
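The mode-finding step in the second approach can be implemented compactly; the Julia sketch below uses illustrative names and a simple stopping rule on the size of the Newton step (rather than monitoring log p(b_i | θ_G, y_i) as described above). It assumes g_i(·) and H_i(·) are supplied for the model at hand, e.g. exp.(η) and Diagonal(exp.(η)) for the Poisson GLMM with log link.

```julia
using LinearAlgebra

# Newton-Raphson for the mode b̂_i of p(b_i | θ_G, y_i) (second approach), returning (λ_i, Λ_i) as in (8)
function find_mode(Xi, Zi, yi, β, Ω, gi, Hi; b0 = zeros(size(Zi, 2)), maxiter = 100, tol = 1e-6)
    bi = copy(b0)
    for t in 1:maxiter
        η    = Xi * β + Zi * bi
        grad = Zi' * (yi - gi(η)) - Ω * bi       # ∇_{b_i} log p(b_i | θ_G, y_i)
        hess = Zi' * Hi(η) * Zi + Ω              # −∇²_{b_i} log p(b_i | θ_G, y_i)
        step = hess \ grad
        bi  += step
        norm(step) < tol && break                # simple stopping rule for this sketch
    end
    Λi = inv(Symmetric(Zi' * Hi(Xi * β + Zi * bi) * Zi + Ω))
    return bi, Λi                                # λ_i = b̂_i and Λ_i
end
```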
5. Reparametrized variational Bayes inference
We consider a variational approximation for the reparametrized model of the form in (4), where q(θ_G) and q(b̃_i) are Gaussian. Hence, q(θ̃) is N(μ, Σ), where Σ is a block diagonal matrix; the first n blocks are of order r each and the last block is of order g. Let C C^T be the unique Cholesky factorization of Σ, where C is lower triangular with positive diagonal elements. Then q(θ̃) = q(θ̃ | μ, C) and {μ, C} are optimized so that D_KL(q ‖ p) = ∫ q(θ̃ | μ, C) log{q(θ̃ | μ, C)/p(θ̃ | y)} dθ̃ is minimized. Since D_KL(q ‖ p) ≥ 0,

log p(y) ≥ E_q{log p(y, θ̃) − log q(θ̃ | μ, C)} = L(μ, C) = L,   (9)

where E_q denotes expectation with respect to q(θ̃ | μ, C). Minimizing the KL divergence is equivalent to maximizing a lower bound L on log p(y) with respect to {μ, C}.

For non-conjugate models such as GLMMs, E_q{log p(y, θ̃)} cannot be evaluated in closed form and L is intractable. To overcome this limitation, we maximize L with respect to {μ, C} using stochastic variational inference (Titsias and Lázaro-Gredilla, 2014). This approach is based on stochastic approximation (Robbins and Monro, 1951), where at each iteration t, the variational parameters are updated by taking a small step ρ_t in the direction of the stochastic gradients,

μ^{(t)} = μ^{(t−1)} + ρ_t ∇̂_μ L,   v(C^{(t)}) = v(C^{(t−1)}) + ρ_t ∇̂_{v(C)} L.   (10)

The stochastic gradients ∇̂_μ L and ∇̂_{v(C)} L are unbiased estimates of the true gradients ∇_μ L and ∇_{v(C)} L respectively. Under mild regularity conditions, the stochastic approximation algorithm will converge to a local maximum if the stepsizes {ρ_t} satisfy Σ_t ρ_t = ∞ and Σ_t ρ_t² < ∞ and the lower bound is concave (Spall, 2003).

From (9), unbiased estimates of the true gradients can be constructed by sampling θ̃ from q(θ̃ | μ, C). However, this approach often results in estimators with high variance (Paisley et al., 2012). Hence we employ the "reparametrization trick" (Kingma and Welling, 2014; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014), which introduces an invertible transformation s = C^{-1}(θ̃ − μ). The density of s, denoted by φ(s), is N(0, I_d). Let ℓ(θ̃) = log p(y, θ̃). From (9),

L(μ, C) = E_φ{ℓ(θ̃)} − E_φ{log q(θ̃ | μ, C)},   (11)

where E_φ denotes expectation with respect to φ(s) and θ̃ = C s + μ. Unbiased estimates of the true gradients can thus be constructed by sampling s from φ(s) instead of θ̃ from q(θ̃ | μ, C). The advantage of this reparametrization is that ℓ(θ̃) is now a function of {μ, C}, and this enables gradient information from ℓ(θ̃) to be utilized effectively.

We show that several unbiased estimators of ∇_μ L and ∇_{v(C)} L can be constructed from (11), but some have nicer properties at convergence. The estimators below are based on a single sample s generated from φ(s), as one sample often provides sufficient gradient information. It is straightforward to extend the estimators to a larger number of samples by averaging. Derivation details are given in the supplementary material.

If we evaluate the second term in (11) analytically, then L = E_φ{ℓ(θ̃)} + log|C| + c', where c' is a constant that does not depend on {μ, C}, and ∇_μ L = E_φ{∇_θ̃ ℓ(θ̃)}, ∇_{v(C)} L = E_φ[v{∇_θ̃ ℓ(θ̃) s^T + C^{-T}}]. This leads to the unbiased estimators

∇̂_{μ,1} L = ∇_θ̃ ℓ(θ̃),   ∇̂_{v(C),1} L = v{∇_θ̃ ℓ(θ̃) s^T + C^{-T}}.   (12)
Alternatively, we do not evaluate E_φ{log q(θ̃ | μ, C)} analytically but approximate both terms in (11) using the same samples. As log q(θ̃ | μ, C) depends on {μ, C} directly as well as through θ̃, we apply the chain rule to obtain

∇_μ L = E_φ{∇_θ̃ ℓ(θ̃) − ∇_θ̃ log q(θ̃ | μ, C) − ∇_μ log q(θ̃ | μ, C)},   (13)
∇_{v(C)} L = E_φ[v{∇_θ̃ ℓ(θ̃) s^T − ∇_θ̃ log q(θ̃ | μ, C) s^T} − ∇_{v(C)} log q(θ̃ | μ, C)],   (14)

where ∇_θ̃ log q(θ̃ | μ, C) = −∇_μ log q(θ̃ | μ, C) = −C^{-T} s and ∇_{v(C)} log q(θ̃ | μ, C) = v{C^{-T}(s s^T − I_d)}. If we use all the terms in (13) and (14), the estimators in (12) will be recovered after simplification. However, as E_φ(s) = 0 and E_φ(s s^T) = I_d,

E_φ{∇_θ̃ log q(θ̃ | μ, C)} = E_φ{∇_μ log q(θ̃ | μ, C)} = 0,   E_φ{∇_{v(C)} log q(θ̃ | μ, C)} = 0.

Alternatively, note that E_φ{∇_μ log q(θ̃ | μ, C)} = 0 and E_φ{∇_{v(C)} log q(θ̃ | μ, C)} = 0 as they form the expectation of the score. Unbiased estimators can thus be constructed by omitting the last term in (13) and (14). We thus obtain

∇̂_{μ,2} L = ∇_θ̃ ℓ(θ̃) + C^{-T} s,   ∇̂_{v(C),2} L = v{∇_θ̃ ℓ(θ̃) s^T + C^{-T} s s^T}.   (15)
A third unbiased estimator, ∇̂_{μ,3} L = ∇_θ̃ ℓ(θ̃) − C^{-T} s, can be obtained by omitting the second term in (13). Next, we show that ∇̂_{μ,2} L and ∇̂_{v(C),2} L have smaller variance at convergence.

Consider a second-order Taylor approximation to ℓ(θ̃) at the posterior mode θ̃*,

ℓ(θ̃) ≈ ℓ(θ̃*) + ½ (θ̃ − θ̃*)^T ∇²_θ̃ ℓ(θ̃*)(θ̃ − θ̃*).

This implies that p(θ̃ | y) can be approximated by N(θ̃*, −{∇²_θ̃ ℓ(θ̃*)}^{-1}). Differentiating the Taylor approximation with respect to θ̃ yields

∇_θ̃ ℓ(θ̃) ≈ ∇²_θ̃ ℓ(θ̃*)(θ̃ − θ̃*).   (16)

Since q(θ̃ | μ, Σ) is a Gaussian approximation to p(θ̃ | y), μ ≈ θ̃* and Σ ≈ −{∇²_θ̃ ℓ(θ̃*)}^{-1} at convergence. Thus ∇_θ̃ ℓ(θ̃) ≈ −Σ^{-1}(θ̃ − μ) = −C^{-T} s and

∇̂_{μ,1} L ≈ −C^{-T} s,   ∇̂_{μ,2} L ≈ 0,   ∇̂_{v(C),1} L ≈ v{−C^{-T} s s^T + C^{-T}},   ∇̂_{v(C),2} L ≈ 0.

The estimators ∇̂_{μ,2} L and ∇̂_{v(C),2} L are close to zero as the contributions from ℓ(θ̃) and log q(θ̃ | μ, C) cancel each other out when the stochastic approximation algorithm is close to convergence. However, ∇̂_{μ,1} L and ∇̂_{v(C),1} L still contain a certain amount of noise, and ∇̂_{μ,3} L ≈ −2C^{-T} s is even noisier than ∇̂_{μ,1} L. From (16),

cov_φ(∇̂_{μ,1} L) = cov_q(∇_θ̃ ℓ(θ̃)) ≈ ∇²_θ̃ ℓ(θ̃*) Σ {∇²_θ̃ ℓ(θ̃*)}^T ≈ Σ^{-1},
cov_φ(∇̂_{μ,2} L) = cov_q(∇_θ̃ ℓ(θ̃) + Σ^{-1} θ̃) ≈ {∇²_θ̃ ℓ(θ̃*) + Σ^{-1}} Σ {∇²_θ̃ ℓ(θ̃*) + Σ^{-1}}^T ≈ 0,

which supports the claim that ∇̂_{μ,2} L is less noisy than ∇̂_{μ,1} L at convergence. In our experience, the less noisy estimators contribute greatly to improved convergence of the stochastic variational algorithm (see also Roeder et al., 2017).

As the update for v(C) in (10) does not ensure that the diagonal elements of C remain positive, we introduce a lower triangular matrix C* such that C*_ii = log(C_ii) and C*_ij = C_ij if i ≠ j, and apply stochastic gradient updates to C* instead. Let D_C be a diagonal matrix of order d(d + 1)/2 whose diagonal is v(J_C), where J_C is a d × d matrix with diagonal given by diag(C) and ones everywhere else. Then ∇_{v(C*)} L = D_C ∇_{v(C)} L. The stochastic variational algorithm using ∇̂_{μ,2} L and ∇̂_{v(C),2} L is summarized in Algorithm 1.

Algorithm 1 (RVB algorithm). Initialize μ^{(1)} = 0 and C^{(1)} = blockdiag(I_{nr}, c I_g) for a small positive constant c. For t = 1, ..., T,
1. Generate s ∼ N(0, I_d).
2. Compute θ̃^{(t)} = C^{(t)} s + μ^{(t)} and G^{(t)} = ∇_θ̃ ℓ(θ̃^{(t)}) + C^{(t)−T} s.
3. Update μ^{(t+1)} = μ^{(t)} + ρ_t G^{(t)} and v(C*^{(t+1)}) = v(C*^{(t)}) + ρ_t D_C v{G^{(t)} s^T}.
4. Obtain C^{(t+1)} from C*^{(t+1)}.

For computing the stepsize ρ_t, we use Adam (Kingma and Ba, 2015), which automatically adapts to different parameters. There are some parameters in Adam which can be used to fine-tune the stepsize, but we find that the default values work satisfactorily. For diagnosing the convergence of Algorithm 1, we compute an unbiased estimate of L at each iteration, L̂ = ℓ(θ̃) − log q(θ̃ | μ, C), where θ̃ is computed in step 2 of Algorithm 1, and μ and C are the current estimates. As these estimates are stochastic, we consider the lower bound averaged over every 1000 iterations and monitor the path of these means. At the beginning, the means tend to increase monotonically. However, when the algorithm is close to convergence, these means start bouncing around the true maximum. Hence we consider the gradient of a least squares regression line fitted to the past τ means, and Algorithm 1 is terminated once the gradient becomes negative. This process is illustrated in Figure 2. For the experiments, we set τ = 5. When the number of means is less than τ, we only fit the regression lines to existing means.

Fig. 2. Plot of average lower bound against iteration for the epilepsy data Model II. Dotted lines are the fitted least squares regression lines, whose gradients decrease to zero.

The RVB algorithm does not return the posterior distributions of {b_i} directly as it considers a Gaussian approximation for the transformed local variables {b̃_i}. Marginal posterior distributions of {b_i} can be estimated using simulation:
(a) Generate θ_G from q(θ_G).
(b) For i = 1, ..., n,
  (i) generate b̃_i from q(b̃_i);
  (ii) compute λ_i and L_i from θ_G using (7) or (8);
  (iii) compute b_i = L_i b̃_i + λ_i.
While this takes more work, a possible advantage is that the posterior distributions of {b_i} are not constrained to be Gaussian and hence may be better able to accommodate any skewness present in the true marginal posterior p(b_i | y).
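A single iteration of Algorithm 1 can be coded compactly. The Julia sketch below is a simplified illustration rather than our full implementation: it uses a plain fixed step size ρ in place of Adam, treats the gradient ∇_θ̃ ℓ(θ̃) as a user-supplied function, and stores C as a dense lower triangular matrix.

```julia
using LinearAlgebra

# One stochastic gradient step of Algorithm 1, using the estimators in (15)
function rvb_step!(μ::Vector, C::Matrix, gradℓ, ρ)
    d = length(μ)
    s = randn(d)
    θ = C * s + μ                      # θ̃ = C s + μ
    G = gradℓ(θ) + (C' \ s)            # ∇_θ̃ ℓ(θ̃) + C^{-T} s
    μ .+= ρ .* G
    ΔC = G * s'                        # v{G s^T} restricted to the lower triangle below
    for j in 1:d, i in j:d
        if i == j
            # update on the log scale so the diagonal stays positive (C*_{ii} = log C_{ii})
            C[i, j] *= exp(ρ * C[i, j] * ΔC[i, j])
        else
            C[i, j] += ρ * ΔC[i, j]
        end
    end
    return μ, C
end
```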
6. Gradient of the log joint density
To implement Algorithm 1, we require ∇_θ̃ ℓ(θ̃) = [∇_{b̃_1} ℓ(θ̃), ..., ∇_{b̃_n} ℓ(θ̃), ∇_{θ_G} ℓ(θ̃)]. As b_i = L_i b̃_i + λ_i, p(b̃_i | θ_G) = p(b_i | θ_G) |L_i|. Hence p(y, θ̃) = p(y, θ) ∏_{i=1}^n |L_i| and the log joint density of the model in (1) after reparametrization is

ℓ(θ̃) = log p(θ_G) + Σ_{i=1}^n {log p(y_i, b_i | θ_G) + log|L_i|}.

The gradients ∇_{b̃_i} ℓ(θ̃) for i = 1, ..., n, and ∇_{θ_G} ℓ(θ̃) are derived in Theorem 2. This result is applicable to any model of the form in (1). The proof of Theorem 2 is given in the supplementary material.

Theorem 2.
For i = 1, ..., n, let a_i = ∇_{b_i} log p(y_i, b_i | θ_G), B_i = L_i^T a_i b̃_i^T and B̃_i = B̄_i + B̄_i^T − dg(B_i). Then ∇_{b̃_i} ℓ(θ̃) = L_i^T a_i for i = 1, ..., n. Suppose θ_G is partitioned as [θ_{G1}^T, ..., θ_{GM}^T]^T. For m = 1, ..., M,

∇_{θ_{Gm}} ℓ(θ̃) = Σ_{i=1}^n (∇_{θ_{Gm}} λ_i) a_i + ½ Σ_{i=1}^n {∇_{θ_{Gm}} vec(Λ_i)} vec(Λ_i^{-1} + L_i^{-T} B̃_i L_i^{-1}) + Σ_{i=1}^n ∇_{θ_{Gm}} log p(y_i, b_i | θ_G) + ∇_{θ_{Gm}} log p(θ_G).

For the GLMM in Section 4, after deriving λ_i and Λ_i as in (7) or (8), we reparametrize the model by transforming each b_i so that b̃_i = L_i^{-1}(b_i − λ_i), where L_i L_i^T = Λ_i. From Theorem 2, to evaluate ∇_{b̃_i} ℓ(θ̃) we require a_i = Z_i^T (y_i − g_i(η_i)) − Ω b_i for i = 1, ..., n. As θ_G is partitioned into two components, β and ω, we use Theorem 2 to find ∇_β ℓ(θ̃) and ∇_ω ℓ(θ̃). We have ∇_β log p(θ_G) = −β/σ_β², ∇_β log p(y_i, b_i | θ_G) = X_i^T (y_i − g_i(η_i)) and the results in Lemma 2.

Lemma 2.
Let D_W be a diagonal matrix of order r(r + 1)/2 whose diagonal is given by v(J_W), where J_W is an r × r matrix with J_{W,ii} = W_ii and J_{W,ij} = 1 if i ≠ j. In addition, let u be a vector of length r whose ith element is r − i + 2. Then

(a) ∇_ω log p(y_i, b_i | θ_G) = D_W v(W^{-T} − b_i b_i^T W),
(b) ∇_ω log p(θ_G) = D_W v{(ν − r − 1) W^{-T} − S^{-1} W} + v(diag(u)).

Next, we need to find ∇_β vec(Λ_i), ∇_β λ_i, ∇_ω vec(Λ_i) and ∇_ω λ_i for approaches 1 and 2. These results are then substituted into Theorem 2 to obtain the results in Lemmas 3 and 4. Proofs of Lemmas 2, 3 and 4 are given in the supplementary material.

Lemma 3.
For approach 1, ∇_β ℓ(θ̃) = Σ_{i=1}^n X_i^T (y_i − g_i(η_i) − H_i(η̂_i) Z_i Λ_i a_i) − β/σ_β² and

∇_ω ℓ(θ̃) = v(diag(u)) + D_W v[ (n + ν − r − 1) W^{-T} − S^{-1} W − Σ_{i=1}^n (b_i b_i^T + Λ_i a_i λ_i^T + λ_i a_i^T Λ_i + Λ_i + L_i B̃_i L_i^T) W ].

Lemma 4.
For approach 2, let α_i = h'''_i(X_i β + Z_i b̂_i) ⊙ diag{Z_i (Λ_i + L_i B̃_i L_i^T) Z_i^T}, where ⊙ denotes the elementwise product. Then

∇_β ℓ(θ̃) = Σ_{i=1}^n X_i^T {y_i − g_i(η_i) − H_i(X_i β + Z_i b̂_i) Z_i Λ_i (a_i − Z_i^T α_i) − α_i} − β/σ_β²,

∇_ω ℓ(θ̃) = v(diag(u)) + D_W v[ (n + ν − r − 1) W^{-T} − S^{-1} W − Σ_{i=1}^n {b_i b_i^T + Λ_i (a_i − Z_i^T α_i) b̂_i^T + b̂_i (a_i − Z_i^T α_i)^T Λ_i + Λ_i + L_i B̃_i L_i^T} W ].
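Theorem 2 gives ∇_{b̃_i} ℓ(θ̃) = L_i^T a_i, which for the GLMM reduces to a few matrix-vector products. The Julia sketch below is illustrative only (names are ours) and assumes g_i(·) is supplied for the model at hand.

```julia
# ∇_{b̃_i} ℓ(θ̃) = L_i' a_i, with a_i = Z_i'(y_i − g_i(η_i)) − Ω b_i and b_i = L_i b̃_i + λ_i
function grad_btilde(btil, λi, Li, Xi, Zi, yi, β, Ω, gi)
    bi = Li * btil + λi
    η  = Xi * β + Zi * bi
    ai = Zi' * (yi - gi(η)) - Ω * bi
    return Li' * ai
end
```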
7. Extension to large data sets
In this section, we discuss how the RVB algorithm can be extended to large data using a divide and recombine strategy (Broderick et al., 2013; Tran et al., 2016). Suppose the n observations in the data are partitioned into V parts such that y = (y_1, ..., y_V)^T and let b̃_v be the set of transformed local variables corresponding to y_v. The true posterior is

p(θ̃ | y) = p(b̃, θ_G | y) ∝ p(θ_G) ∏_{v=1}^V p(y_v | b̃_v, θ_G) p(b̃_v | θ_G) ∝ ∏_{v=1}^V {p(y_v | b̃_v, θ_G) p(b̃_v | θ_G) p(θ_G)} / p(θ_G)^{V−1} ∝ ∏_{v=1}^V p(θ_G, b̃_v | y_v) / p(θ_G)^{V−1}.

If we estimate p(θ_G, b̃_v | y_v) by our variational approximation q_v(θ_G) q_v(b̃_v), which is obtained using the portion y_v of the data only, then p(b̃, θ_G | y) ∝ ∏_{v=1}^V {q_v(θ_G) q_v(b̃_v)} / p(θ_G)^{V−1} approximately. Due to the assumption of independent variational posteriors for the local and global variables, we can integrate out the random effects b̃ on both sides to obtain p(θ_G | y) ∝ ∏_{v=1}^V q_v(θ_G) / p(θ_G)^{V−1}. Suppose p(θ_G) is N(μ_0, Σ_0). As q_v(θ_G) is Gaussian, say N(μ_v, Σ_v),

p(θ_G | y) ∝ exp[ −½ { Σ_{v=1}^V (θ_G − μ_v)^T Σ_v^{-1} (θ_G − μ_v) − (V − 1)(θ_G − μ_0)^T Σ_0^{-1} (θ_G − μ_0) } ].

Thus the distribution p(θ_G | y) can be approximated by N(μ, Σ), where

Σ = ( Σ_{v=1}^V Σ_v^{-1} − (V − 1) Σ_0^{-1} )^{-1},   μ = Σ ( Σ_{v=1}^V Σ_v^{-1} μ_v − (V − 1) Σ_0^{-1} μ_0 ).

Another possible way of extending RVB to large data sets is to fix q(b̃_i) as N(0, I_r) and optimize only q(θ_G), since p(b̃_i | θ_G, y) is approximately N(0, I_r) after transformation. To increase efficiency, one can also replace ∇_{θ_{Gm}} ℓ(θ̃) by an unbiased estimate, computed by choosing a random sample of M subjects from {1, ..., n} at each iteration, say S = {i_1, ..., i_M}, and replacing Σ_{i=1}^n in Theorem 2 by (n/M) Σ_{i∈S}. Further work is required to determine the accuracy of such an approach.
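The recombination step amounts to adding the natural parameters of the Gaussian subposteriors and subtracting the prior contribution V − 1 times; a minimal Julia sketch (illustrative function name, assuming the subposterior and prior moments are available) is given below.

```julia
using LinearAlgebra

# Recombine V Gaussian approximations q_v(θ_G) = N(μ_v, Σ_v) under prior p(θ_G) = N(μ0, Σ0)
function recombine(μs::Vector, Σs::Vector, μ0, Σ0)
    V = length(μs)
    P = sum(inv, Σs) - (V - 1) * inv(Σ0)                       # combined precision
    h = sum(inv(Σs[v]) * μs[v] for v in 1:V) - (V - 1) * (inv(Σ0) * μ0)
    Σ = inv(Symmetric(P))
    return Σ * h, Σ                                             # approximate mean and covariance of θ_G | y
end
```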
8. Experimental results
We present the results of fitting the RVB algorithm to GLMMs and also use a large dataset to investigate the quality of results obtained via dataset partitioning and parallel processing. Results of the RVB algorithm are compared with GVA (Tan and Nott, 2018) and INLA, with the posterior distributions estimated using MCMC via RStan regarded as ground truth. We refer to the RVB algorithm implemented using approaches 1 and 2 as RVB1 and RVB2 respectively.

GVA similarly approximates p(θ | y) by N(μ, Σ) and is also based on the stochastic variational algorithm (Titsias and Lázaro-Gredilla, 2014). The posterior dependence structure is captured using a Cholesky decomposition T T^T of the precision matrix Σ^{-1}, where T is lower triangular. T and Σ^{-1} are assumed to be sparse: T is block lower triangular with diagonal blocks T_{11}, ..., T_{nn}, T_{n+1,n+1}, nonzero off-diagonal blocks T_{n+1,1}, ..., T_{n+1,n} in the last block row and zeros elsewhere, while Σ^{-1} is block diagonal apart from its last block row and column, with diagonal blocks Σ^{-1}_{11}, ..., Σ^{-1}_{nn}, Σ^{-1}_{n+1,n+1} and off-diagonal blocks Σ^{-1}_{n+1,1}, ..., Σ^{-1}_{n+1,n} (and their transposes). The block matrices Σ^{-1}_{11}, ..., Σ^{-1}_{nn}, Σ^{-1}_{n+1,n+1} correspond to b_1, ..., b_n, θ_G respectively. As {b_i} are independent conditional on θ_G a posteriori, a block diagonal structure is assumed for these components, while Σ^{-1}_{n+1,i} captures the conditional posterior dependence between b_i and θ_G. On the other hand, RVB minimizes posterior dependence between local and global variables through model reparametrization and considers a Cholesky decomposition of Σ, which has a block diagonal structure. The number of variational parameters in RVB is smaller than in GVA by nrg (corresponding to the absence of the off-diagonal blocks T_{n+1,1}, ..., T_{n+1,n}). This reduction can be significant when n, r and g are large. GVA and RVB both account for posterior dependence, albeit in different manners, and we observe that RVB can often achieve a better posterior approximation and higher lower bound than GVA at a faster convergence rate. The code for RVB and GVA is written in Julia version 0.6.4 and is available as supplementary material. Ormerod and Wand (2012) consider non-Bayesian Gaussian variational approximate inference for GLMMs (also known as GVA), which derives estimates of model parameters by optimizing a lower bound on the log-likelihood. They use Gaussian approximations of the conditional distributions of random effects and then integrate out the random effects using adaptive Gauss-Hermite quadrature.

For comparability, the same priors are used for all algorithms and we set σ_β² = 100. MCMC is performed using either the centered or noncentered parametrization depending on which is more efficient, and we run four chains in parallel with 25,000 iterations each. The first half of each chain is used as warmup and the remaining 50,000 iterations from all four chains are used for inference with no thinning. We checked the trace plots for convergence and ensured that the potential scale reduction R̂ ≤ 1.
001 and that the effective sample size per iteration is at least 0.1. All algorithms are run on an Intel Core i9-9900K CPU @ 3.60GHz with 16.0 GB RAM. Lower bounds are estimated using 1000 simulations in each case and exclude constants independent of the variational parameters. The stopping criterion described in Section 5.2 was used for RVB1, RVB2 and GVA.

We simulate data from the Poisson, Bernoulli and binomial GLMMs to compare the performance of RVB1 and RVB2 in different scenarios, and identify the circumstances in which RVB2 is preferred to RVB1 despite it being computationally more intensive. Consider the random intercept model, η_ij = β_0 + β_1 x_ij + b_i, b_i ∼ N(0, σ_b²), for i = 1, ..., n, j = 1, ..., n_i. We let n = 500, n_i = 7 for each i and σ_b = 1.
5. MCMC results reported in Table 3correspond to the centered parametrization.For the Poisson model, we let x ij = ( j − /
10 and consider β = − . β = − β = 1 . β = 0 . ,
5] and45.3% greater than 5. Results in Table 3 indicate that RVB2 attains a higher lowerbound and hence provides a more accurate posterior approximation than both RVB1and GVA for model I. In particular, the posterior variance of the global parameters iscaptured more accurately by RVB2. For model II, RVB1 performs as well as RVB2 andboth perform better than GVA. These observations are in line with the discussion inSection 4.1.2 that RVB2 works better than RVB1 for data with small counts. However,RVB2 is slower than RVB1 for both models even though it converges in the same orsmaller number of iterations. GVA takes much more iterations to converge than RVB,especially for model II. For MCMC, centering is much faster than noncentering (484.8 sfor model I and 840.9 s for model II) and has much higher effective sample sizes. INLAprovides a fit closest to MCMC in both cases and is the fastest.For the Bernoulli model, we set β = − . β = 4 . x ij ∼ Bernoulli(0 .
5) for model I. For model II, β = 0, β = 1 and x ij = ( j − /
10. Thisresults in { µ ij } being closer to 0 or 1 for model I, and more concentrated about 0.5for model II (see Figure 3). Table 3 shows that RVB2 produces a fit for the globalparameters closest to MCMC for both models, among all other methods. As expected,RVB1 performs worse than RVB2 for model I and nearly as well for model II. For MCMC,noncentering takes more than twice as long (385.3 s for model I and 461.4 s for modelII), although effective sample sizes are higher than centering for model I.We use the same settings for the binomial model as we did for the Bernoulli model,and set m ij = 20 uniformly. For model I, RVB1’s performance is still weaker than othervariational methods even though m ij is large. For model II, RVB1 performs as well asRVB2. All methods produce marginal posterior distributions for the global parametersthat are very close to that of MCMC. RVB2 required the least number of iterations toconverge for both models among the variational methods. For MCMC, noncenteringrequired a shorter runtime of 939.7 s for model I and 1466.9 s for model II, but theeffective sample size per iteration for β is less than 0.1 for both models.The simulations support recommendations in Section 4.1.2 that RVB1 is sufficientfor count data without an excess of zeros, and binary data or binomial outcomes whosesuccess probabilities are not concentrated at 0 or 1. Otherwise, RVB2 is preferred. Inaddition, we observe that RVB is able to yield improvements in estimating the posteriorvariance as compared to GVA, which has a tendency towards underestimation. L. S. L. Tan
Table 3.
Simulated data: estimates of posterior mean and standard deviation for eachparameter, runtime in seconds (number of iterations ( × ) in brackets) and lower bound. GVA RVB1 RVB2 INLA MCMC P o i ss o n I β -2.46 ± ± ± ± ± β -2.26 ± ± ± ± ± σ ± ± ± ± ± L -1316.9 -1316.7 -1316.3 - -II β ± ± ± ± ± β ± ± ± ± ± σ ± ± ± ± ± L B e r n o u lli I β -2.54 ± ± ± ± ± β ± ± ± ± ± σ ± ± ± ± ± L -1398.2 -1398.2 -1397.8 - -II β ± ± ± ± ± β ± ± ± ± ± σ ± ± ± ± ± L -2127.3 -2127.2 -2127.0 - - B i n o m i a l I β -2.52 ± ± ± ± ± β ± ± ± ± ± σ ± ± ± ± ± L -24301.3 -24301.6 -24301.3 - -II β ± ± ± ± ± β ± ± ± ± ± σ ± ± ± ± ± L -37531.5 -37531.4 -37531.4 - - Model I m ij F r equen cy Model II m ij F r equen cy Fig. 3.
Distribution of µ ij for data simulated from Bernoulli and binomial models. eparametrized variational Bayes Table 4.
Epilepsy data. Estimates of posterior mean and standard deviation for eachparameter, runtime in seconds (number of iterations ( × ) in brackets) and lower bound. GVA RVB1 RVB2 INLA MCMCI β ± ± ± ± ± β Base ± ± ± ± ± β Trt -0.93 ± ± ± ± ± β Base × Trt ± ± ± ± ± β Age ± ± ± ± ± β V4 -0.16 ± ± ± ± ± σ ± ± ± ± ± L β ± ± ± ± ± β Base ± ± ± ± ± β Trt -0.93 ± ± ± ± ± β Base × Trt ± ± ± ± ± β Age ± ± ± ± ± β Visit -0.26 ± ± ± ± ± σ ± ± ± ± ± σ ± ± ± ± ± ρ ± ± ± ± ± L In the epilepsy data of Thall and Vail (1990) ( R package MASS , data(epil) ), n = 59epileptics were assigned either a new drug Progabide or a placebo randomly. The re-sponse y ij is the number of epileptic seizures of patient i in the two weeks before clinicvisit j for j = 1 , . . . ,
4. The covariates for patient i are Base i (log of 1/4 the numberof baseline seizures), Trt i (1 for drug treatment and 0 for placebo), Age i (log of ageof patient at baseline, centered at zero), Visit ij (coded as − . − .
1, 0 . . j = 1 , . . . , µ ij = β + β Base
Base i + β Trt
Trt i + β Base × Trt
Base i × Trt i + β Age
Age i + β V4 V4 ij + b i , b i ∼ N (0 , σ ) , Model II : log µ ij = β + β Base
Base i + β Trt
Trt i + β Base × Trt
Base i × Trt i + β Age
Age i , + β Visit
Visit ij + b i + b i Visit ij , (cid:20) b i b i (cid:21) ∼ N (0 , Ω − ) , Ω − = (cid:20) σ ρσ σ ρσ σ σ (cid:21) . For model I, the prior for σ − is Gamma(0.5, 0.0151) and for model II, the prior for Ωis W (3 , S ), where S = 11 . S = − . S = 0 . L. S. L. Tan
Fig. 4.
Epilepsy data. Average lower bound attained by GVA, RVB1, RVB2 every iterations. mean sd − − Model I (RVB1) mean sd − − Model I (RVB2) mean sd − − Model II (RVB1) mean sd − − Model II (RVB2)
Fig. 5.
Epilepsy data. Boxplots of posterior means and standard deviations of { ˜ b i } . but underestimates the posterior variance of several global parameters. Figure 4 showsthat RVB converges towards the mode of the lower bound much faster than GVA. Inthe first 1000 iterations, RVB has achieved a lower bound more than 3000 while GVAonly reaches slightly over 2000. We observe again that for data with large counts, GVAconverges very slowly, taking ∼ { ˜ b i } estimated using RVB are somewhatclose to zero while the standard deviations are close to one. This holds more strongly forRVB2 than RVB1 as Gaussian approximation of the conditional distributions of { b i } ismore accurate in RVB2 (mode found using Newton-Raphson). These observations echothe proposition that { ˜ b i } will be approximately uncorrelated a posteriori with zero meansand unit variances after model reparametrization. Indeed, this property is very usefulin improving the convergence of RVB as it makes initializing and optimizing variationalparameters for { ˜ b i } much easier. In high-dimensional problems with large number oflocal variables, computation time can be reduced by exploiting this feature.Next we compare the marginal posteriors of random effects { b i } estimated using GVAand RVB with MCMC. For GVA, variational posterior of each b i is Gaussian, where themean and variance can be obtained directly from the approximating density. For RVB,we use the approach in Section 5.3 to obtain 50000 samples of each b i . Figure 6 plots r i = (mean VA b i − mean MCMC b i ) / sd VA b i and r i = sd MCMC b i / sd VA b i (17) eparametrized variational Bayes mean i r i mean i r i mean i r i sd i r i sd i r i sd i r i GVARVB1RVB2
Fig. 6.
Epilepsy data. Plots of r i and r i against i for GVA, RVB1 and RVB2. First column isfor Model I, second and third columns are for b i and b i of Model II respectively. where mean method b i and sd method b i denote the marginal posterior mean and standard de-viation of b i estimated using a certain method. There are some instances where GVAunderestimates the standard deviation quite severely while results of RVB are more uni-form across all subjects and closer to MCMC generally. The posterior mean of b i wasseverely underestimated for subject 10. In the seeds germination data (Crowder, 1978) ( R package hglm , data(seeds) ), theresponse y i is the number of seeds that germinated out of m i , which were brushed onplate i for i = 1 , . . . ,
21. This data arise from a 2 factorial experiment and the factorsare type of seed (seed i = 1 if O. aegyptica 73 and 0 if O. aegyptica 75) and type of rootextract (extract i = 1 if cucumber and 0 if bean). We consider the binomial GLMM forhandling overdispersion (Breslow and Clayton, 1993), where y i ∼ binomial( m i , p i ),logit( p i ) = β + β seed seed i + β extract extract i + b i , b i ∼ N (0 , σ ) . The prior for σ − is Gamma(0 . , . { y i /m i } (Figure 7, left) does not indicate anyconcentration close to 0 or 1. Hence we postulate that RVB1 will perform as well asRVB2, and this is confirmed in Table 5 (they achieve the same lower bound). RVB alsoachieves a higher lower bound than GVA, converging in half the number of iterations.The center plot of Figure 7 shows a trend similar to that observed in the epilepsy data,where RVB converges to the mode much faster than GVA. The right plot shows that themeans and standard deviations of { ˜ b i } are concentrated around zero and one respectively(RVB2 more so than RVB1), which suggests that the affine transformation is effective innormalizing the random effects. Marginal posteriors of the global parameters estimatedby INLA are closest to and indistinguishable from MCMC. MCMC results are from L. S. L. Tan estimated probabilities F r equen cy Number of iterations (in thousands) Lo w e r B ound GVARVB1RVB2 RVB1 RVB2 RVB1 RVB2−0.20.00.20.40.60.81.0 mean sd
Fig. 7.
Seeds data. Left: histogram of { y i /m i } . Center: lower bound attained after every iterations. Right: boxplots of posterior means and standard deviations of { ˜ b i } . Table 5.
Seeds data. Estimates of posterior mean and standard deviation for each parameter, runtime in seconds (number of iterations in brackets) and lower bound for GVA, RVB1, RVB2, INLA and MCMC. The lower bounds attained are −550.4 (GVA), −549.9 (RVB1) and −549.9 (RVB2).
MCMC results are from centering; noncentering takes 16.3 s but the effective sample size for σ is lower than under centering. RVB also performs quite well, while GVA underestimates the posterior variance of σ. Figure 8 summarizes the differences in the means and standard deviations of the marginal posteriors of {b_i} between the variational methods and MCMC. RVB is generally able to capture the posterior means and standard deviations better than GVA, although both methods underestimate the true standard deviation.
The toenail dataset (R package HSAUR3, data("toenail")) contains the results of a clinical trial comparing two oral antifungal treatments for toenail infection in 294 patients (De Backer et al., 1998).
Fig. 8.
Seeds data. Plots of r_{1i} and r_{2i} against i for GVA, RVB1 and RVB2.
Table 6.
Toenail data. Estimates of posterior mean and standard deviation for each parameter, runtime in seconds (number of iterations in brackets) and lower bound for GVA, RVB1, RVB2, INLA and MCMC. The lower bounds attained are −645.8 (GVA), −646.6 (RVB1) and −645.1 (RVB2).
Fig. 9.
Toenail data. Plots of marginal posterior distributions of the global parameters.
The binary response y_ij of patient i is recorded at the jth clinic visit, 1 ≤ j ≤ 7. For the ith patient, 250mg of terbinafine (Trt_i = 1) or 200mg of itraconazole (Trt_i = 0) is received per day, while t_ij denotes the time in months at which patient i is evaluated at the jth visit. We consider the random intercept model,

logit(p_ij) = β_0 + β_Trt Trt_i + β_t t_ij + β_{Trt×t} Trt_i × t_ij + b_i,    b_i ∼ N(0, σ²),

for i = 1, ..., 294 and 1 ≤ j ≤ 7. The prior for σ^{-2} is Gamma(0.5, 0.4962).
Results in Table 6 indicate that RVB2 achieved the highest lower bound, followed by GVA and RVB1. For binary data, it is difficult to tell whether the success probabilities are close to 0 or 1, and we recommend using RVB2 as the default. RVB2 converged in the smallest number of iterations compared with RVB1 and GVA. From Figure 9, RVB2 performs as well as INLA for β_Trt, slightly better for β_t and β_{Trt×t}, and worse for β_0 and σ. However, it still improves significantly on GVA and RVB1. For MCMC, noncentering takes a longer time (404.3 s) than centering.
We consider a large longitudinal dataset from the Heart and Estrogen/Progestin Replacement Study (HERS, Hulley et al., 1998), which is available online. The study aims to determine whether estrogen plus progestin therapy reduces the risk of coronary heart disease (CHD) events for post-menopausal women with CHD. In this clinical trial, 2763 women were randomly assigned to a hormone or placebo group and followed up over the next five years with annual clinic visits. Some patients did not turn up for all five visits, and data for certain covariates are missing.
Table 7.
HERS data. Estimates of posterior mean and standard deviation for each parameter, runtime in seconds (number of iterations in brackets) and lower bound for GVA, RVB1, RVB2, INLA and MCMC. The lower bounds attained are −5041.9 (GVA), −5041.4 (RVB1) and −5041.1 (RVB2).
Fig. 10.
HERS data. Marginal posterior distributions of the global parameters.
Here we consider the 2031 women for whom data on all covariates are available. The binary response y_ij is an indicator of whether the systolic blood pressure of patient i is higher than 140 at the jth visit. We consider the random intercept model,

logit(p_ij) = β_0 + β_age age_i + β_BMI BMI_ij + β_HTN HTN_ij + β_visit visit_ij + b_i,

where b_i ∼ N(0, σ²), for i = 1, ..., 2031 and 0 ≤ j ≤ 5. For patient i, age_i is the age of the patient at baseline, BMI_ij is the body mass index at the jth visit, HTN_ij is an indicator of whether high blood pressure medication is taken at the jth visit, and visit_ij is coded as −1, −0.6, −0.2, 0.2, 0.6, 1 for j = 0, 1, ..., 5. The prior for σ^{-2} is Gamma(0.5, 0.5079).
From Table 7, RVB2 achieves the highest lower bound, followed by RVB1 and GVA. RVB2 provides a very good fit to the marginal posteriors of the global parameters, better than INLA, although it still underestimates the mean and standard deviation of σ (see Figure 10). The improvement it provides over GVA in estimating the posterior variance is evident. Both RVB1 and RVB2 converge in 11 iterations, which is less than half the number taken by GVA. As a result, RVB2 is faster than GVA even though each iteration of RVB2 is computationally more intensive. Figure 11 shows that GVA converges to the mode of the lower bound very slowly, unlike RVB1 and RVB2. For MCMC, centering (11000.2 s) takes more than thrice as long as noncentering, and warnings about exceeding the maximum treedepth were issued. The boxplots in Figure 11 show that the means and variances of {b̃_i} are again close to zero and one respectively. However, the variation seems larger than in the previous examples, even for RVB2. This may suggest that approximating p(b_i | θ_G, y) by a Gaussian distribution is not as effective for logistic mixed models, due to skewness or heavy tails. A direction for future work is to use skewed Gaussians or mixtures of Gaussians to improve the conditional posterior approximation.
Fig. 11.
HERS data. Lower bound attained against the number of iterations (in thousands), and boxplots of posterior means and standard deviations of {b̃_i}.
Table 8.
HERS data. Estimates of posterior mean and standard deviation for each parameter, runtime in seconds (number of iterations in brackets) and lower bound. For the divide and recombine strategy, the means over ten replicates are reported with the standard deviation in brackets. A normal prior is used for ω. The lower bounds attained are −5040.7 (RVB1), −5174.4 (2.2) (RVB1, divide), −5040.3 (RVB2) and −5095.0 (1.3) (RVB2, divide).
We investigate the performance of the divide and recombine strategy by partitioning the subjects randomly into three groups, each with 677 subjects. RVB1 and RVB2 were applied to the three partial datasets in parallel and the results combined using the techniques in Section 7. Here we use a normal prior on ω instead of a Gamma prior on σ^{-2}, for greater ease in applying the divide and recombine strategy. This experiment was replicated ten times. Table 8 shows the results of RVB1 and RVB2 (with a normal prior on ω) and the corresponding results obtained using the divide and recombine strategy. The runtime is reduced on average by a factor of 5.9 for RVB1 and 4.2 for RVB2. We also compute an estimate of the lower bound attained in each experiment, and the mean and standard deviation across the ten trials are reported; RVB2 performs better than RVB1 in this respect. Estimates of the global parameters obtained using the divide and recombine strategy are quite close to those obtained using the full algorithm, and their standard deviation is less than or equal to 0.01 in all cases.
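The recombination step of Section 7 is not reproduced here, but the sketch below shows the generic rule used in parallel and streaming VB (Broderick et al., 2013; Tran et al., 2016) for combining K Gaussian sub-approximations q_k(θ_G) = N(µ_k, Σ_k) fitted to disjoint partitions under a shared prior N(µ_0, Σ_0): multiply the q_k and divide out the K − 1 extra copies of the prior. The function name recombine_gaussians and the Gaussian-prior assumption are ours, and the paper's exact rule may differ in detail.

# Combine K Gaussian sub-approximations fitted to disjoint data partitions.
# Precision: P = sum_k Sigma_k^{-1} - (K - 1) Sigma_0^{-1}
# Mean:      m = P^{-1} { sum_k Sigma_k^{-1} mu_k - (K - 1) Sigma_0^{-1} mu_0 }
recombine_gaussians <- function(mus, Sigmas, mu0, Sigma0) {
  K  <- length(mus)
  P0 <- solve(Sigma0)
  P  <- Reduce(`+`, lapply(Sigmas, solve)) - (K - 1) * P0
  h  <- Reduce(`+`, Map(function(m, S) solve(S, m), mus, Sigmas)) - (K - 1) * drop(P0 %*% mu0)
  list(mean = drop(solve(P, h)), cov = solve(P))
}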
9. Conclusion
We propose a model reparametrization approach to improve VB inference for hierarchical models, where the local variables are transformed via an affine transformation to minimize their posterior dependence on the global variables. The resulting Gaussian variational approximation (obtained using stochastic gradient ascent) is low-dimensional and can be readily extended to large datasets using a "divide and recombine" strategy via parallel processing. In the application to GLMMs, we find that RVB1, which relies on data-based estimates of the linear predictors, works well for Poisson models with large counts and for binomial outcomes where the success probabilities are not concentrated at 0 or 1. Binary data are more challenging to fit, and RVB2, which searches for the conditional posterior modes of the random effects, works better for these models. The performance of MCMC in Stan relies heavily on the model parametrization, both in terms of mixing and efficiency, and it is not always easy to find a parametrization that works well in both respects. INLA is very fast and produces very good approximations of the marginal posteriors of the global parameters for Poisson and binomial models, but its performance is more varied for binary data and RVB2 can sometimes perform better. In addition, RVB provides an approximation of the full joint posterior distribution, which can be very useful. Compared to GVA, RVB often yields improvements in both convergence rate and accuracy of posterior approximation. The results obtained for GLMMs are overall very promising, and it will be interesting to investigate the performance of RVB for more complex models.
References
Attias, H. (1999). Inferring parameters and structure of latent variable models by variational Bayes.In K. Laskey and H. Prade (Eds.),
Proceedings of the 15th Conference on Uncertainty in ArtificialIntelligence , San Francisco, CA, pp. 21–30. Morgan Kaufmann.Blei, D. M., A. Kucukelbir, and J. D. McAuliffe (2017). Variational inference: A review for statisticians.
Journal of the American Statistical Association 112 , 859–877.Breslow, N. E. and D. G. Clayton (1993). Approximate inference in generalized linear mixed models.
Journal of the American Statistical Association 88, 9–25. Broderick, T., N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan (2013). Streaming variational Bayes. In
Proceedings of the 26th International Conference on Neural Information Processing Systems- Volume 2 , NIPS’13, USA, pp. 1727–1735. Curran Associates Inc.Carpenter, B., A. Gelman, M. Hoffman, D. Lee, B. Goodrich, M. Betancourt, M. Brubaker, J. Guo,P. Li, and A. Riddell (2017). Stan: A probabilistic programming language.
Journal of Statistical Software 76, 1–32. Christensen, O. F., G. O. Roberts, and M. Sköld (2006). Robust Markov chain Monte Carlo methods for spatial generalized linear mixed models.
Journal of Computational and Graphical Statistics 15 , 1–17.Cressie, N., C. A. Calder, J. S. Clark, J. M. V. Hoef, and C. K. Wikle (2009). Accounting for uncertaintyin ecological analysis: the strengths and limitations of hierarchical statistical modeling.
Ecological Applications 19, 553–570. Crowder, M. J. (1978). Beta-binomial ANOVA for proportions.
Journal of the Royal Statistical Society.Series C (Applied Statistics) 27 , 34–37.De Backer, M., C. De Vroey, E. Lesaffre, I. Scheys, and P. D. Keyser (1998). Twelve weeks of continuousoral therapy for toenail onychomycosis caused by dermatophytes: A double-blind comparative trialof terbinafine 250 mg/day versus itraconazole 200 mg/day.
Journal of the American Academy ofDermatology 38 , 57–63.Firth, D. (1993). Bias reduction of maximum likelihood estimates.
Biometrika 80 , 27–38.Gelfand, A. E., S. K. Sahu, and B. P. Carlin (1995). Efficient parametrisations for normal linear mixedmodels.
Biometrika 82 , 479–488.Gelfand, A. E., S. K. Sahu, and B. P. Carlin (1996). Efficient parametrisations for generalized linearmixed models. In J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith (Eds.),
BayesianStatistics 5 .Gelman, A. and J. Hill (2006).
Data Analysis Using Regression and Multilevel/Hierarchical Models .Analytical Methods for Social Research. Cambridge University Press. eparametrized variational Bayes Gershman, S., M. Hoffman, and D. Blei (2012). Nonparametric variational inference. In J. Langford andJ. Pineau (Eds.),
Proceedings of the 29th International Conference on Machine Learning , pp. 663–670.Hoffman, M. and D. Blei (2015). Stochastic structured variational inference. In G. Lebanon and S. Vish-wanathan (Eds.),
Proceedings of the Eighteenth International Conference on Artificial Intelligence andStatistics , Volume 38, pp. 361–369. JMLR Workshop and Conference Proceedings.Hoffman, M. D., D. M. Blei, C. Wang, and J. Paisley (2013). Stochastic variational inference.
Journalof Machine Learning Research 14 , 1303–1347.Hulley, S., D. Grady, T. Bush, and et al. (1998). Randomized trial of estrogen plus progestin for secondaryprevention of coronary heart disease in postmenopausal women.
JAMA 280 (7), 605–613.Jacobs, R. A., M. I. Jordan, S. J. Nowlan, and G. E. Hinton (1991, March). Adaptive mixtures of localexperts.
Neural Computation 3, 79–87. Kass, R. E. and R. Natarajan (2006). A default conjugate prior for variance components in generalized linear mixed models (comment on article by Browne and Draper).
Bayesian Analysis 1, 535–542. Kingma, D. P. and M. Welling (2014). Auto-encoding variational Bayes. In
Proceedings of the 2ndInternational Conference on Learning Representations (ICLR) .Kucukelbir, A., D. Tran, R. Ranganath, A. Gelman, and D. M. Blei (2016). Automatic differentiationvariational inference. arXiv: 1603.00788.Lee, C. Y. Y. and M. P. Wand (2016). Variational methods for fitting complex bayesian mixed effectsmodels to health data.
Statistics in Medicine 35 , 165–188.Liu, X. (2016).
Methods and Applications of Longitudinal Data Analysis. Oxford: Academic Press. Lunn, D., D. Spiegelhalter, A. Thomas, and N. Best (2009). The BUGS project: Evolution, critique and future directions.
Statistics in Medicine 28 , 3049–3067.Magnus, J. R. and H. Neudecker (1980). The elimination matrix: Some lemmas and applications.
SIAMJournal on Algebraic Discrete Methods 1 , 422–449.Magnus, J. R. and H. Neudecker (1999).
Matrix differential calculus with applications in statistics andeconometrics (3rd ed.). New York: John Wiley & Sons.McCormick, T. H., C. Rudin, and D. Madigan (2012). Bayesian hierarchical rule modeling for predictingmedical conditions.
The Annals of Applied Statistics 6 , 652–668.Murray, I. (2016). Differentiation of the Cholesky decomposition. arXiv:1602.07527.Ormerod, J. T. and M. P. Wand (2010). Explaining variational approximations.
The American Statistician 64, 140–153. Ormerod, J. T. and M. P. Wand (2012). Gaussian variational approximate inference for generalized linear mixed models.
Journal of Computational and Graphical Statistics 21 , 2–17.Paisley, J. W., D. M. Blei, and M. I. Jordan (2012). Variational Bayesian inference with stochasticsearch. In
Proceedings of the 29th International Conference on Machine Learning (ICML-12). Papaspiliopoulos, O., G. O. Roberts, and M. Sköld (2003). Non-centered parameterisations for hierarchical models and data augmentation. In J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West (Eds.),
Bayesian Statistics 7, pp. 307–326. New York: Oxford University Press. Papaspiliopoulos, O., G. O. Roberts, and M. Sköld (2007). A general framework for the parametrization of hierarchical models.
Statistical Science 22, 59–73. Rezende, D. J. and S. Mohamed (2015). Variational inference with normalizing flows. In F. Bach and D. Blei (Eds.),
Proceedings of The 32nd International Conference on Machine Learning , pp. 1530–1538.JMLR Workshop and Conference Proceedings.Rezende, D. J., S. Mohamed, and D. Wierstra (2014). Stochastic backpropagation and approximateinference in deep generative models. In E. P. Xing and T. Jebara (Eds.),
Proceedings of The 31stInternational Conference on Machine Learning , pp. 1278–1286. JMLR Workshop and ConferenceProceedings.Robbins, H. and S. Monro (1951). A stochastic approximation method.
The Annals of MathematicalStatistics 22 , 400–407.Roeder, G., Y. Wu, and D. K. Duvenaud (2017). Sticking the landing: Simple, lower-variance gradientestimators for variational inference. In I. Guyon, U. Luxburg, S. Bengio, H. Wallach, R. Fergus,S. Vishwanathan, and R. Garnett. (Eds.),
Advances in Neural Information Processing Systems 30. Rohde, D. and M. P. Wand (2016). Semiparametric mean field variational Bayes: General principles and numerical issues.
Journal of Machine Learning Research 17 , 1–47.Rue, H., S. Martino, and N. Chopin (2009). Approximate Bayesian inference for latent Gaussian modelsby using integrated nested Laplace approximations.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) 71, 319–392. Salimans, T. and D. A. Knowles (2013). Fixed-form variational posterior approximation through stochastic linear regression.
Bayesian Analysis 8 , 837–882.Spall, J. C. (2003).
Introduction to stochastic search and optimization: estimation, simulation andcontrol . New Jersey: Wiley.Tan, L. S. L. and D. J. Nott (2013). Variational inference for generalized linear mixed models usingpartially non-centered parametrizations.
Statistical Science 28 , 168–188.Tan, L. S. L. and D. J. Nott (2018). Gaussian variational approximation with sparse precision matrices.
Statistics and Computing 28 , 259–275.Thall, P. F. and S. C. Vail (1990). Some covariance models for longitudinal count data with overdisper-sion.
Biometrics 46 , 657–671.Titsias, M. and M. L´azaro-Gredilla (2014). Doubly stochastic variational Bayes for non-conjugate in-ference. In
Proceedings of the 31st International Conference on Machine Learning (ICML-14) , pp.1971–1979.Titsias, M. K. (2017). Learning model reparametrizations: Implicit variational inference by fitting mcmcdistributions. arXiv:1708.01529.Tran, M.-N., D. J. Nott, A. Y. C. Kuk, and R. Kohn (2016). Parallel variational bayes for large datasetswith an application to generalized linear mixed models.
Journal of Computational and GraphicalStatistics 25 (2), 626–646.Yu, Y. and X.-L. Meng (2011). To center or not to center: That is not the question—An ancillarity-sufficiency interweaving strategy (ASIS) for boosting MCMC efficiency.
Journal of Computational andGraphical Statistics 20 , 531–570.
Supplementary material
S1. Default conjugate prior
Kass and Natarajan (2006) propose a default conjugate inverse Wishart prior, IW(ρ, ρR), for the r × r covariance matrix Ω^{-1}, which is designed to be minimally informative by taking ρ to be small, while the scale matrix R is determined through first-stage data variability based on a prior guess for Ω^{-1}. They begin by considering the GLM obtained by pooling all the data and setting b_i = 0 for all i. Let φ denote the dispersion parameter and V(µ_ij) the variance function of the GLM. Suppose β̂ is an estimate of the regression coefficients of the GLM. Define W_i(β̂) as the n_i × n_i diagonal weight matrix of the GLM, whose jth diagonal entry is {φV(µ_ij)[g'(µ_ij)]²}^{-1} evaluated at µ_ij = µ̂_ij = g^{-1}(X_ij^T β̂). The authors suggest taking ρ = r and

R = n ( ∑_{i=1}^n Z_i^T W_i(β̂) Z_i )^{-1}.

For the Poisson and binomial models, the dispersion parameter is one. The jth diagonal entry of W_i(β̂) is µ̂_ij = exp(X_ij^T β̂) for the Poisson model and µ̂_ij(m_ij − µ̂_ij)/m_ij = m_ij exp(X_ij^T β̂){1 + exp(X_ij^T β̂)}^{-2} for the binomial model. See Table S1 for details. An inverse Wishart prior IW(ρ, ρR) for Ω^{-1} is equivalent to a Wishart prior W(ν, S) for Ω, where ν = ρ and S = R^{-1}/ρ. If r = 1, the Wishart prior W(ν, S) reduces to Gamma(ν/2, S^{-1}/2). When r = 2, INLA crashed when we set ν = 2, even though RStan and the variational methods worked well. Hence we modify the default conjugate prior slightly by setting ρ = r + 1 instead, so that ν = r + 1 and S = R^{-1}/(r + 1), and results are comparable with INLA.
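As a rough illustration of how R can be computed in practice, the sketch below does the calculation for a Poisson GLMM with log link; the function name default_scale_R and the list-based inputs are our own choices, not code from the paper.

# Kass-Natarajan default scale matrix R = n (sum_i Z_i' W_i(beta_hat) Z_i)^{-1},
# sketched for a Poisson GLMM with log link. y is a list of per-cluster response vectors
# and X_list, Z_list are lists of per-cluster design matrices (hypothetical inputs).
default_scale_R <- function(y, X_list, Z_list) {
  X  <- do.call(rbind, X_list)
  yy <- unlist(y)
  beta_hat <- coef(glm(yy ~ X - 1, family = poisson()))  # pooled GLM with all b_i = 0
  n <- length(X_list)
  S <- matrix(0, ncol(Z_list[[1]]), ncol(Z_list[[1]]))
  for (i in seq_len(n)) {
    w <- exp(drop(X_list[[i]] %*% beta_hat))             # Poisson weights: mu_hat_ij
    S <- S + t(Z_list[[i]]) %*% (w * Z_list[[i]])        # Z_i' W_i(beta_hat) Z_i
  }
  n * solve(S)
}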
Table S1.
Estimates of the mean, variance function, link function and derivative of the link function for computing the default conjugate prior.

Model                        µ̂_ij                                       V(µ_ij)                   g(µ_ij)                    g'(µ_ij)
Poisson(µ_ij)                exp(X_ij^T β̂)                              µ_ij                      log(µ_ij)                  1/µ_ij
Binomial(m_ij, µ_ij/m_ij)    m_ij exp(X_ij^T β̂)/{1 + exp(X_ij^T β̂)}     µ_ij(m_ij − µ_ij)/m_ij    log{µ_ij/(m_ij − µ_ij)}    m_ij/{µ_ij(m_ij − µ_ij)}

S2. Induced prior p(ω)
Recall that W_ii = exp(W*_ii), where W_ii > 0 and W*_ii ∈ R. Let D_W be a diagonal matrix of order r(r+1)/2 whose diagonal is v(J_W), where J_W is an r × r matrix with J_W,ii = W_ii and J_W,ij = 1 if i ≠ j. Let d denote the differential operator (see e.g. Magnus and Neudecker, 1999). Differentiating with respect to ω,

dvec(W) = E_r^T dv(W) = E_r^T D_W dω.

Since Ω = WW^T, differentiating with respect to ω,

dvec(Ω) = vec{(dW)W^T + W(dW)^T} = (W ⊗ I)dvec(W) + (I ⊗ W)K_r dvec(W)
        = 2N_r(W ⊗ I)dvec(W) = 2N_r(W ⊗ I)E_r^T D_W dω.
∴ ∂vec(Ω)/∂ω = 2D_W E_r(W^T ⊗ I)N_r.        (S1)

We have p(ω) = p(Ω)|∂v(Ω)/∂ω|. As v(Ω) = E_r vec(Ω), dv(Ω) = E_r dvec(Ω) = 2E_r N_r(W ⊗ I)E_r^T D_W dω from (S1). Hence ∂v(Ω)/∂ω = 2D_W E_r(W^T ⊗ I)N_r E_r^T and

|∂v(Ω)/∂ω| = 2^{r(r+1)/2} |D_W| |E_r(W^T ⊗ I)N_r E_r^T|.

We have |D_W| = ∏_{i=1}^r W_ii and

|E_r(W^T ⊗ I)N_r E_r^T| = |E_r(W^T ⊗ I)D_r E_r N_r E_r^T| = |E_r(W^T ⊗ I)D_r| |E_r N_r E_r^T| = 2^{-r(r-1)/2} ∏_{i=1}^r W_ii^{r-i+1}.

In evaluating the above expression, we have made use of the following results from Magnus and Neudecker (1980).
• Lemma 3.5 (ii): N_r = D_r E_r N_r.
• Lemma 3.4 (i): |E_r N_r E_r^T| = 2^{-r(r-1)/2}.
• Lemma 4.1 (iii): |E_r(W^T ⊗ I)D_r| = ∏_{i=1}^r W_ii^{r-i+1}.
Hence

log p(ω) = log p(Ω) + r log 2 + ∑_{i=1}^r (r − i + 2) log(W_ii).
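A quick numerical sanity check of this Jacobian (our own illustration, for r = 2, assuming the ordering ω = (W*_11, W_21, W*_22)) compares a finite-difference determinant of ∂v(Ω)/∂ω with the closed form 2^r ∏_i W_ii^{r−i+2}:

r <- 2
omega <- c(0.3, -0.5, 0.1)                     # (W*_11, W_21, W*_22)
make_Omega <- function(om) {
  W <- matrix(0, r, r)
  W[1, 1] <- exp(om[1]); W[2, 1] <- om[2]; W[2, 2] <- exp(om[3])
  W %*% t(W)                                   # Omega = W W^T
}
vech <- function(A) A[lower.tri(A, diag = TRUE)]
eps <- 1e-6
J <- sapply(1:3, function(k) {                 # finite-difference Jacobian of v(Omega) w.r.t. omega
  e <- replace(numeric(3), k, eps)
  (vech(make_Omega(omega + e)) - vech(make_Omega(omega - e))) / (2 * eps)
})
W11 <- exp(omega[1]); W22 <- exp(omega[3])
c(numerical = det(J), analytic = 2^r * W11^(r + 1) * W22^r)   # both equal 4 W11^3 W22^2 here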
S3. Proof of Lemma 1
Proof. If ∄ η_ij ∈ R such that p(y_ij | η_ij) attains its maximum value, then ∄ η_ij ∈ R such that h'_ij(η_ij) = y_ij. Since h'_ij(η_ij) is a continuous monotone increasing function (h''_ij(η_ij) = var(y_ij) ≥ 0), h'_ij(η_ij) must be either bounded below or bounded above by y_ij. If h'_ij(η_ij) is bounded below by y_ij, then ∃ y_ij ≤ L < ∞ such that lim_{η_ij→−∞} h'_ij(η_ij) = L. Otherwise, ∃ −∞ < L ≤ y_ij such that lim_{η_ij→∞} h'_ij(η_ij) = L. Thus we have lim_{η_ij→c} h'_ij(η_ij) = L, where c = −∞ or c = ∞. By L'Hôpital's rule (lim_{x→c} f(x)/g(x) = lim_{x→c} f'(x)/g'(x) if lim_{x→c} |g(x)| = ∞),

0 = lim_{η_ij→c} h'_ij(η_ij)/η_ij = lim_{η_ij→c} h''_ij(η_ij),
L = lim_{η_ij→c} h'_ij(η_ij)η_ij/η_ij = lim_{η_ij→c} {h''_ij(η_ij)η_ij + h'_ij(η_ij)}.

Hence, lim_{η_ij→c} h''_ij(η_ij) = 0 and lim_{η_ij→c} h''_ij(η_ij)η_ij = 0. Since var(y_ij) → 0 as η_ij → c, p(y_ij | η_ij) approaches a degenerate distribution with support only on L (∵ E(y_ij) → L). Since log p(y_ij | η_ij) is concave in η_ij and the likelihood of observing y_ij is maximized as η_ij → c, L = y_ij and lim_{η_ij→c} h'_ij(η_ij) = y_ij.
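As a concrete instance (our illustration, not part of the original proof): for a Bernoulli response with the canonical logit link, h_ij(η) = log(1 + e^η), h'_ij(η) = e^η/(1 + e^η) ∈ (0, 1) and h''_ij(η) = e^η/(1 + e^η)². If y_ij = 1, the likelihood increases monotonically in η, so no finite maximizer η̂_ij^ML exists and h'_ij(η) is bounded above by y_ij = 1; as η → ∞ we have h'_ij(η) → 1 = y_ij, h''_ij(η) → 0 and h''_ij(η)η → 0, exactly as the lemma asserts.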
S4. Proof of Theorem 1
Proof.
First, Λ_i is well-defined for any value of η̂_i since h''_ij(η̂_ij) ≥ 0. The jth element of ĝ_i + Ĥ_i(η̂_i − X_iβ) is

y_ij − h'_ij(η̂_ij) + h''_ij(η̂_ij)η̂_ij − h''_ij(η̂_ij)X_ij^T β,        (S2)

which is well-defined if η̂_ij^ML exists. If η̂_ij^ML does not exist, then from Lemma 1, (S2) has a limit of zero as η̂_ij → c_ij. Thus the limit of λ_i exists as η̂_i → c_i. If η̂_ij^ML does not exist for j = 1, ..., n_i, then lim_{η̂_i→c_i} {ĝ_i + Ĥ_i(η̂_i − X_iβ)} = 0, which implies that lim_{η̂_i→c_i} λ_i = 0.

S5. Unbiased estimates of stochastic gradients
As θ̃ = Cs + µ, differentiating ℓ(θ̃) with respect to µ and v(C) separately,

d{ℓ(θ̃)} = ∇_θ̃ ℓ(θ̃)^T dµ,
d{ℓ(θ̃)} = ∇_θ̃ ℓ(θ̃)^T (dC)s = (s^T ⊗ ∇_θ̃ ℓ(θ̃)^T)dvec(C) = vec(∇_θ̃ ℓ(θ̃)s^T)^T E_d^T dv(C) = v(∇_θ̃ ℓ(θ̃)s^T)^T dv(C).

Therefore ∇_µ ℓ(θ̃) = ∇_θ̃ ℓ(θ̃) and ∇_{v(C)} ℓ(θ̃) = v(∇_θ̃ ℓ(θ̃)s^T). In addition,

d log|C| = tr(C^{-1}dC) = vec(C^{-T})^T E_d^T dv(C) = v(C^{-T})^T dv(C).

Hence ∇_{v(C)} log|C| = v(C^{-T}). For the second estimator, differentiating log q(θ̃ | µ, C) with respect to θ̃,

d log q(θ̃ | µ, C) = −(θ̃ − µ)^T C^{-T}C^{-1} dθ̃ = −s^T C^{-1} dθ̃.

Hence ∇_θ̃ log q(θ̃ | µ, C) = −C^{-T}s. Differentiating with respect to C,

−d{(θ̃ − µ)^T C^{-T}C^{-1}(θ̃ − µ)} = s^T{(dC^T)C^{-T} + C^{-1}(dC)}s
  = {(s^T C^{-1} ⊗ s^T)K_d + (s^T ⊗ s^T C^{-1})}dvec(C)
  = {vec(ss^T C^{-1})^T K_d + vec(C^{-T}ss^T)^T}dvec(C)
  = 2 vec(C^{-T}ss^T)^T E_d^T dv(C)
  = 2 v(C^{-T}ss^T)^T dv(C).

Hence ∇_{v(C)}{log q(θ̃ | µ, C)} = v(C^{-T}ss^T − C^{-T}).
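To make the use of these gradients concrete, the toy sketch below (our illustration, not the paper's algorithm for GLMMs) runs stochastic gradient ascent on µ and a lower-triangular C for a simple Gaussian target, using the first estimator: ∇_µ = ∇_θ̃ log p(y, θ̃) and ∇_{v(C)} = v(∇_θ̃ log p(y, θ̃)s^T + C^{-T}). The target, step size and iteration count are arbitrary choices.

set.seed(1)
# Toy target: gradient of log p(theta) is -Sigma^{-1}(theta - m) with Sigma = diag(2, 0.5), m = (1, -1).
log_p_grad <- function(theta) -c(1 / 2, 2) * (theta - c(1, -1))
d <- 2
mu <- rep(0, d); C <- diag(d)           # q(theta) = N(mu, C C^T), C lower triangular
rho <- 0.01                             # step size
for (t in 1:5000) {
  s <- rnorm(d)
  theta <- drop(C %*% s) + mu           # reparametrization: theta = C s + mu
  g <- log_p_grad(theta)
  dC <- g %*% t(s) + t(solve(C))        # gradient w.r.t. C (model term plus log|C| term)
  dC[upper.tri(dC)] <- 0                # only the lower triangle v(C) is updated
  mu <- mu + rho * g
  C  <- C + rho * dC
}
mu; C %*% t(C)                          # should approach (1, -1) and diag(2, 0.5)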
S6. Proof of Theorem 2
Definition 2.
For any square matrix A , let k ( · ) denote a function such that k ( A ) = ¯ A − dg( A ) . Recall that dg( A ) denotes the diagonal matrix derived from A by setting non-diagonalelements to zero and ¯ A denotes the lower triangular matrix derived from A by settingall superdiagonal elements to zero. Lemma S1.
Consider the differential with respect to θ G . Let A i = L − i dΛ i L − Ti for i = 1 , . . . , n . Then(a) d L i = L i k ( A i ) ,(b) vec { k ( A i ) } = E Tr D Tr ( L − i ⊗ L − i )dvec(Λ i ) ,(c) tr( L − i d L i ) = vec(Λ − i ) T dvec(Λ i ) ,(d) a Ti d b i = vec( L − Ti ˜ B i L − i ) T dvec(Λ i ) + a Ti d λ i . L. S. L. Tan
Proof.
For (a), to find the differential of the Cholesky factor L i , we differentiate L i L Ti = Λ i and then multiply by L − i on the left and L − Ti on the right (Murray, 2016):(d L i ) L Ti + L i (d L i ) T = dΛ i ,L − i d L i + (d L i ) T L − Ti = L − i dΛ i L − Ti . On the left-hand side, the first term is lower triangular and the second term is uppertriangular (transpose of the first term). The term on the right-hand side is A i . Thus, L − i d L i = k ( A i ) which implies that d L i = L i k ( A i ).For (b), we make use of some results from Magnus and Neudecker (1980): • Lemma 3.3 (i): vec( ¯ A i ) = E Tr E r vec( A i ) • Lemma 3.3 (iv): vec(dg( A i )) = E Tr E r K r E Tr E r vec( A i ). • Lemma 3.4 (ii) and Lemma 3.6 (iii): 2 I r ( r +1) / − E r K r E Tr = D Tr D r . • Lemma 4.4 (i): D r E r ( L − i ⊗ L − i ) D r = ( L − i ⊗ L − i ) D r .Combining these results with the Kronecker product property that vec( A i ) = ( L − i ⊗ L − i )dvec(Λ i ), we havevec { k ( A i ) } = vec( ¯ A i ) − vec { dg( A i ) } = E Tr E r vec( A i ) − E Tr E r K r E Tr E r vec( A i )= E Tr (2 I r ( r +1) / − E r K r E Tr ) E r vec( A i )= E Tr D Tr D r E r ( L − i ⊗ L − i )dvec(Λ i )= E Tr D Tr D r E r ( L − i ⊗ L − i ) D r d v (Λ i )= E Tr D Tr ( L − i ⊗ L − i )dvec(Λ i ) . For (c), we can use the result from Lemma S1 (a) to obtaintr( L − i d L i ) = tr( k ( A i )) = tr( A i ) = tr( L − Ti L − i dΛ i ) = vec(Λ − i ) T dvec(Λ i ) . For (d), differentiating b i = L i ˜ b i + λ i with respect to θ G , we have d b i = (d L i )˜ b i + d λ i .Hence a Ti d b i = a Ti (d L i )˜ b i + a Ti d λ i . Note that ( u ⊗ v ) = vec( vu T ) if u and v are vectorsand recall that B i = L Ti a i ˜ b Ti , ˜ B i = ¯ B i + ¯ B Ti − dg( B i ). From Lemma S1 (a) and (b), a Ti (d L i )˜ b i = a Ti L i k ( A i )˜ b i = (˜ b i ⊗ L Ti a i ) T vec( k ( A i ))= vec( B i ) T E Tr D Tr ( L − i ⊗ L − i )dvec(Λ i )= { ( L − Ti ⊗ L − Ti ) D r v ( B i ) } T dvec(Λ i )= vec( L − Ti ˜ B i L − i ) T dvec(Λ i ) . Proof (Theorem 2).
Note that b i = L i ˜ b i + λ i . Differentiating ℓ (˜ θ ) with respect to˜ b i , we obtain d ℓ (˜ θ ) = {∇ b i log p ( y i , b i | θ G ) } T d b i = a Ti L i d˜ b i = ⇒ ∇ ˜ b i ℓ (˜ θ ) = L Ti a i . eparametrized variational Bayes Differentiating ℓ (˜ θ ) with respect to θ G m for m = 1 , . . . , M ,d ℓ (˜ θ ) = n X i =1 [ a Ti d b i + {∇ θ Gm log p ( y i , b i | θ G ) } T d θ G m + tr( L − i d L i )]+ {∇ θ Gm log p ( θ G ) } T d θ G m . The first two terms are obtained using chain rule as log p ( y i , b i | θ G ) depends on θ G directlyas well as through b i . Substituting the expressions of tr( L − i d L i ) and a Ti d b i from LemmaS1 (c) and (d) respectively, we obtaind ℓ (˜ θ ) = n X i =1 [ a Ti d λ i + vec( L − Ti ˜ B i L − i ) T dvec(Λ i ) + vec(Λ − i ) T dvec(Λ i )]+ n X i =1 {∇ θ Gm log p ( y i , b i | θ G ) } T d θ G m + {∇ θ Gm log p ( θ G ) } T d θ G m . ∴ ∇ θ Gm ℓ (˜ θ ) = n X i =1 [( ∇ θ Gm λ i ) a i + {∇ θ Gm vec(Λ i ) } vec( L − Ti ˜ B i L − i + Λ − i )]+ n X i =1 {∇ θ Gm log p ( y i , b i | θ G ) } + {∇ θ Gm log p ( θ G ) } . S7. Proof of Lemma 2
Proof.
We differentiate each expression below with respect to ω . For (a),d log p ( y i , b i | θ G ) = d[log | Ω | − b Ti Ω b i ]= [tr(Ω − dΩ) − tr( b i b Ti dΩ)]= [vec(Ω − ) − vec( b i b Ti )] T dvec(Ω)= vec(Ω − − b i b Ti ) T N r ( W ⊗ I ) E Tr D W d ω ∴ ∇ ω log p ( y i , b i | θ G ) = D W E r ( W T ⊗ I )vec(Ω − − b i b Ti )= D W E r vec(Ω − W − b i b Ti W )= D W v ( W − T − b i b Ti W ) . For (b), recall thatlog p ( ω ) = ν − r − log | Ω | − tr( S − Ω) + r X i =1 ( r − i + 2) log( W ii ) + C , where C is a constant independent of ω . Henced log p ( θ G ) = ν − r − tr(Ω − dΩ) − tr( S − dΩ) + r X i =1 ( r − i + 2)d W ∗ ii = vec { ( ν − r − − − S − } N r ( W ⊗ I ) E Tr D W d ω + v (diag( u )) T d ω L. S. L. Tan where u is a vector of length r where the i th element is r − i + 2. ∇ ω log p ( θ G ) = D W E r ( W T ⊗ I )vec { ( ν − r − − − S − } + v (diag( u ))= D W v { ( ν − r − W − T − S − W } + v (diag( u )) . S8. Proof of Lemma 3
Proof.
In the first approach, we haveΛ i = (Ω + Z Ti H i (ˆ η i ) Z i ) − , λ i = Λ i Z Ti { y i − g i (ˆ η i ) + H i (ˆ η i )(ˆ η i − X i β ) } . Differentiating each of the above expressions with respect to β , ∇ β vec(Λ i ) = 0 sinceΛ i is independent of β andd λ i = − Λ i Z Ti H i (ˆ η i ) X i d β = ⇒ ∇ β λ i = − X Ti H i (ˆ η i ) Z i Λ i . We also have ∇ β log p ( θ G ) = − β/σ β and ∇ β log p ( y i , b i | θ G ) = X Ti ( y i − g i ( η i )). Pluggingthese into Theorem 2 ∇ β ℓ (˜ θ ) = − n X i =1 X Ti H i (ˆ η i ) Z i Λ i a i + n X i =1 X Ti ( y i − g i ( η i )) − β/σ β = n X i =1 X Ti ( y i − g i ( η i ) − H i (ˆ η i ) Z i Λ i a i ) − β/σ β . Differentiating λ i and Λ i with respect to ω , we havedvec(Λ i ) = − vec { Λ i (dΩ)Λ i } = − (Λ i ⊗ Λ i )dvec(Ω)= − i ⊗ Λ i ) N r ( W ⊗ I ) E Tr D W d ω = − N r (Λ i W ⊗ Λ i ) E Tr D W d ω. ∴ ∇ ω vec(Λ i ) = − D W E r ( W T Λ i ⊗ Λ i ) N r . In addition, d λ i = (dΛ i ) Z Ti { y i − g i (ˆ η i ) + H i (ˆ η i )(ˆ η i − X i β ) } = − Λ i (dΩ)Λ i Z Ti { y i − g i (ˆ η i ) + H i (ˆ η i )(ˆ η i − X i β ) } = − Λ i (dΩ) λ i = − ( λ Ti ⊗ Λ i )2 N r ( W ⊗ I ) E Tr D W d ω = −{ ( λ Ti ⊗ Λ i ) K r + ( λ Ti ⊗ Λ i ) } ( W ⊗ I ) E Tr D W d ω = − (Λ i ⊗ λ Ti + λ Ti ⊗ Λ i )( W ⊗ I ) E Tr D W d ω = − (Λ i W ⊗ λ Ti + λ Ti W ⊗ Λ i ) E Tr D W d ω. ∴ ∇ ω λ i = − D W E r ( W T Λ i ⊗ λ i + W T λ i ⊗ Λ i ) . eparametrized variational Bayes Plugging these results and Lemma 2 into Theorem 2, ∇ ω ℓ (˜ θ ) = D W n X i =1 v ( W − T − b i b Ti W ) − D W E r n X i =1 ( W T Λ i ⊗ λ i + W T λ i ⊗ Λ i ) a i − D W E r n X i =1 ( W T Λ i ⊗ Λ i )vec(Λ − i + L − Ti ˜ B i L − i ) + v (diag( u ))+ D W v { ( ν − r − W − T − S − W } = v (diag( u )) + D W v (cid:2) ( n + ν − r − W − T − S − W + n X i =1 ( b i b Ti + λ i a Ti Λ i + Λ i a i λ Ti + Λ i + L i ˜ B i L Ti ) W (cid:3) . S9. Proof of Lemma 4
In the second approach, we consider a Taylor expansion oflog p ( b i | θ G , y i ) = y Ti Z i b i − n i X j =1 h ij ( X Tij β + Z Tij b i ) − b Ti Ω b i + C , about ˆ b i , which satisfies Z Ti ( y i − g i ( X i β + Z i ˆ b i )) = Ωˆ b i , (S3)and λ i = ˆ b i , Λ i = { Z Ti H i ( X i β + Z i ˆ b i ) Z i + Ω } − . Differentiating (S3) w.r.t. β implicitly, − Z Ti H i ( X i β + Z i ˆ b i )( X i d β + Z i dˆ b i ) = Ωdˆ b i − Z Ti H i ( X i β + Z i ˆ b i ) X i d β = Λ − i dˆ b i = ⇒ ∇ β λ i = ∇ β ˆ b i = − X Ti H i ( X i β + Z i ˆ b i ) Z i Λ i . Differentiating Λ i w.r.t. β ,dΛ i = − Λ i Z Ti diag { h ′′′ i ( X i β + Z i ˆ b i ) ⊙ ( X i d β + Z i dˆ b i ) } Z i Λ i dΛ i = − Λ i Z Ti diag { h ′′′ i ( X i β + Z i ˆ b i ) ⊙ ( X i d β + Z i ( ∇ β ˆ b i ) T d β ) } Z i Λ i dvec(Λ i ) = − (Λ i Z Ti ⊗ Λ i Z Ti ) E Tn i v [diag { h ′′′ i ( X i β + Z i ˆ b i ) ⊙ ( X i + Z i ( ∇ β ˆ b i ) T )d β } ] ∇ β vec(Λ i ) = − U Ti E n i ( Z i Λ i ⊗ Z i Λ i )where U i is a n i ( n i + 1) / × p matrix where each row corresponds to an element of v ( I n i ) and only the rows corresponding to the diagonal elements are nonzero. The rowcorresponding to the j th diagonal element is given by h ′′′ ij ( X Tij β + Z Tij ˆ b i )( X Tij + Z Tij ( ∇ β ˆ b i ) T ). L. S. L. Tan
Let α i = h ′′′ i ( X i β + Z i ˆ b i ) ⊙ diag { Z i (Λ i + L i ˜ B i L Ti ) Z Ti } . Substituting these results inTheorem 2, ∇ β ℓ (˜ θ ) = − n X i =1 X Ti H i ( X i β + Z i ˆ b i ) Z i Λ i a i + n X i =1 X Ti ( y i − g i ( η i )) − n X i =1 U Ti E n i ( Z i Λ i ⊗ Z i Λ i )vec(Λ − i + L − Ti ˜ B i L − i ) − β/σ β = n X i =1 (cid:2) X Ti { y i − g i ( η i ) − H i ( X i β + Z i ˆ b i ) Z i Λ i a i } − U Ti v { Z i (Λ i + L i ˜ B i L Ti ) Z Ti } (cid:3) − β/σ β = n X i =1 X Ti { y i − g i ( η i ) − H i ( X i β + Z i ˆ b i ) Z i Λ i a i } − β/σ β − n X i =1 ( X Ti + ∇ β ˆ b i Z Ti )[ h ′′′ i ( X i β + Z i ˆ b i ) ⊙ diag { Z i (Λ i + L i ˜ B i L Ti ) Z Ti } ]= n X i =1 X Ti { y i − g i ( η i ) − H i ( X i β + Z i ˆ b i ) Z i Λ i ( a i − Z Ti α i ) − α i } − β/σ β , Differentiating (S3) w.r.t. ω implicitly, − Z Ti H i ( X i β + Z i ˆ b i ) Z i dˆ b i = dΩˆ b i + Ωdˆ b i dˆ b i = − Λ i dΩˆ b i = − (ˆ b Ti ⊗ Λ i )dvec(Ω) = − (ˆ b Ti ⊗ Λ i )2 N r ( W ⊗ I ) E Tr D W d ω dˆ b i = − (ˆ b Ti ⊗ Λ i + Λ i ⊗ ˆ b Ti )( W ⊗ I ) E Tr D W d ω dˆ b i = − (ˆ b Ti W ⊗ Λ i + Λ i W ⊗ ˆ b Ti ) E Tr D W d ω = ⇒ ∇ ω λ i = ∇ ω ˆ b i = − D W E r ( W T ˆ b i ⊗ Λ i + W T Λ i ⊗ ˆ b i )Differentiating Λ i w.r.t. ω ,dΛ i = − Λ i [ Z Ti diag { h ′′′ ( X i β + Z i ˆ b i ) ⊙ Z i dˆ b i } Z i + dΩ]Λ i dΛ i = − Λ i Z Ti diag { h ′′′ ( X i β + Z i ˆ b i ) ⊙ Z i ( ∇ ω ˆ b i ) T d ω } Z i Λ i − Λ i dΩΛ i dvec(Λ i ) = − (Λ i Z Ti ⊗ Λ i Z Ti ) E Tn i d v [diag { h ′′′ ( X i β + Z i ˆ b i ) ⊙ Z i ( ∇ ω ˆ b i ) T d ω } ] − (Λ i ⊗ Λ i )2 N r ( W ⊗ I ) E Tr D W d ω dvec(Λ i ) = − (Λ i Z Ti ⊗ Λ i Z Ti ) E Tn i d v [diag { h ′′′ ( X i β + Z i ˆ b i ) ⊙ Z i ( ∇ ω ˆ b i ) T d ω } ] − N r (Λ i W ⊗ Λ i ) E Tr D W d ω ∇ ω vec(Λ i ) = − V Ti E n i ( Z i Λ i ⊗ Z i Λ i ) − D W E r ( W T Λ i ⊗ Λ i ) N r where V i is a n i ( n i + 1) / × r matrix where each row corresponds to an element of v ( I n i ) and only the rows corresponding to the diagonal elements are nonzero. The rowcorresponding to the j th diagonal element is given by h ′′′ ( X Tij β + Z Tij ˆ b i ) Z Tij ( ∇ ω ˆ b i ) T . eparametrized variational Bayes Hence we have {∇ ω vec(Λ i ) } vec(Λ − i + L − Ti ˜ B i L − i )= {− V Ti E n i ( Z i Λ i ⊗ Z i Λ i ) − D W E r ( W T Λ i ⊗ Λ i ) N r } vec(Λ − i + L − Ti ˜ B i L − i )= − V Ti v ( Z i (Λ i + L i ˜ B i L Ti ) Z Ti ) − D W v { (Λ i + L i ˜ B i L Ti ) W } = − ∇ ω ˆ b i Z Ti [ h ′′′ ( X i β + Z i ˆ b i ) ⊙ diag { Z i (Λ i + L i ˜ B i L Ti ) Z Ti } ] − D W v { (Λ i + L i ˜ B i L Ti ) W } = D W E r ( W T ˆ b i ⊗ Λ i + W T Λ i ⊗ ˆ b i ) Z Ti α i − D W v { (Λ i + L i ˜ B i L Ti ) W } = D W v { (Λ i Z Ti α i ˆ b Ti + ˆ b i α Ti Z i Λ i − Λ i − L i ˜ B i L Ti ) W } . Substituting these results into Theorem 2, ∇ ω ℓ (˜ θ ) = v (diag( u )) + D W v { ( ν − r − W − T − S − W } + D W n X i =1 v { (Λ i Z Ti α i ˆ b Ti + ˆ b i α Ti Z i Λ i − Λ i − L i ˜ B i L Ti ) W } + D W n X i =1 v ( W − T − b i b Ti W ) − D W E r n X i =1 ( W T ˆ b i ⊗ Λ i + W T Λ i ⊗ ˆ b i ) a i = v (diag( u )) + D W v { ( n + ν − r − W − T − S − W } + D W n X i =1 v { (Λ i Z Ti α i ˆ b Ti + ˆ b i α Ti Z i Λ i − Λ i − L i ˜ B i L Ti − b i b Ti − Λ i a i ˆ b Ti − ˆ b i a Ti Λ i ) W } = v (diag( u )) + D W v { ( n + ν − r − W − T − S − W − n X i =1 (Λ i ( a i − Z Ti α i )ˆ b Ti + ˆ b i ( a i − Z Ti α i ) T Λ i + Λ i + L i ˜ B i L Ti + b i b Ti ) W }}