Variational Bayes for Gaussian Factor Models under the Cumulative Shrinkage Process
Sirio Legramanti
Department of Decision Sciences, Bocconi University, 20136 Milan, Italy [email protected]
Abstract
The cumulative shrinkage process is an increasing shrinkage prior that can be employed within models in which additional terms are supposed to play a progressively negligible role. A natural application is to Gaussian factor models, where such a process has proved effective in inducing parsimonious representations while providing accurate inference on the data covariance matrix. The cumulative shrinkage process came with an adaptive Gibbs sampler that tunes the number of latent factors across iterations, which makes it faster than the non-adaptive Gibbs sampler. In this work we propose a variational algorithm for Gaussian factor models endowed with a cumulative shrinkage process. Such a strategy provides comparable inference with respect to the adaptive Gibbs sampler and further reduces runtime.
Keywords: Shrinkage prior; Spike and slab; Stick-breaking representation
1 Introduction

The cumulative shrinkage process [5] is an increasing shrinkage prior based on a sequence of spike-and-slab distributions, with growing mass assigned to the spike. It can be defined for both countable and finite sequences. A definition for the countable case can be found in [5], while the following is a definition for finite sequences.
Definition 1.1
We say that $\{\theta_h \in \Theta \subseteq \mathbb{R} : h = 1, \ldots, H\}$ is distributed according to a cumulative shrinkage process with shrinkage parameter $\alpha > 0$, slab $P_0$ and spike $P_\infty$ if, conditionally on $\{\pi_h \in (0, 1) : h = 1, \ldots, H\}$, each $\theta_h$ is independent and

$(\theta_h \mid \pi_h) \sim (1 - \pi_h)\, P_0 + \pi_h\, P_\infty \qquad (h = 1, \ldots, H),$   (1)

where $\pi_h = \sum_{l=1}^{h} \omega_l$ for $h = 1, \ldots, H$ and $\omega_l = v_l \prod_{m=1}^{l-1} (1 - v_m)$ for $l = 1, \ldots, H$, with $v_1, \ldots, v_{H-1}$ independent $\mathrm{Beta}(1, \alpha)$ random variables and $v_H = 1$.

This construction, based on the stick-breaking representation of the Dirichlet process [4], implies that the sequence $\pi_h$ is non-decreasing and that $\pi_H = 1$. The cumulative shrinkage process can be used in a variety of models, e.g. Poisson factorization [3], but here we focus on Gaussian factor models, which are ubiquitous in statistics and were used as an illustrative example in [5]. In [5], posterior inference for this model under the cumulative shrinkage process is carried out through an adaptive Gibbs sampler which tunes $H$ as it progresses. This algorithm, together with the ability of the prior to favor the recovery of the number of active latent factors, allows for a reduced runtime with respect to the non-adaptive Gibbs sampler. However, the increasing availability of large datasets demands even faster algorithms. This need for scalability has pushed Bayesian statisticians towards approximate methods for posterior inference, including the Laplace approximation, variational Bayes and expectation propagation [2]. In this work we employ mean-field variational Bayes, which is straightforward to derive for Gaussian factor models under a convenient specification of the cumulative shrinkage process that slightly differs from the one in [5]. Such a specification is detailed in § 2, while the variational approximation is described in § 3.
Finally, in § 4 we compare the two approaches on a real dataset.

2 Gaussian factor models under the cumulative shrinkage process

We focus on learning the structure of the $p \times p$ covariance matrix $\Omega = \Lambda \Lambda^T + \Sigma$ of the data $y_i \in \mathbb{R}^p$ from the Gaussian factor model $y_i = \Lambda \eta_i + \epsilon_i$ $(i = 1, \ldots, n)$, where $\Lambda = [\lambda_{jh}] \in \mathbb{R}^{p \times H}$, $\eta_i \sim N_H(0, I_H)$, $\epsilon_i \sim N_p(0, \Sigma)$ and $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_p^2)$. As priors, we let $\sigma_j^2 \sim \mathrm{InvGa}(a_\sigma, b_\sigma)$ for $j = 1, \ldots, p$ and, differently from [5], we place a cumulative shrinkage process directly on the loadings, with $\pi_h$ as in Definition 1.1:

$(\lambda_{jh} \mid \pi_h) \sim (1 - \pi_h)\, N(0, \theta_0) + \pi_h\, N(0, \theta_\infty) \qquad (j = 1, \ldots, p;\ h = 1, \ldots, H).$   (2)

This simpler specification facilitates the derivation of the variational algorithm, while preserving the increasing shrinkage property. In fact, setting $\theta_0 > \theta_\infty$, the loadings are increasingly shrunk towards zero in probability, i.e. $\mathrm{pr}\{|\lambda_{j,h+1}| < \epsilon\} \geq \mathrm{pr}\{|\lambda_{jh}| < \epsilon\}$ for any $\epsilon > 0$, encoding the prior assumption that additional factors provide a decreasing contribution to the model.
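To make the construction concrete, the following minimal NumPy sketch simulates the stick-breaking weights of Definition 1.1 and then data from the factor model above, using column-wise spike indicators; all sizes and hyperparameter values are illustrative choices of ours, not those used in § 4.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, H = 100, 10, 6                        # illustrative sizes
    alpha, theta0, theta_inf = 5.0, 1.0, 1e-4   # illustrative slab/spike variances
    a_sig, b_sig = 1.0, 0.3                     # illustrative InvGa hyperparameters

    # Stick-breaking weights of Definition 1.1: v_1,...,v_{H-1} ~ Beta(1, alpha), v_H = 1
    v = np.append(rng.beta(1.0, alpha, size=H - 1), 1.0)
    omega = v * np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    pi = np.cumsum(omega)                       # non-decreasing, with pi[-1] == 1

    # Spike-and-slab loadings as in (2): column h falls in the spike with probability pi_h
    spiked = rng.random(H) < pi
    Lam = rng.standard_normal((p, H)) * np.where(spiked, np.sqrt(theta_inf), np.sqrt(theta0))

    # Gaussian factor model: y_i = Lam @ eta_i + eps_i, with sigma_j^2 ~ InvGa(a_sig, b_sig)
    sigma2 = 1.0 / rng.gamma(a_sig, 1.0 / b_sig, size=p)
    eta = rng.standard_normal((n, H))
    y = eta @ Lam.T + rng.standard_normal((n, p)) * np.sqrt(sigma2)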
However, setting both the spike and the slab to Gaussians is suboptimal with respect to the specification in [5], where the Student-t slab is more differentiated from the Gaussian spike, thus facilitating the separation of active and inactive factors.

The derivation of the variational algorithm is further facilitated by the introduction of the augmented data $z_h = (z_{h1}, \ldots, z_{hH}) \sim \mathrm{Mult}\{1, (\omega_1, \ldots, \omega_H)\}$, which exploits the fact that equation (2) can be obtained by marginalizing out $z_h$ from

$(\lambda_{jh} \mid z_h) \sim \{1 - \sum_{l=1}^{h} z_{hl}\}\, N(0, \theta_0) + \{\sum_{l=1}^{h} z_{hl}\}\, N(0, \theta_\infty) \qquad (j = 1, \ldots, p;\ h = 1, \ldots, H).$

3 Variational inference

Variational Bayes approximates the posterior density with the density $q^*$ that is closest to it, in Kullback-Leibler (KL) divergence, within a family $\mathcal{Q}$ of tractable densities (see [1] for a review). The ideal variational family $\mathcal{Q}$ should combine flexibility, which allows for a good approximation, and tractability. Here we use the mean-field variational family, whose elements factorize as follows:

$q(\lambda, \eta, \sigma, z, v) = q(\lambda)\, q(\eta)\, q(\sigma)\, q(z)\, q(v).$   (3)

The KL divergence between such a $q$ and the intractable posterior cannot be computed or minimized directly. Equivalently, we maximize the evidence lower bound

$\mathrm{ELBO}(q) = \log p(y) - \mathrm{KL}\{q(\lambda, \eta, \sigma, z, v)\, \|\, p(\lambda, \eta, \sigma, z, v \mid y)\}$   (4)
$\phantom{\mathrm{ELBO}(q)} = E_q[\log p(y, \lambda, \eta, \sigma, z, v)] - E_q[\log q(\lambda, \eta, \sigma, z, v)].$   (5)

Equation (4) highlights that, since the KL divergence is always non-negative, the ELBO lower-bounds the log-evidence, thus justifying its name. Moreover, since $\log p(y)$ does not depend on $q$, maximizing the ELBO is equivalent to minimizing the KL divergence with respect to $q$. Since (4) involves the intractable posterior, the equivalent expression (5) is used to actually compute the ELBO. The optimization is solved through coordinate ascent, iteratively maximizing the ELBO with respect to each factor on the right-hand side of (3). Following [2, Ch. 10], each factor update is derived as follows (we report only the loadings term for illustrative purposes):

$\log q^*(\lambda) = E_{-\lambda}[\log p(y, \lambda, \eta, \sigma, z, v)] + \mathrm{const},$

where $E_{-\lambda}$ denotes the expectation under $q$ with respect to all variables other than the loadings. One full cycle of the resulting updates is reported in Algorithm 1.

Algorithm 1: One cycle of the variational algorithm for Gaussian factor models.

1. For $j$ from 1 to $p$, set $V_j^{(\lambda)} = \{\mathrm{diag}(\theta_1^*, \ldots, \theta_H^*) + (A^{(\sigma)}/B_j^{(\sigma)})(\mu^{(\eta)T} \mu^{(\eta)} + n V^{(\eta)})\}^{-1}$, where $\theta_h^* = (1 - \sum_{l=1}^{h} \kappa_{hl})\, \theta_0^{-1} + (\sum_{l=1}^{h} \kappa_{hl})\, \theta_\infty^{-1}$, and set $\mu_j^{(\lambda)} = (A^{(\sigma)}/B_j^{(\sigma)})\, V_j^{(\lambda)} \mu^{(\eta)T} y_{\cdot j}$.
2. Set $A^{(\sigma)} = a_\sigma + n/2$. For $j$ from 1 to $p$, set $B_j^{(\sigma)} = b_\sigma + 0.5 \sum_{i=1}^{n} \{y_{ij}^2 - 2 y_{ij} \mu_i^{(\eta)T} \mu_j^{(\lambda)} + \sum_{h=1}^{H} \sum_{k=1}^{H} (\mu_{ih}^{(\eta)} \mu_{ik}^{(\eta)} + V_{hk}^{(\eta)})(\mu_{jh}^{(\lambda)} \mu_{jk}^{(\lambda)} + V_{j;hk}^{(\lambda)})\}$.
3. Set $V^{(\eta)} = \{I_H + \mu^{(\lambda)T} \mathrm{diag}(A^{(\sigma)}/B^{(\sigma)})\, \mu^{(\lambda)} + \sum_{j=1}^{p} (A^{(\sigma)}/B_j^{(\sigma)})\, V_j^{(\lambda)}\}^{-1}$. For $i$ from 1 to $n$, set $\mu_i^{(\eta)} = V^{(\eta)} \mu^{(\lambda)T} \mathrm{diag}(A^{(\sigma)}/B^{(\sigma)})\, y_{i \cdot}$.
4. For $h$ from 1 to $H$: for $l$ from 1 to $h$, set $\kappa_{hl} \propto \exp\{E(\log \omega_l) - 0.5\, p \log \theta_\infty - 0.5\, \theta_\infty^{-1} E[\lambda_{\cdot h}^T \lambda_{\cdot h}]\}$; for $l$ from $h+1$ to $H$, set $\kappa_{hl} \propto \exp\{E(\log \omega_l) - 0.5\, p \log \theta_0 - 0.5\, \theta_0^{-1} E[\lambda_{\cdot h}^T \lambda_{\cdot h}]\}$, where $E[\lambda_{\cdot h}^T \lambda_{\cdot h}] = \sum_{j=1}^{p} (\mu_{jh}^{(\lambda)2} + V_{j;hh}^{(\lambda)})$ and, with $\Psi$ being the digamma function, $E(\log \omega_l) = \mathbb{1}\{l < H\}\{\Psi(A_l^{(v)}) - \Psi(A_l^{(v)} + B_l^{(v)})\} + \mathbb{1}\{l > 1\} \sum_{m=1}^{l-1} \{\Psi(B_m^{(v)}) - \Psi(A_m^{(v)} + B_m^{(v)})\}$.
5. For $h$ from 1 to $H - 1$, set $A_h^{(v)} = 1 + \sum_{l=1}^{H} \kappa_{lh}$ and $B_h^{(v)} = \alpha + \sum_{l=1}^{H} \sum_{m=h+1}^{H} \kappa_{lm}$.
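As an illustration of how steps 4 and 5 of Algorithm 1 translate into code, the following Python sketch performs the $\kappa$ and stick-breaking updates given the current values of $E[\lambda_{\cdot h}^T \lambda_{\cdot h}]$, $A^{(v)}$ and $B^{(v)}$; the function and variable names are ours.

    import numpy as np
    from scipy.special import digamma

    def update_kappa_and_sticks(E_lam2, A_v, B_v, alpha, theta0, theta_inf, p):
        # E_lam2[h] = E[lambda_.h' lambda_.h] (length H); A_v, B_v have length H - 1
        H = len(E_lam2)
        # E(log omega_l) from the digamma identities in step 4 (0-based indices)
        t1 = digamma(A_v) - digamma(A_v + B_v)
        t2 = digamma(B_v) - digamma(A_v + B_v)
        E_log_omega = np.zeros(H)
        E_log_omega[:H - 1] += t1                  # the indicator {l < H} term
        E_log_omega[1:] += np.cumsum(t2)           # the sum over m = 1,...,l-1
        kappa = np.empty((H, H))
        for h in range(H):
            # spike variance theta_inf if l <= h, slab variance theta0 otherwise
            theta = np.where(np.arange(H) <= h, theta_inf, theta0)
            log_k = E_log_omega - 0.5 * p * np.log(theta) - 0.5 * E_lam2[h] / theta
            log_k -= log_k.max()                   # stabilize before normalizing
            kappa[h] = np.exp(log_k) / np.exp(log_k).sum()
        # Step 5: variational Beta parameters of the sticks
        A_v_new = 1.0 + kappa[:, :H - 1].sum(axis=0)
        B_v_new = alpha + np.array([kappa[:, m + 1:].sum() for m in range(H - 1)])
        return kappa, A_v_new, B_v_new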
With no parametric assumption on the factors in (3), we obtain

$q^*(\lambda, \eta, \sigma, z, v) = \prod_{j=1}^{p} N_H(\lambda_{j \cdot};\, \mu_j^{(\lambda)}, V_j^{(\lambda)}) \prod_{i=1}^{n} N_H(\eta_{i \cdot};\, \mu_i^{(\eta)}, V^{(\eta)}) \prod_{j=1}^{p} \mathrm{InvGa}(\sigma_j^2;\, A^{(\sigma)}, B_j^{(\sigma)}) \cdot \prod_{h=1}^{H} \mathrm{Mult}(z_h;\, 1, \kappa_h) \prod_{h=1}^{H-1} \mathrm{Beta}(v_h;\, A_h^{(v)}, B_h^{(v)}).$

Notice that each factor further factorizes into exponential-family distributions, thus facilitating computations. The update equations for the parameters are coupled, meaning that each factor update involves expectations with respect to other factors. We then proceed by iteratively cycling over the steps of Algorithm 1. This routine converges to a local maximum, hence it should be run from several initializations [1]. Convergence of each run can be assessed by monitoring the monotone growth of the ELBO. From the optimal variational parameters we can also compute the variational expectation of the number $H^*$ of factors that are active, in the sense that they are modeled by the slab: $E_{q^*}[H^*] = \sum_{h=1}^{H} \sum_{l=h+1}^{H} \kappa_{hl}$.

4 Application

We compare our variational algorithm for the model in § 2 with the adaptive Gibbs sampler of [5] on the dataset bfi from the R package psych, containing the six-point-scale answers of $n = 126$ individuals older than fifty years to $p = 25$ questions about five personality traits. As in [5], we center the 25 items and, to have coherent answers within each personality trait, we change sign to answers 1, 9, 10, 11, 12, 22 and 25, as suggested in the documentation of the bfi dataset.
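In Python, a possible preprocessing path is sketched below; the original analysis handles the data in R through the psych package, so fetching bfi via the Rdatasets interface of statsmodels is an assumption of ours, as is dropping incomplete records.

    import statsmodels.api as sm

    bfi = sm.datasets.get_rdataset("bfi", "psych").data
    items = [c for c in bfi.columns if c not in ("rownames", "gender", "education", "age")]
    X = bfi.loc[bfi["age"] > 50, items].dropna().to_numpy(dtype=float)
    for j in (0, 8, 9, 10, 11, 21, 24):   # reverse-scored items 1, 9, 10, 11, 12, 22, 25
        X[:, j] = -X[:, j]
    X -= X.mean(axis=0)                   # center the 25 items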
For the adaptive Gibbs sampler, the model and the hyperparameters are specified as in [5]. For our variational algorithm, we set $\alpha = 5$, $\theta_0 = 1$, $\theta_\infty = 10^{-}$ and we conservatively let $H = p + 1$, which coincides with the initial value of $H$ for the adaptive Gibbs sampler and corresponds to at most $p$ latent factors.

We run the variational algorithm from 20 random initializations, stopping each run when the ELBO grows by less than 0.05. We then pick the run reaching the highest ELBO. Using the optimal variational parameters of this run, we draw a sample of size 2000 for $\Omega$, from which we derive a sample for the correlation matrix $\Omega^* = (\Omega \odot I_p)^{-1/2}\, \Omega\, (\Omega \odot I_p)^{-1/2}$, with $\odot$ denoting the element-wise product. From this sample we compute a Monte Carlo estimate of the mean squared deviation $\sum_{j=1}^{p} \sum_{q=j}^{p} E\{(\Omega^*_{jq} - S_{jq})^2\} / \{p(p+1)/2\}$ between $\Omega^*$ and the sample correlation matrix $S$. The same quantity is computed from a posterior sample of equal size, obtained by running the adaptive Gibbs sampler of [5] for 10000 iterations after a burn-in of 5000 and then thinning every five iterations. The two quantities are reported as MSE (mean squared error) in Table 1, together with the expected number of active factors and the total running time of each method.

Table 1: Performance of the adaptive Gibbs sampler and the variational algorithm on the bfi dataset.

Method                   MSE    E[H*]   Running time (s)
Adaptive Gibbs sampler   0.01   2.7     340
Variational inference    0.01   3.0     63

With respect to the adaptive Gibbs sampler, the proposed variational algorithm provides the same MSE (rounded to the second decimal digit) and a similar expected number of active factors, but is more than five times faster.
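For completeness, the Monte Carlo summaries above can be obtained from the optimal variational parameters along the lines of the following sketch, where mu_lam ($p \times H$), V_lam ($p \times H \times H$), A_sig, B_sig (length $p$) and kappa ($H \times H$) are assumed to hold the fitted values and S is the sample correlation matrix; the names are ours.

    import numpy as np

    def posterior_summaries(mu_lam, V_lam, A_sig, B_sig, kappa, S, n_draws=2000, seed=0):
        rng = np.random.default_rng(seed)
        p, H = mu_lam.shape
        iu = np.triu_indices(p)                   # the p(p+1)/2 pairs with j <= q
        mse = 0.0
        for _ in range(n_draws):
            # draw Lambda and sigma_j^2 from q*, then form Omega = Lam Lam' + Sigma
            Lam = np.stack([rng.multivariate_normal(mu_lam[j], V_lam[j]) for j in range(p)])
            sigma2 = 1.0 / rng.gamma(A_sig, 1.0 / B_sig)   # sigma_j^2 ~ InvGa(A_sig, B_sig_j)
            Omega = Lam @ Lam.T + np.diag(sigma2)
            d = 1.0 / np.sqrt(np.diag(Omega))
            Omega_star = d[:, None] * Omega * d[None, :]   # correlation matrix Omega*
            mse += ((Omega_star - S)[iu] ** 2).mean() / n_draws
        EH = np.triu(kappa, k=1).sum()            # E[H*] = sum_h sum_{l > h} kappa_hl
        return mse, EH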
Acknowledgments

The author is grateful to Daniele Durante for his helpful comments, and acknowledges support from the MIUR-PRIN 2017 project 20177BRJXS.
References

[1] Blei, D. M., Kucukelbir, A., McAuliffe, J. D.: Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017)
[2] Bishop, C. M.: Pattern recognition and machine learning. Springer (2006)
[3] Dunson, D. B., Herring, A. H.: Bayesian latent variable models for mixed discrete outcomes. Biostatistics 6, 11–25 (2005)
[4] Ishwaran, H., James, L. F.: Gibbs sampling methods for stick-breaking priors. J. Am. Stat. Assoc. 96, 161–173 (2001)
[5] Legramanti, S., Durante, D., Dunson, D. B.: Bayesian cumulative shrinkage for infinite factorizations. Biometrika 107, 745–752 (2020)