Stein's Lemma for the Reparameterization Trick with Exponential Family Mixtures
arXiv preprint [stat.ML].
Wu Lin, Mohammad Emtiyaz Khan, Mark Schmidt

Abstract
Stein's method (Stein, 1973; 1981) is a powerful tool for statistical applications and has had a significant impact in machine learning. Stein's lemma plays an essential role in Stein's method. Previous applications of Stein's lemma either required strong technical assumptions or were limited to Gaussian distributions with restricted covariance structures. In this work, we extend Stein's lemma to exponential-family mixture distributions, including Gaussian distributions with full covariance structures. Our generalization enables us to establish a connection between Stein's lemma and the reparameterization trick, and to derive gradients of expectations of a large class of functions under weak assumptions. Using this connection, we can derive many new reparameterizable gradient identities that go beyond the reach of existing works. For example, we give gradient identities when the expectation is taken with respect to the Student's t-distribution, the skew Gaussian, the exponentially modified Gaussian, and the normal inverse Gaussian.
1. Introduction
Stein's lemma (Stein, 1973; 1981; Liu, 1994) plays an essential role in Stein's method. The lemma gives a first-order identity to estimate the mean of a multivariate Gaussian distribution. In machine learning, Fan et al. (2015); Erdogdu (2015); Rezende et al. (2014) use integration by parts to extend the lemma without giving technical conditions. Other applications of Stein's lemma are De Bruijn's identity (Park et al., 2012) and the heat equation identity (Brown et al., 2006). These two works give the same second-order identity to estimate the covariance of a multivariate Gaussian distribution. In practice, the second-order identity gives a better unbiased estimator than the one obtained from the first-order identity (Khan & Lin, 2017; Khan et al., 2018; Salimans & Knowles, 2013). However, Park et al. (2012); Brown et al. (2006) use stronger assumptions to simplify proofs: the authors assume either a diagonal covariance structure or twice continuous differentiability. In gradient estimation, the first-order identity is known as Bonnet's theorem (Bonnet, 1964), which gives the reparameterization gradient for the mean (Kingma & Welling, 2013). The second-order identity is known as Price's theorem (Price, 1958). However, Price (1958); Bonnet (1964) use characteristic functions as the proof technique. This technique is not easy to extend to Gaussian mixtures or to use to identify weak assumptions. Beyond the Gaussian distribution, versions of Stein's lemma are proposed by Hudson et al. (1978); Brown (1986). In machine learning, Salimans & Knowles (2013); Figurnov et al. (2018) propose an implicit reparameterization trick under continuous differentiability for a class of distributions.

In this work, we generalize Stein's lemma to Gaussian variance-mean mixtures with arbitrary covariance structure and to exponential-family mixtures, while keeping assumptions weak and proofs simple. Moreover, we present a second-order identity for the covariance estimation of Gaussian mixtures. Our theory also shows a direct connection between Stein's lemma and reparameterizable gradient estimation. Furthermore, we show that we can obtain the implicit reparameterization trick via Stein's lemma under weaker assumptions than Salimans & Knowles (2013); Figurnov et al. (2018). Last but not least, we give examples of gradient identities derived from our theory, such as for the multivariate Student's t-distribution, the multivariate skew Gaussian, the multivariate exponentially modified Gaussian, and the multivariate normal inverse Gaussian. These identities are useful in variational inference with Gaussian variance-mean mixture approximations, as shown in Lin et al. (2019).

University of British Columbia, Vancouver, Canada. RIKEN Center for Advanced Intelligence Project, Tokyo, Japan. Correspondence to: Wu Lin < [email protected] >. Technical Report, Work in Progress.
2. Related Works
There are many existing works about Stein's lemma. Stein (1973; 1981) gives a first-order identity for a multivariate Gaussian with diagonal covariance structure. Liu (1994) extends the first-order identity to a multivariate Gaussian with arbitrary covariance structure. For gradient estimation, Stein's lemma indeed recovers Bonnet's theorem (Bonnet, 1964). Price's theorem (Price, 1958) gives the second-order identity for a multivariate Gaussian with arbitrary covariance structure. However, Price (1958); Opper & Archambeau (2009) use the characteristic function of the Gaussian to prove Price's theorem, which is not easy to extend to the Gaussian mixture case. Hudson et al. (1978); Brown (1986); Arnold et al. (2001); Landsman (2006); Landsman & Nešlehová (2008); Kattumannil (2009); Kattumannil & Dewan (2016); Adcock (2007); Adcock & Shutes (2012) further extend Stein's lemma to the exponential family and beyond. Unfortunately, these works neither show the connection between the gradient identity and the implicit reparameterization trick nor give any second-order identity.
3. Smoothness Assumptions
We first give smoothness conditions that will be used in the gradient identities. The key definition is the absolute continuity (AC) of a function $h(\cdot): [a, b] \mapsto \mathbb{R}^m$, where $[a, b]$ is a compact interval in $\mathbb{R}$.

Definition 1

A vector function $h(\cdot): [a, b] \mapsto \mathbb{R}^m$ is AC if the following assumptions are satisfied.
• Its derivative $\nabla_z h(z)$ exists almost everywhere for $z \in [a, b]$.
• The derivative is Lebesgue integrable; in other words, $\int_a^b \|\nabla_z h(z)\|\, dz < \infty$, where $\|\cdot\|$ denotes the Euclidean norm.
• The fundamental theorem of calculus holds, that is, $h(z) = h(a) + \int_a^z \nabla_t h(t)\, dt$ for any $z \in [a, b]$.

Since we want to deal with a class of functions whose domain is $\mathbb{R}$, we define local absolute continuity for this class of functions.

Definition 2

Let $h(\cdot): \mathbb{R} \mapsto \mathbb{R}^m$ be a vector function. If $h(\cdot)$ is AC on every compact interval of its domain $\mathbb{R}$, we say that the function is locally AC.

The property below is essential in the following sections.

Property 1

A product of two locally AC functions is also locally AC.

Now, we extend the definition of locally AC to a set of functions whose domain is $\mathbb{R}^d$. It is known as absolute continuity on almost every straight line (ACL) (Leoni, 2017).

Definition 3

Let $h(z): \mathbb{R}^d \mapsto \mathbb{R}^m$ be a vector function. Given a fixed $z_{-i} \in \mathbb{R}^{d-1}$, define the function $h_i(\cdot) := h(\cdot, z_{-i}): \mathbb{R} \mapsto \mathbb{R}^m$. We say the function $h(\cdot)$ is locally ACL if, for all $i$ and almost every point $z_{-i} \in \mathbb{R}^{d-1}$, $h_i(\cdot)$ is locally AC.

A locally ACL function is a member of the Sobolev family (Leoni, 2017). Obviously, the derivative $\nabla_z h(z)$ exists almost everywhere if $h(z)$ is locally ACL. Due to Royden & Fitzpatrick (2010), a locally Lipschitz-continuous function is locally ACL. The immediate consequence is that a function is locally ACL and continuous if it is either locally Lipschitz-continuous or continuously differentiable.

In the following sections, we assume that the regularity conditions for swapping differentiation and integration are satisfied, so that the following identity holds.
$$\nabla_\lambda \mathbb{E}_{q(z|\lambda)}[h(z)] = \mathbb{E}_{q(z|\lambda)}\left[ h(z) \frac{\nabla_\lambda q(z|\lambda)}{q(z|\lambda)} \right]$$
The regularity conditions are required to use the dominated convergence theorem, which allows us to interchange differentiation and integration. One particular condition is
$$\mathbb{E}_{q(z|\lambda)}\left[ \left\| h(z) \frac{\nabla_\lambda q(z|\lambda)}{q(z|\lambda)} \right\| \right] < \infty.$$
For simplicity, we assume the regularity conditions hold without explicitly mentioning them.
4. Expectation and Conditional Expectation
By definition, an expectation can be non-existent or non-finite. To avoid such cases, we say an expectation $\mathbb{E}_{q(z)}[h(z)]$ is well-defined if $\mathbb{E}_{q(z)}[\|h(z)\|] < \infty$, where $\|\cdot\|$ is an appropriate norm. Due to Fubini's theorem, the following identity holds for a random vector $z$ on a product measure when the expectation is well-defined.
$$\mathbb{E}_{q(z)}[h(z)] = \mathbb{E}_{q(z_{-i})}\left[\mathbb{E}_{q(z_i|z_{-i})}[h(z)]\right]$$
The above expression shows that the conditional expectation $\mathbb{E}_{q(z_i|z_{-i})}[h(z)]$ is also well-defined for almost every $z_{-i}$. For simplicity, we implicitly assume expectations are well-defined in the following sections.
5. Identities for Gaussian Distribution
Now, we describe the univariate case of Stein’s lemma.
Lemma 1 (Stein’s Lemma):
Let $h(\cdot): \mathbb{R} \mapsto \mathbb{R}$ be locally AC, and let $q(z)$ be a univariate Gaussian distribution denoted by $N(z|\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$. The following first-order identity holds,
$$\mathbb{E}_q\left[ \frac{-\nabla_z q(z)}{q(z)}\, h(z) \right] = \mathbb{E}_q[\nabla_z h(z)],$$
where $\frac{-\nabla_z q(z)}{q(z)} = \sigma^{-2}(z - \mu)$. The proof of this lemma is given in Appendix A.1.

Let's consider Bonnet's theorem given below. This theorem establishes the connection between this lemma and the reparameterizable gradient for the mean $\mu$.

Theorem 1 (Bonnet's Theorem):
Let $h(\cdot): \mathbb{R} \mapsto \mathbb{R}$ be locally AC, and let $q(z)$ be a univariate Gaussian distribution with mean $\mu$ and variance $\sigma^2$. We have the following gradient identity.
$$\nabla_\mu \mathbb{E}_q[h(z)] = \mathbb{E}_q[\nabla_z h(z)] = \mathbb{E}_q\left[\sigma^{-2}(z - \mu)\, h(z)\right]$$
The proof of Bonnet's theorem is fairly simple. First, we swap differentiation and integration to obtain the expression $\nabla_\mu \mathbb{E}_q[h(z)] = \mathbb{E}_q[\sigma^{-2}(z - \mu) h(z)]$. By applying Lemma 1 to $h(z)$, we obtain the identity $\mathbb{E}_q[\nabla_z h(z)] = \mathbb{E}_q[\sigma^{-2}(z - \mu) h(z)]$. Clearly, the reparameterizable gradient for the mean $\mu$ is directly derived from Stein's lemma. Now, we discuss the reparameterizable gradient for the variance. First, we give the following lemma.

Lemma 2
Let $h(\cdot): \mathbb{R} \mapsto \mathbb{R}$ be locally AC. We assume $\mathbb{E}_q[h(z)]$ is well-defined. The following identity holds.
$$\mathbb{E}_q\left[\sigma^{-4}\left((z - \mu)^2 - \sigma^2\right) h(z)\right] = \mathbb{E}_q\left[\sigma^{-2}(z - \mu)\, \nabla_z h(z)\right]$$
The key idea of the proof is to define the auxiliary function $f(z) := \sigma^{-2}(z - \mu) h(z)$ and apply Lemma 1 to $f(z)$. By the lemma, we have the following result.
$$\mathbb{E}_q\left[\sigma^{-4}(z - \mu)^2 h(z)\right] = \mathbb{E}_q\left[\sigma^{-2}(z - \mu) f(z)\right] = \mathbb{E}_q[\nabla_z f(z)] = \mathbb{E}_q\left[\nabla_z\left(\sigma^{-2}(z - \mu) h(z)\right)\right] = \mathbb{E}_q\left[\sigma^{-2} h(z) + \sigma^{-2}(z - \mu) \nabla_z h(z)\right]$$
From this expression, we can easily obtain the identity by moving the term $\mathbb{E}_q[\sigma^{-2} h(z)]$ to the left-hand side. Now, we discuss the identity to compute the reparameterizable gradient for the variance.
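Before moving on, a quick numerical sketch of Theorem 1 (our own toy values; the function $h(z) = |z|$ is an assumption for illustration). This $h$ is locally AC but not differentiable at zero, which illustrates the weak assumptions under which the identity holds: the a.e. derivative $\mathrm{sign}(z)$ suffices.

```python
import numpy as np
from scipy import integrate, stats

# Sketch (assumed setup, not from the paper): Bonnet's theorem for h(z) = |z|,
# which is locally AC but not differentiable at z = 0.
mu, sigma2 = 0.7, 1.3
sigma = np.sqrt(sigma2)

def E_mu(f, m):
    # E_{N(m, sigma2)}[f(z)] by adaptive quadrature
    return integrate.quad(lambda z: f(z) * stats.norm.pdf(z, m, sigma),
                          -np.inf, np.inf)[0]

h, dh = np.abs, np.sign          # sign(z) is the a.e. derivative of |z|

eps = 1e-3
grad_fd = (E_mu(h, mu + eps) - E_mu(h, mu - eps)) / (2 * eps)  # grad_mu E_q[h]
grad_rep = E_mu(dh, mu)                                        # E_q[grad_z h]
grad_score = E_mu(lambda z: (z - mu) / sigma2 * h(z), mu)      # score form

assert abs(grad_fd - grad_rep) < 1e-4
assert abs(grad_rep - grad_score) < 1e-5
```

All three estimators of $\nabla_\mu \mathbb{E}_q[|z|]$ agree, as Theorem 1 predicts.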
Lemma 3
Let $h(z): \mathbb{R} \mapsto \mathbb{R}$ be locally AC. We assume the conditions of Lemma 2 are satisfied. The following gradient identity holds.
$$\nabla_{\sigma^2} \mathbb{E}_q[h(z)] = \tfrac{1}{2}\mathbb{E}_q\left[\sigma^{-2}(z - \mu)\, \nabla_z h(z)\right]$$
Likewise, the proof of this lemma is simple. First, we swap differentiation and integration to obtain the expression $\nabla_{\sigma^2} \mathbb{E}_q[h(z)] = \tfrac{1}{2}\mathbb{E}_q\left[\sigma^{-4}\left((z - \mu)^2 - \sigma^2\right) h(z)\right]$. After that, we obtain the above identity by Lemma 2. At this point, we can see that the reparameterizable gradient for the Gaussian can be derived from Stein's lemma. Furthermore, Stein's lemma empirically gives a low-variance, unbiased gradient estimator if we are allowed to use second-order information. This idea is known as Price's theorem. Before we discuss Price's theorem, we first describe the following lemma.

Lemma 4
Let $h(z): \mathbb{R} \mapsto \mathbb{R}$ be continuously differentiable and, additionally, let its derivative $\nabla_z h(z): \mathbb{R} \mapsto \mathbb{R}$ be locally AC. $q(z)$ is a univariate Gaussian distribution denoted by $N(z|\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$. We have the following identity.
$$\mathbb{E}_q\left[\nabla_z^2 h(z)\right] = \mathbb{E}_q\left[\sigma^{-2}(z - \mu)\, \nabla_z h(z)\right]$$
The key idea of the proof is to define the auxiliary function $f(z) := \nabla_z h(z)$ and apply Lemma 1 to $f(z)$. Using the above lemmas, we obtain Price's theorem as shown below.

Theorem 2 (Price's Theorem):
Let $h(z): \mathbb{R} \mapsto \mathbb{R}$ be continuously differentiable and let its derivative $\nabla_z h(z)$ be locally AC. We further assume $\mathbb{E}_q[h(z)]$ is well-defined. The following second-order identity holds.
$$\nabla_{\sigma^2} \mathbb{E}_q[h(z)] = \tfrac{1}{2}\mathbb{E}_q\left[\sigma^{-2}(z - \mu)\, \nabla_z h(z)\right] = \tfrac{1}{2}\mathbb{E}_q\left[\nabla_z^2 h(z)\right]$$
The above theorem readily follows from Lemma 4 and Lemma 3. Now, we describe Stein's lemma for a multivariate Gaussian with arbitrary covariance structure.
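The univariate Price identity can also be checked numerically. A sketch with our own values (the choice $h(z) = z^3$, for which $\nabla_z^2 h = 6z$, is an assumption for illustration): analytically $\mathbb{E}_q[z^3] = \mu^3 + 3\mu\sigma^2$, so $\nabla_{\sigma^2}\mathbb{E}_q[z^3] = 3\mu$, and both estimators should recover this value.

```python
import numpy as np
from scipy import integrate, stats

# Sketch with our own toy values: Price's theorem for h(z) = z^3.
mu, sigma2 = 0.5, 0.8

def E(f, s2):
    # E_{N(mu, s2)}[f(z)] by adaptive quadrature
    pdf = lambda z: stats.norm.pdf(z, mu, np.sqrt(s2))
    return integrate.quad(lambda z: f(z) * pdf(z), -np.inf, np.inf)[0]

# grad_{sigma^2} E[z^3] by finite differences; E[z^3] = mu^3 + 3*mu*sigma^2
eps = 1e-3
lhs = (E(lambda z: z**3, sigma2 + eps) - E(lambda z: z**3, sigma2 - eps)) / (2 * eps)

# (1/2) E[grad^2 h] and (1/2) E[sigma^{-2}(z - mu) grad h]
half_second = 0.5 * E(lambda z: 6.0 * z, sigma2)
half_first = 0.5 * E(lambda z: (z - mu) / sigma2 * 3 * z**2, sigma2)

assert abs(lhs - 3 * mu) < 1e-4
assert abs(half_second - 3 * mu) < 1e-6
assert abs(half_first - 3 * mu) < 1e-6
```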
Lemma 5 (Stein’s Lemma):
Let $h(z): \mathbb{R}^d \mapsto \mathbb{R}$ be locally ACL and continuous, and let $q(z)$ be a multivariate Gaussian distribution denoted by $N(z|\mu, \Sigma)$ with mean $\mu$ and covariance $\Sigma$. The following identity holds.
$$\mathbb{E}_q\left[\Sigma^{-1}(z - \mu)\, h(z)\right] = \mathbb{E}_q[\nabla_z h(z)]$$
The proof can be found in Appendix A.2. Bonnet's theorem and Price's theorem are given below; their proofs can be found in Appendix A.3 and A.4, respectively.
Theorem 3 (Bonnet’s Theorem):
Let $h(z): \mathbb{R}^d \mapsto \mathbb{R}$ be locally ACL and continuous, and let $q(z)$ be a multivariate Gaussian distribution denoted by $N(z|\mu, \Sigma)$. The following first-order identity holds.
$$\nabla_\mu \mathbb{E}_q[h(z)] = \mathbb{E}_q[\nabla_z h(z)] = \mathbb{E}_q\left[\Sigma^{-1}(z - \mu)\, h(z)\right]$$

Theorem 4 (Price's Theorem):
Let $h(z): \mathbb{R}^d \mapsto \mathbb{R}$ be continuously differentiable and let its derivative $\nabla_z h(z)$ be locally ACL. Furthermore, we assume $\mathbb{E}_q[h(z)]$ is well-defined. The following second-order identity holds.
$$\nabla_\Sigma \mathbb{E}_q[h(z)] = \tfrac{1}{2}\mathbb{E}_q\left[\Sigma^{-1}(z - \mu)\, \nabla_z^T h(z)\right] = \tfrac{1}{2}\mathbb{E}_q\left[\nabla_z^2 h(z)\right]$$
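A Monte Carlo sketch of the multivariate Price identity, using our own toy values (the quadratic $h(z) = z^T A z$ and the particular $\mu$, $\Sigma$, $A$ below are assumptions for illustration). For this $h$, the Hessian is the constant matrix $A + A^T$, so $\nabla_\Sigma \mathbb{E}_q[h] = \tfrac{1}{2}(A + A^T)$ exactly, and the sample estimate of $\tfrac{1}{2}\mathbb{E}_q[\Sigma^{-1}(z - \mu)\nabla_z^T h]$ should converge to it.

```python
import numpy as np

# Monte Carlo sketch (our own toy values) of the multivariate Price identity
# for the quadratic h(z) = z^T A z, whose Hessian is the constant A + A^T.
rng = np.random.default_rng(0)
mu = np.array([0.5, -0.3])
Sigma = np.array([[1.0, 0.3], [0.3, 0.5]])
A = np.array([[1.0, 2.0], [0.0, 1.0]])
A_sym = 0.5 * (A + A.T)

# Analytically E_q[h] = mu^T A mu + tr(A Sigma), so grad_Sigma E_q[h] = A_sym,
# which also equals (1/2) E_q[grad^2 h].
z = rng.multivariate_normal(mu, Sigma, size=500_000)
grad_h = z @ (A + A.T)                     # grad h(z) = (A + A^T) z, row-wise
Sinv = np.linalg.inv(Sigma)

# (1/2) E[Sigma^{-1} (z - mu) grad^T h], estimated from samples
est = 0.5 * (Sinv @ (z - mu).T @ grad_h) / len(z)

assert np.allclose(est, A_sym, atol=0.1)
```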
6. Identities for the Univariate Continuous Exponential Family
We can generalize Stein's identity to a class of exponential-family distributions. First of all, we say a function is locally AC on the domain $(l, u)$, where $-\infty \le l < u \le \infty$, if the function is AC on every compact interval inside its domain. We consider the following exponential-family (EF) distribution with $z \in (l, u)$:
$$q(z|\lambda) = h_z(z) \exp\{\langle T_z(z), \phi_z(\lambda)\rangle - A_z(\lambda)\},$$
where $l$ and $u$ do not depend on $\lambda$. Furthermore, we assume $q(z|\lambda)$ is locally AC w.r.t. $z$ and differentiable w.r.t. $\lambda$. In the case when $\phi_z(\lambda) = \lambda$, it can be shown that $q(z|\lambda)$ is differentiable w.r.t. $\lambda$.

We denote the CDF of $q(z|\lambda)$ by $\psi(z, \lambda) := \int_l^z q(t|\lambda)\, dt$. The following assumption is known as the boundary condition in the literature.
$$\lim_{z \downarrow l} h(z) q(z|\lambda) = 0, \quad \lim_{z \uparrow u} h(z) q(z|\lambda) = 0$$

Lemma 6 (Stein's Lemma):
Let $h(\cdot): (l, u) \mapsto \mathbb{R}$ be locally AC. If the boundary condition is satisfied, the following identity holds.
$$-\mathbb{E}_q\left[ h(z)\, \frac{\nabla_z q(z|\lambda)}{q(z|\lambda)} \right] = \mathbb{E}_q[\nabla_z h(z)]$$
In the Gaussian case, we can further exploit the structure of the Gaussian, as shown in Lemma 8, so that the boundary condition is implicitly satisfied. For general cases, we have to explicitly assume that the boundary condition is satisfied. The proof is exactly the same as the proof of Lemma 1 given in Appendix A.1.

Applying Lemma 6 to $\tilde{f}_i(z)$ defined below, we obtain the implicit reparameterization trick.

Theorem 5 (Implicit Reparameterization Trick):
Let $h(\cdot): (l, u) \mapsto \mathbb{R}$ be a locally AC function. We define $f_i(z) := \frac{\nabla_{\lambda_i} \psi(z, \lambda)}{q(z|\lambda)}$, where $\lambda_i$ is a scalar. If the conditions of Lemma 6 for $\tilde{f}_i(z) := h(z) f_i(z)$ are satisfied, we have the following identity.
$$\nabla_{\lambda_i} \mathbb{E}_q[h(z)] = -\mathbb{E}_q[f_i(z)\, \nabla_z h(z)]$$
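A sketch of Theorem 5 for the exponential distribution with rate $\lambda$, which is our own illustration (the paper's worked examples are in its appendices). Here $\psi(z, \lambda) = 1 - e^{-\lambda z}$ and $q(z|\lambda) = \lambda e^{-\lambda z}$, so $f(z) = \nabla_\lambda \psi / q = z/\lambda$; the test function $h(z) = z^2$ is also an assumption.

```python
import numpy as np
from scipy import integrate

# Sketch for Exponential(rate lam), our own EF illustration. The CDF is
# psi(z, lam) = 1 - exp(-lam z), so f(z) = (d psi / d lam) / q(z|lam) = z / lam.
lam = 1.7
pdf = lambda z, l: l * np.exp(-l * z)
h, dh = lambda z: z ** 2, lambda z: 2 * z

def E(f, l):
    # E_q[f(z)] by adaptive quadrature over the support (0, inf)
    return integrate.quad(lambda z: f(z) * pdf(z, l), 0, np.inf)[0]

eps = 1e-3
lhs = (E(h, lam + eps) - E(h, lam - eps)) / (2 * eps)   # grad_lam E_q[h]
rhs = -E(lambda z: (z / lam) * dh(z), lam)              # -E_q[f(z) grad_z h]

# Analytically E[z^2] = 2 / lam^2, so both sides equal -4 / lam^3
assert abs(lhs - (-4 / lam ** 3)) < 1e-4
assert abs(rhs - (-4 / lam ** 3)) < 1e-6
```

The boundary condition of Lemma 6 holds here since $\tilde{f}(z) = h(z) f(z)$ vanishes at $z = 0$ and decays exponentially as $z \uparrow \infty$.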
7. Identities for Exponential-family Mixtures
We consider the following Gaussian variance-mean mixture,
$$q(w, z|\mu, \alpha, \Sigma) := N(z|\mu + u(w)\alpha,\; v(w)\Sigma)\, q(w)$$
$$q(z|\mu, \alpha, \Sigma) := \int q(w, z|\mu, \alpha, \Sigma)\, dw,$$
where $v(w) > 0$.

Theorem 6 (Bonnet's Theorem):
Let $h(z): \mathbb{R}^d \mapsto \mathbb{R}$ be locally ACL and continuous. $q(z)$ is a Gaussian variance-mean mixture and $q(w, z)$ is the joint distribution. The following gradient identities hold.
$$\nabla_\mu \mathbb{E}_{q(z)}[h(z)] = \mathbb{E}_{q(z)}[\nabla_z h(z)]$$
$$\nabla_\alpha \mathbb{E}_{q(z)}[h(z)] = \mathbb{E}_{q(w,z)}[u(w)\, \nabla_z h(z)]$$

Corollary 6.1 If $u(w)$ has the following property,
$$\int u(w)\, q(w, z|\mu, \alpha, \Sigma)\, dw = \sum_{j=1}^{k} u_j(z, \mu, \alpha, \Sigma)\, \hat{q}_j(z),$$
where each $\hat{q}_j(z)$ is a normalized distribution and $k$ is finite, the following identity also holds.
$$\nabla_\alpha \mathbb{E}_{q(z)}[h(z)] = \sum_{j=1}^{k} \mathbb{E}_{\hat{q}_j(z)}[u_j(z, \mu, \alpha, \Sigma)\, \nabla_z h(z)]$$

Example 6.1
A concrete example is the multivariate skew Gaussian distribution, which can be found in Appendix C.2.
Example 6.2
Another example is the multivariate exponentially modified Gaussian distribution, which can be found in Appendix C.3.
Theorem 7 (Price’s Theorem):
Let $h(z): \mathbb{R}^d \mapsto \mathbb{R}$ be continuously differentiable such that its derivative $\nabla_z h(z)$ is locally ACL. If $\mathbb{E}_{q(w,z)}[v(w) h(z)]$ is well-defined, the following identity holds.
$$\nabla_\Sigma \mathbb{E}_{q(z)}[h(z)] = \tfrac{1}{2}\mathbb{E}_{q(w,z)}\left[v(w)\, \nabla_z^2 h(z)\right] = \tfrac{1}{2}\mathbb{E}_{q(w,z)}\left[\Sigma^{-1}(z - \mu - u(w)\alpha)\, \nabla_z^T h(z)\right]$$

Corollary 7.1 If $v(w)$ has the following property,
$$\int v(w)\, q(w, z|\mu, \alpha, \Sigma)\, dw = \sum_{j=1}^{k} v_j(z, \mu, \alpha, \Sigma)\, \hat{q}_j(z),$$
where each $\hat{q}_j(z)$ is a normalized distribution and $k$ is finite, the following identity also holds.
$$\nabla_\Sigma \mathbb{E}_{q(z)}[h(z)] = \tfrac{1}{2}\mathbb{E}_{q(w,z)}\left[v(w)\, \nabla_z^2 h(z)\right] = \tfrac{1}{2}\sum_{j=1}^{k} \mathbb{E}_{\hat{q}_j(z)}\left[v_j(z, \mu, \alpha, \Sigma)\, \nabla_z^2 h(z)\right]$$

Example 7.1
A concrete example is the multivariate Student's t-distribution, which can be found in Appendix C.5.
Example 7.2
Another example is the multivariate normal inverse-Gaussian distribution, which can be found in Appendix C.6.
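The Student's t-distribution is a Gaussian scale mixture, so the mixture version of Bonnet's theorem (Theorem 6) applies to it. Below is a quick univariate sketch with our own toy values (degrees of freedom, location, scale, and $h = \tanh$ are all assumptions for illustration): the finite-difference gradient of the expectation with respect to the location matches $\mathbb{E}[\nabla_z h(z)]$.

```python
import numpy as np
from scipy import integrate, stats

# Sketch: for a univariate Student's t (a Gaussian scale mixture), the mixture
# version of Bonnet's theorem gives grad_mu E[h] = E[grad_z h]. Toy values.
df, mu, scale = 4.0, 0.3, 1.2
h = np.tanh
dh = lambda z: 1.0 - np.tanh(z) ** 2        # sech^2, computed stably

def E_loc(f, m):
    # E over t(df, loc=m, scale) by adaptive quadrature
    pdf = lambda z: stats.t.pdf(z, df, loc=m, scale=scale)
    return integrate.quad(lambda z: f(z) * pdf(z), -np.inf, np.inf)[0]

eps = 1e-3
grad_fd = (E_loc(h, mu + eps) - E_loc(h, mu - eps)) / (2 * eps)
grad_rep = E_loc(dh, mu)

assert abs(grad_fd - grad_rep) < 1e-4
```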
We consider the following EF mixtures in a product space $z = (z_1, z_2) \in (l_1, u_1) \times (l_2, u_2)$, where $-\infty \le l_1 < u_1 \le \infty$ and $-\infty \le l_2 < u_2 \le \infty$. Moreover, we assume $q(z_1)$, $q(z_2|z_1)$, and $q(z_1, z_2)$ are locally AC, locally ACL, and continuous, respectively.
$$q(z_1|\lambda) = h_1(z_1) \exp\{\langle T_1(z_1), \phi_1(\lambda)\rangle - A_1(\lambda)\}$$
$$q(z_2|z_1, \lambda) = h_2(z_1, z_2) \exp\{\langle T_2(z_1, z_2), \phi_2(\lambda)\rangle - A_2(\lambda, z_1)\}$$
Let's denote the CDF of $q(z_1|\lambda)$ and the conditional CDF of $q(z_2|z_1, \lambda)$ by
$$\psi_1(z_1, \lambda) = \int_{l_1}^{z_1} q(t|\lambda)\, dt, \quad \psi_2(z_2, z_1, \lambda) = \int_{l_2}^{z_2} q(t|z_1, \lambda)\, dt.$$
We define the following functions:
$$\Psi(z, \lambda) := [\psi_1(z_1, \lambda),\; \psi_2(z_2, z_1, \lambda)]^T$$
$$\nabla_{\lambda_i} \Psi(z, \lambda) := [\nabla_{\lambda_i} \psi_1(z_1, \lambda),\; \nabla_{\lambda_i} \psi_2(z_2, z_1, \lambda)]^T$$
$$\nabla_z \Psi(z, \lambda) := \begin{bmatrix} q(z_1|\lambda) & 0 \\ \nabla_{z_1} \psi_2(z_2, z_1, \lambda) & q(z_2|z_1, \lambda) \end{bmatrix}.$$
Applying Lemma 6 to $\tilde{f}_{i,j}(z_j)$ defined below, we obtain the following identity.

Theorem 8 (Bivariate Implicit Reparameterization Trick):
Let $h(\cdot): (l_1, u_1) \times (l_2, u_2) \mapsto \mathbb{R}$ be locally ACL and continuous. First, we define the function $f_{i,j}(z) := e_j^T [\nabla_z \Psi(z, \lambda)]^{-1} \nabla_{\lambda_i} \Psi(z, \lambda)$ and the function $\tilde{f}_{i,j}(z_j) := f_{i,j}(z_j, z_{-j})\, h(z_j, z_{-j}) \prod_{k \ge j+1} q(z_k|z_{k-1}, \lambda)$. If the conditions of Lemma 6 for each $\tilde{f}_{i,j}(z_j)$ are satisfied, we have the following identity.
$$\nabla_{\lambda_i} \mathbb{E}_q[h(z)] = -\mathbb{E}_q\Big[\sum_j f_{i,j}(z)\, \nabla_{z_j} h(z)\Big]$$
The proof of this theorem can be found in Appendix D.1. The identity can easily be extended to the multivariate version of the implicit reparameterization trick. Figurnov et al. (2018) assume that $h(z)$ is continuously differentiable. As shown in Theorem 8, the identity holds even when $h(z)$ is not continuously differentiable.

References
Adcock, C. Extensions of Stein's lemma for the skew-normal distribution. Communications in Statistics - Theory and Methods, 36(9):1661–1671, 2007.

Adcock, C. and Shutes, K. On the multivariate extended skew-normal, normal-exponential, and normal-gamma distributions. Journal of Statistical Theory and Practice, 6(4):636–664, 2012.

Arnold, B. C., Castillo, E., and Sarabia, J. M. A multivariate version of Stein's identity with applications to moment calculations and estimation of conditionally specified distributions. 2001.

Bonnet, G. Transformations des signaux aléatoires a travers les systemes non linéaires sans mémoire. In Annales des Télécommunications, volume 19, pp. 203–220. Springer, 1964.

Border, K. C. Lecture notes: Integration by parts. 1996. Accessed: 2019/06.

Brown, L., DasGupta, A., Haff, L. R., and Strawderman, W. E. The heat equation and Stein's identity: Connections, applications. Journal of Statistical Planning and Inference, 136(7):2254–2278, 2006.

Brown, L. D. Fundamentals of statistical exponential families: with applications in statistical decision theory. IMS, 1986.

Erdogdu, M. A. Newton-Stein method: a second order method for GLMs via Stein's lemma. In Advances in Neural Information Processing Systems, pp. 1216–1224, 2015.

Fan, K., Wang, Z., Beck, J., Kwok, J., and Heller, K. A. Fast second order stochastic backpropagation for variational inference. In Advances in Neural Information Processing Systems, pp. 1387–1395, 2015.

Figurnov, M., Mohamed, S., and Mnih, A. Implicit reparameterization gradients. In Advances in Neural Information Processing Systems, 2018.

Hudson, H. M. et al. A natural identity for exponential families with applications in multiparameter estimation. The Annals of Statistics, 6(3):473–484, 1978.

Jia, R.-Q. Lecture notes: Honors Real Variable II. sites.ualberta.ca/~rjia/Math418/Notes/chap3.pdf, 2010. Accessed: 2019/03/25.

Kattumannil, S. K. On Stein's identity and its applications. Statistics & Probability Letters, 79(12):1444–1449, 2009.

Kattumannil, S. K. and Dewan, I. On generalized moment identity and its applications: a unified approach. Statistics, 50(5):1149–1160, 2016.

Khan, M. and Lin, W. Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. In Artificial Intelligence and Statistics, pp. 878–887, 2017.

Khan, M. E., Nielsen, D., Tangkaratt, V., Lin, W., Gal, Y., and Srivastava, A. Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In Proceedings of the 35th International Conference on Machine Learning, pp. 2611–2620, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Landsman, Z. On the generalization of Stein's lemma for elliptical class of distributions. Statistics & Probability Letters, 76(10):1012–1016, 2006.

Landsman, Z. and Nešlehová, J. Stein's lemma for elliptical random vectors. Journal of Multivariate Analysis, 99(5):912–927, 2008.

Leoni, G. A first course in Sobolev spaces, volume 181. American Mathematical Society, 2017.

Lin, W., Khan, M. E., and Schmidt, M. Fast and simple natural-gradient variational inference with mixture of exponential-family approximations. In International Conference on Machine Learning, pp. 3992–4002, 2019.

Liu, J. S. Siegel's formula via Stein's identities. Statistics & Probability Letters, 21(3):247–251, 1994.

Opper, M. and Archambeau, C. The variational Gaussian approximation revisited. Neural Computation, 21(3):786–792, 2009.

Park, S., Serpedin, E., and Qaraqe, K. On the equivalence between Stein and de Bruijn identities. IEEE Transactions on Information Theory, 58(12):7045–7067, 2012.

Price, R. A useful theorem for nonlinear devices having Gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Royden, H. and Fitzpatrick, P. Real Analysis. 2010.

Salimans, T. and Knowles, D. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.

Stein, C. Estimation of the mean of a multivariate normal distribution. Proc. Prague Symp. Asymptotic Statistics, 1973.

Stein, C. M. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pp. 1135–1151, 1981.
A. Gradient Identities for Gaussian Distribution
For completeness, we give a proof of integration by parts for AC functions, which is essential for many proofs in this paper.
Lemma 7 (Integration by parts):
Let $h(\cdot), q(\cdot): [a, b] \mapsto \mathbb{R}$ be AC functions, where $-\infty < a < b < \infty$. The following identity holds.
$$h(b)q(b) - h(a)q(a) = \int_a^b q(z) \nabla_z h(z)\, dz + \int_a^b h(z) \nabla_z q(z)\, dz$$

Proof: Since $h(z)$ and $q(z)$ are AC on $[a, b]$, we know that the product $h(z)q(z)$ is AC on $[a, b]$. By the product rule for AC functions, the following identity holds almost everywhere for $z \in [a, b]$.
$$\nabla_z (h(z)q(z)) = q(z) \nabla_z h(z) + h(z) \nabla_z q(z) \quad (1)$$
Since $q(z)$ is continuous and $h(z)$ is AC on $[a, b]$, we know that $q(z)\nabla_z h(z)$ is integrable over $[a, b]$. Similarly, we can show $h(z)\nabla_z q(z)$ is integrable over $[a, b]$. Integrating both sides of (1) over $[a, b]$, we obtain the identity.
$$h(b)q(b) - h(a)q(a) = \int_a^b q(z) \nabla_z h(z)\, dz + \int_a^b h(z) \nabla_z q(z)\, dz \qquad \square$$
An alternative proof of Lemma 7 using Fubini's theorem can be found in Theorem 6 of Border (1996). Note that the condition of Fubini's theorem shown below is satisfied,
$$\int_a^b \int_a^b |\nabla_x h(x)\, \nabla_y q(y)|\, dx\, dy < \infty,$$
since, by the definition of AC, we have
$$\int_a^b |\nabla_x h(x)|\, dx < \infty, \quad \int_a^b |\nabla_y q(y)|\, dy < \infty.$$
To use integration by parts in the proof of Lemma 1, we first prove the following lemma.
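As a toy numerical check of Lemma 7 (our own values; the factor $h(z) = |z - 0.2|$ is an assumption chosen to be AC but not everywhere differentiable on the interval):

```python
import numpy as np
from scipy import integrate, stats

# Toy check of integration by parts for AC functions:
# h(z) = |z - 0.2| (kink at 0.2), q(z) the N(0, 1) density, on [a, b].
a, b = -1.0, 2.0
h = lambda z: np.abs(z - 0.2)
dh = lambda z: np.sign(z - 0.2)          # a.e. derivative of h
q = lambda z: stats.norm.pdf(z)
dq = lambda z: -z * stats.norm.pdf(z)    # derivative of the N(0, 1) pdf

lhs = h(b) * q(b) - h(a) * q(a)
rhs = (integrate.quad(lambda z: q(z) * dh(z), a, b, points=[0.2])[0]
       + integrate.quad(lambda z: h(z) * dq(z), a, b, points=[0.2])[0])

assert abs(lhs - rhs) < 1e-9
```

The `points` argument tells the quadrature routine where the kink is, so the a.e. derivative causes no accuracy loss.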
Lemma 8
Let $h(\cdot): \mathbb{R} \mapsto \mathbb{R}$ be a locally AC function and let $q(z) := N(z|\mu, \sigma^2)$ be a univariate Gaussian distribution with mean $\mu$ and variance $\sigma^2$. If $\mathbb{E}_q[\nabla_z h(z)]$ is well-defined ($\mathbb{E}_q[|\nabla_z h(z)|] < \infty$), then the following boundary conditions are satisfied:
$$\lim_{z \uparrow \infty} h(z) q(z) = 0, \quad \lim_{z \downarrow -\infty} h(z) q(z) = 0$$

Proof: We show that $\lim_{z \uparrow \infty} h(z)q(z) = 0$ by showing that $\lim_{z \uparrow \infty} |h(z)|q(z) = 0$, where $q(z) = N(z|\mu, \sigma^2)$. Given a compact interval, since $h(z)$ is AC, by Theorem 3.1 of Jia (2010) we know that $|h(z)|$ is also AC and the following inequality holds almost everywhere for $t$ in the interval.
$$-|\nabla_t h(t)| \le \nabla_t |h(t)| \le |\nabla_t h(t)| \quad (2)$$
Since $|h(z)|$ is AC on the interval, given $c$ in the interval, by the fundamental theorem of calculus we have
$$|h(z)| = |h(c)| + \int_c^z \nabla_t |h(t)|\, dt. \quad (3)$$
Given any $z \ge c \ge \mu$, by (3) and then (2), we have
$$|h(z)|\, q(z) = q(z)\left(|h(c)| + \int_c^z \nabla_t |h(t)|\, dt\right) \le q(z)\left(|h(c)| + \int_c^z |\nabla_t h(t)|\, dt\right) \le q(z)|h(c)| + \int_c^z q(t)|\nabla_t h(t)|\, dt,$$
where we use the monotonicity of the Gaussian: $q(z) \le q(t)$ when $\mu \le c \le t \le z$. Therefore, by the assumption $\mathbb{E}_{q(t)}[|\nabla_t h(t)|] < \infty$, we have
$$\limsup_{z \uparrow \infty} |h(z)|\, q(z) \le \lim_{z \uparrow \infty} q(z)|h(c)| + \int_c^\infty q(t)|\nabla_t h(t)|\, dt = \int_c^\infty q(t)|\nabla_t h(t)|\, dt,$$
where we use the Gaussian property that $\lim_{z \uparrow \infty} q(z) = 0$. Taking $c$ to positive infinity, we obtain the following result, which implies that $\lim_{z \uparrow \infty} |h(z)|q(z) = 0$.
$$0 \le \liminf_{z \uparrow \infty} |h(z)|\, q(z) \le \limsup_{z \uparrow \infty} |h(z)|\, q(z) \le \lim_{c \uparrow \infty} \int_c^\infty q(t)|\nabla_t h(t)|\, dt = 0$$
Likewise, we can show that $\lim_{z \downarrow -\infty} h(z)q(z) = 0$. $\square$

A.1. Proof of Lemma 1 and Lemma 6

Proof:
First, we denote the support by $(l, u)$. In the Gaussian case, $l = -\infty$ and $u = \infty$. Since $\mathbb{E}_q[\nabla_z h(z)]$ is well-defined, we use the following expression to prove the claim,
$$\mathbb{E}_q[\nabla_z h(z)] = \lim_{r \downarrow l} \int_r^c q(z) \nabla_z h(z)\, dz + \lim_{r \uparrow u} \int_c^r q(z) \nabla_z h(z)\, dz,$$
where $c \in (l, u)$ is a constant. Given any compact interval $[r, c]$, we know that $h(z)$ and $q(z)$ are AC on this interval. By integration by parts (Lemma 7), we have
$$h(c)q(c) - h(r)q(r) = \int_r^c q(z) \nabla_z h(z)\, dz + \int_r^c h(z) \nabla_z q(z)\, dz.$$
In the Gaussian case, we have $\lim_{r \downarrow l} h(r)q(r) = 0$ due to Lemma 8. Taking $r$ to $l$, we have
$$h(c)q(c) = \lim_{r \downarrow l}\left[\int_r^c q(z) \nabla_z h(z)\, dz + \int_r^c h(z) \nabla_z q(z)\, dz\right] = \lim_{r \downarrow l} \int_r^c q(z) \nabla_z h(z)\, dz + \lim_{r \downarrow l} \int_r^c h(z) \nabla_z q(z)\, dz. \quad (4)$$
Note that $\lim_{r \downarrow l} \int_r^c q(z)\nabla_z h(z)\, dz$ exists since $\mathbb{E}_q[|\nabla_z h(z)|] < \infty$ ($\mathbb{E}_q[\nabla_z h(z)]$ is well-defined). Since $h(c)q(c)$ is finite, we know that $\lim_{r \downarrow l} \int_r^c h(z)\nabla_z q(z)\, dz$ is also finite. Therefore, (4) is valid. Likewise, given any compact interval $[c, r]$, by integration by parts and taking $r$ to $u$, we have
$$-h(c)q(c) = \lim_{r \uparrow u} \int_c^r q(z) \nabla_z h(z)\, dz + \lim_{r \uparrow u} \int_c^r h(z) \nabla_z q(z)\, dz, \quad (5)$$
where $\lim_{r \uparrow u} h(r)q(r) = 0$. By (4) and (5), we have
$$\mathbb{E}_q[\nabla_z h(z)] = -\mathbb{E}_q\left[h(z)\, \frac{\nabla_z q(z)}{q(z)}\right].$$
In the Gaussian case, we have $\nabla_z q(z) = \sigma^{-2}(\mu - z) q(z)$, which shows that the above expression is the identity. $\square$

A.2. Proof of Lemma 5

Proof:
We denote the $i$-th element of $z$ by $z_i$. Given known $z_{-i}$, we define the function $h_i(z_i) := h(z_i, z_{-i})$. Since $\mathbb{E}_q[\nabla_z h(z)]$ is well-defined, we have
$$\mathbb{E}_q[\nabla_{z_i} h(z)] = \mathbb{E}_{q(z_{-i}) q(z_i|z_{-i})}[\nabla_{z_i} h(z_i, z_{-i})] = \mathbb{E}_{q(z_{-i})}\left[\mathbb{E}_{q(z_i|z_{-i})}[\nabla_{z_i} h_i(z_i)]\right].$$
Without loss of generality, we assume $z_i$ is the last element of $z$; this is possible since we can permute the elements of $z$. Therefore, we can re-express the mean and the covariance matrix as
$$\mu = \begin{bmatrix} \mu_{-i} \\ \mu_i \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{-i,-i} & \Sigma_{-i,i} \\ \Sigma_{-i,i}^T & \Sigma_{i,i} \end{bmatrix}.$$
Note that $q(z_i|z_{-i})$ is a univariate Gaussian distribution denoted by $N(z_i|m, \sigma^2)$, where
$$m = \mu_i + \Sigma_{-i,i}^T \Sigma_{-i,-i}^{-1}(z_{-i} - \mu_{-i}) \quad (6)$$
$$\sigma^2 = \Sigma_{i,i} - \Sigma_{-i,i}^T \Sigma_{-i,-i}^{-1} \Sigma_{-i,i}. \quad (7)$$
Since $h_i(z_i)$ is locally AC, we have the following result by applying Lemma 1 to $h_i(z_i)$.
$$\mathbb{E}_q[\nabla_{z_i} h(z)] = \mathbb{E}_{q(z_{-i})}\left[\mathbb{E}_{q(z_i|z_{-i})}[\nabla_{z_i} h_i(z_i)]\right] = \mathbb{E}_{q(z_{-i})}\left[\mathbb{E}_{q(z_i|z_{-i})}\left[\sigma^{-2}(z_i - m)\, h_i(z_i)\right]\right] = \mathbb{E}_q\left[\sigma^{-2}(z_i - m)\, h(z)\right]$$
Recall that, by assumption, the above expectations are well-defined. It can be verified that
$$\sigma^{-2}(z_i - m) = e_i^T \Sigma^{-1}(z - \mu), \quad (8)$$
where $e_i$ is a one-hot vector with all zero elements except the $i$-th element, which has value $1$. Using the result in (8), we have
$$\mathbb{E}_q[\nabla_{z_i} h(z)] = \mathbb{E}_q\left[\sigma^{-2}(z_i - m)\, h(z)\right] = \mathbb{E}_q\left[e_i^T \Sigma^{-1}(z - \mu)\, h(z)\right]. \quad (9)$$
Therefore, we have $\mathbb{E}_q[\nabla_z h(z)] = \mathbb{E}_q\left[\Sigma^{-1}(z - \mu)\, h(z)\right]$. $\square$

A.3. Proof of Theorem 3

Proof:
We swap integration and differentiation and obtain the following result,
$$\nabla_\mu \mathbb{E}_q[h(z)] = \int h(z) \nabla_\mu N(z|\mu, \Sigma)\, dz = \int h(z)\, \Sigma^{-1}(z - \mu)\, N(z|\mu, \Sigma)\, dz = \mathbb{E}_q\left[\Sigma^{-1}(z - \mu)\, h(z)\right],$$
which is known as the score-function estimator. Using Lemma 5 on the right-hand side, we obtain the gradient identity
$$\nabla_\mu \mathbb{E}_q[h(z)] = \mathbb{E}_q\left[\Sigma^{-1}(z - \mu)\, h(z)\right] = \mathbb{E}_q[\nabla_z h(z)],$$
which is known as the reparameterization trick for the mean $\mu$. $\square$

A.4. Proof of Theorem 4
To prove Theorem 4, we first prove the following lemma, which is a multivariate extension of Lemma 2. By convention, $e_i$ is a one-hot vector with all zero elements except the $i$-th element, which has value $1$.

Lemma 9
Let $h(z): \mathbb{R}^d \mapsto \mathbb{R}$ be locally ACL and continuous. We define the auxiliary vector function $f(z) = \Sigma^{-1}(z - \mu) h(z)$. If $\mathbb{E}_q[h(z)]$ is well-defined, the following identity holds.
$$\mathbb{E}_q\left[\Sigma^{-1}\left[(z - \mu)(z - \mu)^T - \Sigma\right] \Sigma^{-1} h(z)\right] = \mathbb{E}_q\left[\Sigma^{-1}(z - \mu)\, \nabla_z^T h(z)\right]$$

Proof:
We define the auxiliary function $f_i(z) := e_i^T f(z)$. By applying Lemma 5 to $f_i(z)$, we have
$$\mathbb{E}_q\left[\Sigma^{-1}(z - \mu)\, e_i^T f(z)\right] = \mathbb{E}_q\left[\nabla_z\left(e_i^T f(z)\right)\right].$$
Recall that $\mathbb{E}_q\left[\nabla_z\left(e_i^T f(z)\right)\right] = \mathbb{E}_q\left[\Sigma^{-1} e_i\, h(z) + e_i^T \Sigma^{-1}(z - \mu)\, \nabla_z h(z)\right]$. Therefore, we know that
$$\mathbb{E}_q\left[e_j^T \Sigma^{-1}(z - \mu)(z - \mu)^T \Sigma^{-1} e_i\, h(z)\right] = \mathbb{E}_q\left[e_j^T \Sigma^{-1}(z - \mu)\, e_i^T f(z)\right] = \mathbb{E}_q\left[e_j^T \nabla_z\left(e_i^T f(z)\right)\right] = \mathbb{E}_q\left[e_j^T \Sigma^{-1} e_i\, h(z) + e_i^T \Sigma^{-1}(z - \mu)\, e_j^T \nabla_z h(z)\right].$$
Since $\mathbb{E}_q[h(z)]$ is also well-defined, we have
$$\mathbb{E}_q\left[e_i^T \Sigma^{-1}\left((z - \mu)(z - \mu)^T - \Sigma\right) \Sigma^{-1} e_j\, h(z)\right] = \mathbb{E}_q\left[e_j^T \Sigma^{-1}\left((z - \mu)(z - \mu)^T - \Sigma\right) \Sigma^{-1} e_i\, h(z)\right] = \mathbb{E}_q\left[e_i^T \Sigma^{-1}(z - \mu)\, e_j^T \nabla_z h(z)\right] = \mathbb{E}_q\left[e_i^T \Sigma^{-1}(z - \mu)\, \nabla_z^T h(z)\, e_j\right],$$
which implies that
$$\mathbb{E}_q\left[\Sigma^{-1}\left((z - \mu)(z - \mu)^T - \Sigma\right) \Sigma^{-1} h(z)\right] = \mathbb{E}_q\left[\Sigma^{-1}(z - \mu)\, \nabla_z^T h(z)\right]. \qquad \square$$
Next, we prove the following lemma, which is a multivariate extension of Lemma 3.
Lemma 10
Let $h(z): \mathbb{R}^d \mapsto \mathbb{R}$ be locally ACL and continuous. We assume the conditions of Lemma 9 are satisfied. The following gradient identity holds.
\[
\nabla_{\Sigma} E_q[h(z)] = \tfrac{1}{2}\,E_q\left[\Sigma^{-1}(z-\mu)\,\nabla_z^T h(z)\right]
\]
Proof:
By the assumptions, we can interchange the integration and differentiation to obtain the following result.
\begin{align*}
\nabla_{\Sigma} E_q[h(z)] &= \int h(z)\,\nabla_{\Sigma} N(z\,|\,\mu,\Sigma)\,dz = \int h(z)\,\tfrac{1}{2}\Sigma^{-1}\left[(z-\mu)(z-\mu)^T-\Sigma\right]\Sigma^{-1} N(z\,|\,\mu,\Sigma)\,dz \\
&= \tfrac{1}{2}\,E_q\left[\Sigma^{-1}\left[(z-\mu)(z-\mu)^T-\Sigma\right]\Sigma^{-1}\,h(z)\right],
\end{align*}
which is known as the score-function estimator. By Lemma 9, we have
\[
\nabla_{\Sigma} E_q[h(z)] = \tfrac{1}{2}\,E_q\left[\Sigma^{-1}\left[(z-\mu)(z-\mu)^T-\Sigma\right]\Sigma^{-1}\,h(z)\right] = \tfrac{1}{2}\,E_q\left[\Sigma^{-1}(z-\mu)\,\nabla_z^T h(z)\right]. \qquad \square
\]
The following lemma is also useful when we prove Theorem 4. This lemma is a multivariate extension of Lemma 4.
Lemma 11
Let a function $h(z): \mathbb{R}^d \mapsto \mathbb{R}$ be continuously differentiable and let its derivative $\nabla_z h(z): \mathbb{R}^d \mapsto \mathbb{R}^d$ be locally ACL. $q(z)$ is a multivariate Gaussian distribution denoted by $N(z\,|\,\mu,\Sigma)$ with mean $\mu$ and covariance $\Sigma$. The following identity holds.
\[
E_q\left[\nabla_z^2 h(z)\right] = E_q\left[\Sigma^{-1}(z-\mu)\,\nabla_z^T h(z)\right] \tag{10}
\]
Proof:
We define an auxiliary function $g_i(z) = \nabla_z^T h(z)\,e_i$. By applying Lemma 5 to $g_i(z)$, we have $E_q\left[\Sigma^{-1}(z-\mu)\,g_i(z)\right] = E_q\left[\nabla_z g_i(z)\right]$. Therefore, we have
\[
E_q\left[\Sigma^{-1}(z-\mu)\,\nabla_z^T h(z)\,e_i\right] = E_q\left[\Sigma^{-1}(z-\mu)\,g_i(z)\right] = E_q\left[\nabla_z g_i(z)\right] = E_q\left[\nabla_z\left(\nabla_z^T h(z)\,e_i\right)\right] = E_q\left[\nabla_z^2 h(z)\,e_i\right],
\]
which implies that
\[
E_q\left[\nabla_z^2 h(z)\right] = E_q\left[\Sigma^{-1}(z-\mu)\,\nabla_z^T h(z)\right]. \tag{11}
\]
$\square$

Now, it is time for us to prove Theorem 4.
Proof:
Note that all conditions of Lemma 11 are satisfied. By Lemma 11, we have
\[
E_q\left[\nabla_z^2 h(z)\right] = E_q\left[\Sigma^{-1}(z-\mu)\,\nabla_z^T h(z)\right]. \tag{12}
\]
Since all conditions of Lemma 10 are satisfied, by Lemma 10, we have
\[
\nabla_{\Sigma} E_q[h(z)] = \tfrac{1}{2}\,E_q\left[\Sigma^{-1}(z-\mu)\,\nabla_z^T h(z)\right]. \tag{13}
\]
Therefore, by (12) and (13) we have
\[
\nabla_{\Sigma} E_q[h(z)] = \tfrac{1}{2}\,E_q\left[\Sigma^{-1}(z-\mu)\,\nabla_z^T h(z)\right] = \tfrac{1}{2}\,E_q\left[\nabla_z^2 h(z)\right]. \qquad \square
\]

B. Gradient Identities for Univariate Continuous Exponential-family Distributions
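Before turning to the univariate exponential-family case, the Gaussian identities established in Appendix A can be sanity-checked numerically. The sketch below is ours, not code from the paper: the test function $h(z)=\exp(a^Tz)$ and all constants are illustrative choices. For this $h$, $E_q[h] = \exp(a^T\mu + \tfrac12 a^T\Sigma a) =: M$, so $\nabla_\mu E_q[h] = Ma = E_q[\nabla_z h]$ and $\nabla_\Sigma E_q[h] = \tfrac{M}{2}aa^T = \tfrac12 E_q[\nabla_z^2 h]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants (our choices), with h(z) = exp(a^T z) under q = N(mu, Sigma).
mu = np.array([0.2, -0.3])
Sigma = np.array([[1.0, 0.4], [0.4, 0.8]])
a = np.array([0.5, -0.2])
Sinv = np.linalg.inv(Sigma)
M = np.exp(a @ mu + 0.5 * a @ Sigma @ a)        # closed form for E_q[h]

Z = rng.multivariate_normal(mu, Sigma, size=1_000_000)
h = np.exp(Z @ a)

# Reparameterization estimators: grad_z h = a h(z), hess_z h = a a^T h(z).
mean_reparam = np.mean(h[:, None] * a, axis=0)   # estimates grad_mu E_q[h] = M a
cov_reparam = 0.5 * np.mean(h) * np.outer(a, a)  # estimates grad_Sigma E_q[h] = (M/2) a a^T

# First-order Stein estimator of the covariance gradient (Lemma 10):
#   (1/2) E_q[Sigma^{-1}(z - mu) grad_z^T h(z)].
s = ((Z - mu) @ Sinv * h[:, None]).mean(axis=0)  # E_q[h(z) Sigma^{-1}(z - mu)]
cov_stein = 0.5 * np.outer(s, a)

print(mean_reparam, M * a)
print(cov_reparam, cov_stein, 0.5 * M * np.outer(a, a))
```

All three covariance-gradient expressions agree up to Monte Carlo error, illustrating that the second-order identity can be estimated with only first derivatives of $h$.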
B.1. Proof of Theorem 5

Proof:
It is easy to verify that $\tilde{f}_i(z) = f_i(z)\,h(z)$ is locally AC since $f_i(z)$ and $h(z)$ are both locally AC. By Lemma 6, we have
\[
-E_q\left[\tilde{f}_i(z)\,\frac{\nabla_z q(z|\lambda_z)}{q(z|\lambda_z)}\right] = E_q\left[\nabla_z \tilde{f}_i(z)\right]. \tag{14}
\]
Notice that, by the product rule and the definition of $f_i(z)$,
\[
\nabla_z \tilde{f}_i(z) = f_i(z)\,\nabla_z h(z) + h(z)\,\nabla_z f_i(z) = f_i(z)\,\nabla_z h(z) + h(z)\,\frac{\nabla_{\lambda_i} q(z|\lambda_z)}{q(z|\lambda_z)} - \tilde{f}_i(z)\,\frac{\nabla_z q(z|\lambda_z)}{q(z|\lambda_z)},
\]
since $\nabla_z \nabla_{\lambda_i}\psi(z,\lambda_z) = \nabla_{\lambda_i} q(z|\lambda_z)$. The expression (14) can therefore be re-expressed as
\[
-E_q\left[\tilde{f}_i(z)\,\frac{\nabla_z q(z|\lambda_z)}{q(z|\lambda_z)}\right] = E_q\left[f_i(z)\,\nabla_z h(z) + h(z)\,\frac{\nabla_{\lambda_i} q(z|\lambda_z)}{q(z|\lambda_z)} - \tilde{f}_i(z)\,\frac{\nabla_z q(z|\lambda_z)}{q(z|\lambda_z)}\right], \tag{15}
\]
where the terms involving $\tilde{f}_i(z)$ on the two sides cancel. By (15), we have the following identity.
\[
E_q\left[h(z)\,\frac{\nabla_{\lambda_i} q(z|\lambda_z)}{q(z|\lambda_z)}\right] = -E_q\left[f_i(z)\,\nabla_z h(z)\right]
\]
Since we can interchange the integration and differentiation, we know that
\[
\nabla_{\lambda_i} E_q[h(z)] = E_q\left[h(z)\,\frac{\nabla_{\lambda_i} q(z|\lambda_z)}{q(z|\lambda_z)}\right] = -E_q\left[f_i(z)\,\nabla_z h(z)\right]. \qquad \square
\]

C. Gradient Identities for Gaussian Variance-mean Mixtures
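A quick univariate sanity check of Theorem 5 above, before moving on. The family and test function are our illustrative choices: for the exponential distribution $q(z|\lambda)=\lambda e^{-\lambda z}$ with CDF $\psi(z,\lambda)=1-e^{-\lambda z}$, one gets $f(z) = \nabla_\lambda \psi / q = z/\lambda$, and the theorem predicts $\nabla_\lambda E_q[h] = -E_q[f(z)\,h'(z)]$. With $h(z)=z^2$, $E_q[h]=2/\lambda^2$, so the gradient is $-4/\lambda^3$.

```python
import numpy as np

lam = 1.7
# Gauss-Laguerre nodes/weights for int_0^inf e^{-x} g(x) dx; substituting x = lam * z
# turns E_q[g(z)] into sum_k w_k g(x_k / lam).
x, w = np.polynomial.laguerre.laggauss(40)
z = x / lam

# Right-hand side of Theorem 5 by quadrature: -E_q[f(z) h'(z)] with f(z) = z / lam, h'(z) = 2 z.
rhs = -np.sum(w * (z / lam) * 2 * z)
# Left-hand side in closed form: d/dlam (2 / lam^2) = -4 / lam^3.
lhs = -4 / lam**3
print(lhs, rhs)
```

The quadrature is exact here because the integrand is a polynomial, so the two sides agree to machine precision.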
C.1. Proof of Theorem 6

Proof:
Let us consider the gradient identity for $\alpha$. By the assumptions, we can interchange the integration and differentiation to obtain the following expression.
\begin{align*}
\nabla_{\alpha} E_{q(z)}[h(z)] &= \int \nabla_{\alpha}\, q(z|\mu,\alpha,\Sigma)\,h(z)\,dz = \int \nabla_{\alpha}\left[\int q(w,z|\mu,\alpha,\Sigma)\,dw\right] h(z)\,dz \\
&= \int \left[\int \frac{u(w)}{v(w)}\,\Sigma^{-1}\left(z-\mu-u(w)\alpha\right) q(w,z|\mu,\alpha,\Sigma)\,dw\right] h(z)\,dz \\
&= E_{q(w,z)}\left[\frac{u(w)}{v(w)}\,\Sigma^{-1}\left(z-\mu-u(w)\alpha\right) h(z)\right]
\end{align*}
Recall that $q(z|w)$ is Gaussian, denoted by $q(z|w) := N(z\,|\,\mu+u(w)\alpha,\,v(w)\Sigma)$. By applying Lemma 5 to $u(w)h(z)$, we have
\begin{align*}
E_{q(w,z)}\left[\nabla_z\left(u(w)h(z)\right)\right] &= E_{q(w)}\left[E_{q(z|w)}\left[\nabla_z\left(u(w)h(z)\right)\right]\right] = E_{q(w)}\left[E_{q(z|w)}\left[\left(v(w)\Sigma\right)^{-1}\left(z-\mu-u(w)\alpha\right)\left(u(w)h(z)\right)\right]\right] \\
&= E_{q(w,z)}\left[\frac{u(w)}{v(w)}\,\Sigma^{-1}\left(z-\mu-u(w)\alpha\right) h(z)\right].
\end{align*}
Therefore, we have
\[
\nabla_{\alpha} E_{q(z)}[h(z)] = E_{q(w,z)}\left[u(w)\,\nabla_z h(z)\right].
\]
Similarly, we can show that $\nabla_{\mu} E_{q(z)}[h(z)] = E_{q(w,z)}\left[\nabla_z h(z)\right] = E_{q(z)}\left[\nabla_z h(z)\right]$. $\square$

C.2. Example 6.1
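A one-dimensional sanity check of Theorem 6 using the skew-Gaussian construction $z = \mu + |w|\alpha + \sigma\epsilon$ with $w,\epsilon \sim N(0,1)$, i.e. $u(w)=|w|$, $v(w)=1$. The constants and the test function $h(z)=z^2$ are our illustrative choices. Since $E|w|=\sqrt{2/\pi}$ and $E[w^2]=1$, both $\partial_\alpha E[z^2]$ and $E[|w|\,h'(z)]$ equal $2\mu\sqrt{2/\pi} + 2\alpha$ in closed form.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative constants (our choices).
mu, alpha, sigma = 0.4, 0.8, 1.1
n = 1_000_000
w = rng.standard_normal(n)
z = mu + np.abs(w) * alpha + sigma * rng.standard_normal(n)

# Theorem 6: d/dalpha E[h(z)] = E[u(w) h'(z)] with u(w) = |w| and h'(z) = 2 z.
mc = np.mean(np.abs(w) * 2 * z)
exact = 2 * mu * np.sqrt(2 / np.pi) + 2 * alpha
print(mc, exact)
```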
A concrete example is the multivariate skew Gaussian distribution:
\begin{align*}
q(w,z|\mu,\alpha,\Sigma) &:= N(z\,|\,\mu+|w|\alpha,\,\Sigma)\,N(w\,|\,0,1), \\
q(z|\mu,\alpha,\Sigma) &:= \int q(w,z|\mu,\alpha,\Sigma)\,dw = 2\,\Phi\!\left(\frac{(z-\mu)^T\Sigma^{-1}\alpha}{\sqrt{1+\alpha^T\Sigma^{-1}\alpha}}\right) N(z\,|\,\mu,\,\Sigma+\alpha\alpha^T),
\end{align*}
where $u(w)=|w|$ and $v(w)=1$. Furthermore, we have
\[
\int |w|\,q(w,z|\mu,\alpha,\Sigma)\,dw = u_1(z,\mu,\alpha,\Sigma)\,N(z|\mu,\Sigma) + u_2(z,\mu,\alpha,\Sigma)\,q(z|\mu,\alpha,\Sigma),
\]
where
\[
u_1(z,\mu,\alpha,\Sigma) = \frac{\sqrt{2/\pi}}{1+\alpha^T\Sigma^{-1}\alpha}, \qquad u_2(z,\mu,\alpha,\Sigma) = \frac{(z-\mu)^T\Sigma^{-1}\alpha}{1+\alpha^T\Sigma^{-1}\alpha}.
\]

C.3. Example 6.2
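As a sanity check of the skew-Gaussian example above, the sketch below (in $d=1$, with illustrative constants of our choosing) compares the closed-form marginal $q(z) = 2\,\Phi\big((z-\mu)\alpha/\sigma^2 / \sqrt{1+\alpha^2/\sigma^2}\big)\,N(z\,|\,\mu,\sigma^2+\alpha^2)$ against direct numerical integration of $N(z\,|\,\mu+|w|\alpha,\sigma^2)\,N(w\,|\,0,1)$ over $w$.

```python
import numpy as np
from math import erf

def npdf(x, m, v):
    # Gaussian density with mean m and variance v.
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def Phi(x):
    # Standard normal CDF.
    return 0.5 * (1 + erf(x / np.sqrt(2)))

mu, alpha, sigma2 = 0.3, 0.9, 1.2          # illustrative constants
w = np.linspace(-12, 12, 20001)
dw = w[1] - w[0]
errs = []
for z in (-1.0, 0.2, 1.5):
    direct = np.sum(npdf(z, mu + np.abs(w) * alpha, sigma2) * npdf(w, 0, 1)) * dw
    closed = 2 * Phi((z - mu) * alpha / sigma2 / np.sqrt(1 + alpha**2 / sigma2)) \
             * npdf(z, mu, sigma2 + alpha**2)
    errs.append(abs(direct - closed))
print(max(errs))                            # discrepancy at the test points
```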
Another example is the multivariate exponentially modified Gaussian distribution:
\begin{align*}
q(w,z|\mu,\alpha,\Sigma) &:= N(z\,|\,\mu+w\alpha,\,\Sigma)\,\mathrm{Exp}(w\,|\,1), \\
q(z|\mu,\alpha,\Sigma) &:= \int_0^{+\infty} q(w,z|\mu,\alpha,\Sigma)\,dw \\
&= \frac{\sqrt{2\pi}\,\det(2\pi\Sigma)^{-\frac12}}{\sqrt{\alpha^T\Sigma^{-1}\alpha}}\,\Phi\!\left(\frac{(z-\mu)^T\Sigma^{-1}\alpha-1}{\sqrt{\alpha^T\Sigma^{-1}\alpha}}\right)\exp\!\left(\frac{\left((z-\mu)^T\Sigma^{-1}\alpha-1\right)^2}{2\,\alpha^T\Sigma^{-1}\alpha} - \frac{(z-\mu)^T\Sigma^{-1}(z-\mu)}{2}\right),
\end{align*}
where $u(w)=w$ and $v(w)=1$. Furthermore, we have
\[
\int_0^{+\infty} w\,q(w,z|\mu,\alpha,\Sigma)\,dw = u_1(z,\mu,\alpha,\Sigma)\,N(z|\mu,\Sigma) + u_2(z,\mu,\alpha,\Sigma)\,q(z|\mu,\alpha,\Sigma), \tag{16}
\]
where
\[
u_1(z,\mu,\alpha,\Sigma) = \frac{1}{\alpha^T\Sigma^{-1}\alpha}, \qquad u_2(z,\mu,\alpha,\Sigma) = \frac{(z-\mu)^T\Sigma^{-1}\alpha-1}{\alpha^T\Sigma^{-1}\alpha}.
\]

C.4. Proof of Theorem 7

Proof:
Firstly, note that
\[
E_{q(w,z)}\left[v(w)\,\nabla_z^2 h(z)\right] = E_{q(w)}\left[E_{q(z|w)}\left[\nabla_z^2\left(v(w)h(z)\right)\right]\right].
\]
Conditioned on $w$, we know that $q(z|w) = N(z\,|\,\mu+u(w)\alpha,\,v(w)\Sigma)$ is Gaussian with mean $\mu+u(w)\alpha$ and covariance $v(w)\Sigma$. By applying Lemma 11 to $v(w)h(z)$, we have
\begin{align}
E_{q(w,z)}\left[v(w)\,\nabla_z^2 h(z)\right] &= E_{q(w)}\left[E_{q(z|w)}\left[\nabla_z^2\left(v(w)h(z)\right)\right]\right] = E_{q(w)}\left[E_{q(z|w)}\left[\left(v(w)\Sigma\right)^{-1}\left(z-\mu-u(w)\alpha\right)\nabla_z^T\left(v(w)h(z)\right)\right]\right] \nonumber \\
&= E_{q(w,z)}\left[\Sigma^{-1}\left(z-\mu-u(w)\alpha\right)\nabla_z^T h(z)\right]. \tag{17}
\end{align}
Recall that $q(z|w) = N(z\,|\,\underbrace{\mu+u(w)\alpha}_{\hat\mu},\,\underbrace{v(w)\Sigma}_{\hat\Sigma})$ is Gaussian. By applying Lemma 9 to $v(w)h(z)$, we have the following expression.
\begin{align}
&E_{q(w,z)}\left[\Sigma^{-1}\left[v^{-1}(w)\left(z-\mu-u(w)\alpha\right)\left(z-\mu-u(w)\alpha\right)^T - \Sigma\right]\Sigma^{-1}\,h(z)\right] \nonumber \\
&\quad= E_{q(w)}\left[E_{q(z|w)}\left[\left(v(w)\Sigma\right)^{-1}\left[\left(z-\mu-u(w)\alpha\right)\left(z-\mu-u(w)\alpha\right)^T - v(w)\Sigma\right]\left(v(w)\Sigma\right)^{-1}\left(v(w)h(z)\right)\right]\right] \nonumber \\
&\quad= E_{q(w)}\left[E_{q(z|w)}\left[\hat\Sigma^{-1}\left[\left(z-\hat\mu\right)\left(z-\hat\mu\right)^T - \hat\Sigma\right]\hat\Sigma^{-1}\left(v(w)h(z)\right)\right]\right] = E_{q(w)}\left[E_{q(z|w)}\left[\hat\Sigma^{-1}\left(z-\hat\mu\right)\nabla_z^T\left(v(w)h(z)\right)\right]\right] \nonumber \\
&\quad= E_{q(w,z)}\left[\Sigma^{-1}\left(z-\mu-u(w)\alpha\right)\nabla_z^T h(z)\right]. \tag{18}
\end{align}
By the regular assumptions, we can swap the integration and differentiation to get the following expression.
\begin{align}
\nabla_{\Sigma} E_{q(z)}[h(z)] &= \int \nabla_{\Sigma}\, q(z|\mu,\alpha,\Sigma)\,h(z)\,dz = \int\!\!\int \left[\nabla_{\Sigma} N(z\,|\,\mu+u(w)\alpha,\,v(w)\Sigma)\right] q(w)\,dw\,h(z)\,dz \nonumber \\
&= \tfrac{1}{2}\,E_{q(w,z)}\left[\Sigma^{-1}\left[v^{-1}(w)\left(z-\mu-u(w)\alpha\right)\left(z-\mu-u(w)\alpha\right)^T - \Sigma\right]\Sigma^{-1}\,h(z)\right]. \tag{19}
\end{align}
Finally, by (17), (18), and (19), we have
\[
\nabla_{\Sigma} E_{q(z)}[h(z)] = \tfrac{1}{2}\,E_{q(w,z)}\left[\Sigma^{-1}\left(z-\mu-u(w)\alpha\right)\nabla_z^T h(z)\right] = \tfrac{1}{2}\,E_{q(w,z)}\left[v(w)\,\nabla_z^2 h(z)\right]. \qquad \square
\]

C.5. Example 7.1
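A one-dimensional instance of Theorem 7 in the Student's t setting ($u(w)=0$, $v(w)=w$, $w \sim \mathrm{IG}(\beta,\beta)$) can be checked by hand: with the illustrative test function $h(z)=z^2$, $E[z^2] = \mu^2 + \sigma^2\,\beta/(\beta-1)$, so $\partial_{\sigma^2} E[h] = \beta/(\beta-1)$, while $\tfrac12 E[v(w)\,h''(z)] = E[w] = \beta/(\beta-1)$. The sketch below (our constants) confirms the latter expectation by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(3)

beta = 3.0                                       # beta > 1 so the variance exists
n = 1_000_000
# If Y ~ Gamma(shape beta, rate beta), then 1/Y ~ InverseGamma(beta, beta) with mean beta/(beta-1).
w = 1.0 / rng.gamma(beta, 1.0 / beta, size=n)

mc = 0.5 * np.mean(w * 2.0)                      # (1/2) E[v(w) h''(z)] with h'' = 2
exact = beta / (beta - 1)                        # d/dsigma2 E[h] in closed form
print(mc, exact)
```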
A concrete example is the multivariate Student's t-distribution with a fixed degree of freedom. We consider the case when $\beta > 1$, since the variance does not exist when $\beta \le 1$.
\begin{align*}
q(w,z|\mu,\alpha,\Sigma) &:= N(z\,|\,\mu,\,w\Sigma)\,\mathrm{IG}(w\,|\,\beta,\beta), \\
q(z|\mu,\alpha,\Sigma) &:= \int q(w,z|\mu,\alpha,\Sigma)\,dw = \frac{\det(\pi\Sigma)^{-\frac12}\,\Gamma(\beta+d/2)\left(2\beta+(z-\mu)^T\Sigma^{-1}(z-\mu)\right)^{-\beta-d/2}}{\Gamma(\beta)\,(2\beta)^{-\beta}},
\end{align*}
where $u(w)=0$ and $v(w)=w>0$ since $w$ is generated from the inverse Gamma distribution $\mathrm{IG}(w|\beta,\beta)$. When $\beta>1$, we have
\[
\int w\,q(w,z|\mu,\alpha,\Sigma)\,dw = v_1(z,\mu,\alpha,\Sigma)\,q(z|\mu,\alpha,\Sigma), \tag{20}
\]
where
\[
v_1(z,\mu,\alpha,\Sigma) := \left(\beta+d/2-1\right)^{-1}\left(\beta+\tfrac{1}{2}(z-\mu)^T\Sigma^{-1}(z-\mu)\right).
\]

C.6. Example 7.2
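The Student's t formulas above can be checked numerically in $d=1$ (illustrative constants of our choosing): both the closed-form marginal and the conditional-moment relation (20) are compared against direct numerical integration of $N(z\,|\,\mu, w\sigma^2)\,\mathrm{IG}(w\,|\,\beta,\beta)$ over $w$.

```python
import numpy as np
from math import gamma

def npdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def ig_pdf(w, a, b):
    # Inverse-gamma density with shape a and scale b.
    return b**a / gamma(a) * w**(-a - 1) * np.exp(-b / w)

mu, sigma2, beta, d = 0.1, 1.3, 2.5, 1        # illustrative constants
w = np.linspace(1e-4, 200, 2000001)
dw = w[1] - w[0]
errs = []
for z in (-0.5, 0.4, 2.0):
    joint = npdf(z, mu, w * sigma2) * ig_pdf(w, beta, beta)
    direct = np.sum(joint) * dw               # int q(w, z) dw
    Q = (z - mu) ** 2 / sigma2
    closed = (np.pi * sigma2) ** -0.5 * gamma(beta + d / 2) \
             * (2 * beta + Q) ** (-beta - d / 2) / (gamma(beta) * (2 * beta) ** -beta)
    errs.append(abs(direct - closed))
    # Relation (20): int w q(w, z) dw = v1(z) q(z).
    v1 = (beta + Q / 2) / (beta + d / 2 - 1)
    errs.append(abs(np.sum(w * joint) * dw - v1 * closed))
print(max(errs))
```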
Another example is the multivariate normal inverse-Gaussian distribution, where $\beta>0$ is fixed:
\begin{align*}
q(w,z|\mu,\alpha,\Sigma) &:= N(z\,|\,\mu+w\alpha,\,w\Sigma)\,\mathrm{InvGauss}(w\,|\,1,\beta), \\
q(z|\mu,\alpha,\Sigma) &:= \int q(w,z|\mu,\alpha,\Sigma)\,dw \\
&= 2\sqrt{\beta}\,(2\pi)^{-\frac{d+1}{2}}\det(\Sigma)^{-\frac12}\exp\!\left[(z-\mu)^T\Sigma^{-1}\alpha+\beta\right] \\
&\qquad\times K_{\frac{d+1}{2}}\!\left(\sqrt{\left(\alpha^T\Sigma^{-1}\alpha+\beta\right)\left((z-\mu)^T\Sigma^{-1}(z-\mu)+\beta\right)}\right)\left(\sqrt{\frac{\alpha^T\Sigma^{-1}\alpha+\beta}{(z-\mu)^T\Sigma^{-1}(z-\mu)+\beta}}\right)^{\frac{d+1}{2}},
\end{align*}
where $K_\nu$ denotes the modified Bessel function of the second kind, and $u(w)=v(w)=w>0$ since $w$ is generated from the inverse Gaussian distribution
\[
\mathrm{InvGauss}(w\,|\,1,\beta) = \left(\frac{\beta}{2\pi w^3}\right)^{\frac12}\exp\!\left\{-\frac{\beta}{2}\left(w+w^{-1}\right)+\beta\right\}.
\]
We have
\[
\int w\,q(w,z|\mu,\alpha,\Sigma)\,dw = v_1(z,\mu,\alpha,\Sigma)\,q(z|\mu,\alpha,\Sigma), \tag{21}
\]
where
\[
v_1(z,\mu,\alpha,\Sigma) := \sqrt{\frac{(z-\mu)^T\Sigma^{-1}(z-\mu)+\beta}{\alpha^T\Sigma^{-1}\alpha+\beta}}\;\frac{K_{\frac{d-1}{2}}\!\left(\sqrt{\left(\alpha^T\Sigma^{-1}\alpha+\beta\right)\left((z-\mu)^T\Sigma^{-1}(z-\mu)+\beta\right)}\right)}{K_{\frac{d+1}{2}}\!\left(\sqrt{\left(\alpha^T\Sigma^{-1}\alpha+\beta\right)\left((z-\mu)^T\Sigma^{-1}(z-\mu)+\beta\right)}\right)}.
\]

D. Gradient Identities for Continuous Exponential-family Mixtures
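The normal inverse-Gaussian marginal above can also be verified numerically in $d=1$ (a sketch with illustrative constants). To keep the check NumPy-only, $K_\nu$ is evaluated via its integral representation $K_\nu(x)=\int_0^\infty e^{-x\cosh t}\cosh(\nu t)\,dt$.

```python
import numpy as np

def npdf(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

def invgauss_pdf(w, b):
    # InvGauss(w | 1, b): inverse Gaussian with unit mean and shape b.
    return np.sqrt(b / (2 * np.pi * w**3)) * np.exp(-b * (w - 1) ** 2 / (2 * w))

def bessel_k(nu, x):
    # Modified Bessel function of the second kind via its integral representation.
    t = np.linspace(0, 30, 300001)
    return np.sum(np.exp(-x * np.cosh(t)) * np.cosh(nu * t)) * (t[1] - t[0])

mu, alpha, sigma2, beta = 0.2, 0.5, 1.1, 2.0   # illustrative constants
w = np.linspace(1e-5, 60, 600001)
dw = w[1] - w[0]
errs = []
for z in (-0.8, 0.3, 1.4):
    direct = np.sum(npdf(z, mu + w * alpha, w * sigma2) * invgauss_pdf(w, beta)) * dw
    Q = (z - mu) ** 2 / sigma2
    ap, bp = Q + beta, alpha**2 / sigma2 + beta
    closed = 2 * np.sqrt(beta) / (2 * np.pi) / np.sqrt(sigma2) \
             * np.exp((z - mu) * alpha / sigma2 + beta) \
             * bessel_k(1.0, np.sqrt(ap * bp)) * np.sqrt(bp / ap)   # d = 1 marginal
    errs.append(abs(direct - closed))
print(max(errs))
```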
D.1. Proof of Theorem 8

Proof:
It is easy to verify that $\tilde f_{i,j}(z_j) := f_{i,j}(z_j, z_{-j})\,h(z_j, z_{-j})\prod_{k\ge j+1} q(z_k|z_{k-1},\lambda)$ is locally AC since $h(z_j,z_{-j})$, $f_{i,j}(z_j,z_{-j})$, and $q(z_k|z_{k-1},\lambda)$ are all locally AC w.r.t. $z_j$ for almost every $z_{-j}$. By the definitions, we have
\[
\left(\nabla_z \Psi(z,\lambda)\right)^{-1} = \begin{bmatrix} \dfrac{1}{q(z_1|\lambda)} & 0 \\[2mm] -\dfrac{\nabla_{z_1}\psi(z_2,z_1,\lambda)}{q(z_1|\lambda)\,q(z_2|z_1,\lambda)} & \dfrac{1}{q(z_2|z_1,\lambda)} \end{bmatrix},
\]
\[
f_{i,1}(z_1) = \frac{\nabla_{\lambda_i}\psi(z_1,\lambda)}{q(z_1|\lambda)}, \qquad f_{i,2}(z) = -f_{i,1}(z_1)\,\frac{\nabla_{z_1}\psi(z_2,z_1,\lambda)}{q(z_2|z_1,\lambda)} + \frac{\nabla_{\lambda_i}\psi(z_2,z_1,\lambda)}{q(z_2|z_1,\lambda)}.
\]
Note that the following expression holds almost everywhere due to the product rule for locally AC functions.
\[
\nabla_{z_1} f_{i,1}(z_1) = \frac{\nabla_{\lambda_i} q(z_1|\lambda)}{q(z_1|\lambda)} - f_{i,1}(z_1)\,\frac{\nabla_{z_1} q(z_1|\lambda)}{q(z_1|\lambda)} \tag{22}
\]
Recall that $\tilde f_{i,1}(z) = f_{i,1}(z_1)\,h(z)\,q(z_2|z_1,\lambda)$. Therefore, by Lemma 6 applied w.r.t. $z_1$, we have
\begin{align*}
-E_{q(z_1)}\left[q(z_2|z_1,\lambda)\,h(z)\,f_{i,1}(z_1)\,\frac{\nabla_{z_1} q(z_1|\lambda)}{q(z_1|\lambda)}\right] &= E_{q(z_1)}\left[\nabla_{z_1}\left[q(z_2|z_1,\lambda)\,h(z)\,f_{i,1}(z_1)\right]\right] \\
&= E_{q(z_1)}\left[h(z)\,f_{i,1}(z_1)\,\nabla_{z_1} q(z_2|z_1,\lambda) + q(z_2|z_1,\lambda)\,f_{i,1}(z_1)\,\nabla_{z_1} h(z) + q(z_2|z_1,\lambda)\,h(z)\,\nabla_{z_1} f_{i,1}(z_1)\right] \\
&= E_{q(z_1)}\bigg[h(z)\,f_{i,1}(z_1)\,\nabla_{z_1} q(z_2|z_1,\lambda) + q(z_2|z_1,\lambda)\,f_{i,1}(z_1)\,\nabla_{z_1} h(z) \\
&\qquad\qquad + q(z_2|z_1,\lambda)\,h(z)\left(\frac{\nabla_{\lambda_i} q(z_1|\lambda)}{q(z_1|\lambda)} - f_{i,1}(z_1)\,\frac{\nabla_{z_1} q(z_1|\lambda)}{q(z_1|\lambda)}\right)\bigg],
\end{align*}
where we obtain the last equation by (22). The term $-q(z_2|z_1,\lambda)\,h(z)\,f_{i,1}(z_1)\,\nabla_{z_1} q(z_1|\lambda)/q(z_1|\lambda)$ appears on both sides and cancels, so the above expression gives the following identity.
\[
0 = E_{q(z_1)}\left[h(z)\,f_{i,1}(z_1)\,\nabla_{z_1} q(z_2|z_1,\lambda) + q(z_2|z_1,\lambda)\,f_{i,1}(z_1)\,\nabla_{z_1} h(z) + q(z_2|z_1,\lambda)\,h(z)\,\frac{\nabla_{\lambda_i} q(z_1|\lambda)}{q(z_1|\lambda)}\right] \tag{23}
\]
Likewise, the following expression holds for almost every $z$ due to the product rule for locally AC functions.
\[
\nabla_{z_2} f_{i,2}(z) = -f_{i,1}(z_1)\,\frac{\nabla_{z_1} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)} - f_{i,2}(z)\,\frac{\nabla_{z_2} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)} + \frac{\nabla_{\lambda_i} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)} \tag{24}
\]
Note that $\tilde f_{i,2}(z) = f_{i,2}(z)\,h(z)$.
By Lemma 6, applied w.r.t. $z_2$, we have
\begin{align*}
-E_{q(z_1)}\left[E_{q(z_2|z_1)}\left[h(z)\,f_{i,2}(z)\,\frac{\nabla_{z_2} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)}\right]\right] &= E_{q(z_1)}\left[E_{q(z_2|z_1)}\left[\nabla_{z_2}\left[h(z)\,f_{i,2}(z)\right]\right]\right] = E_{q(z_1)}\left[E_{q(z_2|z_1)}\left[f_{i,2}(z)\,\nabla_{z_2} h(z) + h(z)\,\nabla_{z_2} f_{i,2}(z)\right]\right] \\
&= E_{q(z_1)}\bigg[E_{q(z_2|z_1)}\bigg[f_{i,2}(z)\,\nabla_{z_2} h(z) \\
&\qquad + h(z)\left(-f_{i,1}(z_1)\,\frac{\nabla_{z_1} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)} - f_{i,2}(z)\,\frac{\nabla_{z_2} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)} + \frac{\nabla_{\lambda_i} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)}\right)\bigg]\bigg],
\end{align*}
where we obtain the last equation by (24). The term $-h(z)\,f_{i,2}(z)\,\nabla_{z_2} q(z_2|z_1,\lambda)/q(z_2|z_1,\lambda)$ appears on both sides and cancels, so the above expression gives the following identity.
\[
0 = E_{q(z_1)q(z_2|z_1)}\left[f_{i,2}(z)\,\nabla_{z_2} h(z) + h(z)\left(-f_{i,1}(z_1)\,\frac{\nabla_{z_1} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)} + \frac{\nabla_{\lambda_i} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)}\right)\right] \tag{25}
\]
Taking the expectation of (23) w.r.t. $q(z_2|z_1,\lambda)$ and adding (25), the terms involving $\nabla_{z_1} q(z_2|z_1,\lambda)$ cancel, and we have
\[
0 = E_{q(z_1)q(z_2|z_1)}\left[f_{i,1}(z_1)\,\nabla_{z_1} h(z) + f_{i,2}(z)\,\nabla_{z_2} h(z) + h(z)\left(\frac{\nabla_{\lambda_i} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)} + \frac{\nabla_{\lambda_i} q(z_1|\lambda)}{q(z_1|\lambda)}\right)\right]. \tag{26}
\]
Therefore, by (26), we have the following identity.
\[
E_{q(z_1,z_2)}\left[h(z)\left(\frac{\nabla_{\lambda_i} q(z_1|\lambda)}{q(z_1|\lambda)} + \frac{\nabla_{\lambda_i} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)}\right)\right] = -E_{q(z_1,z_2)}\left[f_{i,1}(z_1)\,\nabla_{z_1} h(z) + f_{i,2}(z)\,\nabla_{z_2} h(z)\right]
\]
Since we can interchange the integration and differentiation, we know that
\[
\nabla_{\lambda_i} E_q[h(z)] = E_q\left[h(z)\left(\frac{\nabla_{\lambda_i} q(z_1|\lambda)}{q(z_1|\lambda)} + \frac{\nabla_{\lambda_i} q(z_2|z_1,\lambda)}{q(z_2|z_1,\lambda)}\right)\right] = -E_q\left[f_{i,1}(z_1)\,\nabla_{z_1} h(z) + f_{i,2}(z)\,\nabla_{z_2} h(z)\right],
\]
which gives the desired identity. $\square$
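The two-level identity above can be illustrated on a toy Gaussian chain (our illustrative setup, not an example from the paper): $z_1 \sim N(\lambda,1)$, $z_2|z_1 \sim N(z_1,1)$, with conditional CDFs $\psi(z_1,\lambda)=\Phi(z_1-\lambda)$ and $\psi(z_2,z_1,\lambda)=\Phi(z_2-z_1)$. Then $f_1(z_1) = \nabla_\lambda\Phi(z_1-\lambda)/q(z_1|\lambda) = -1$ and, by the chain construction, $f_2(z) = -f_1\,\nabla_{z_1}\Phi(z_2-z_1)/q(z_2|z_1) + 0 = -1$, so the theorem predicts $\nabla_\lambda E[h(z)] = E[\partial h/\partial z_1 + \partial h/\partial z_2]$. With $h(z)=z_2^2$, $E[z_2^2]=\lambda^2+2$, so the gradient is $2\lambda$.

```python
import numpy as np

rng = np.random.default_rng(4)

lam = 0.7                                # illustrative value
n = 1_000_000
z1 = lam + rng.standard_normal(n)        # z1 ~ N(lam, 1)
z2 = z1 + rng.standard_normal(n)         # z2 | z1 ~ N(z1, 1)

# E[dh/dz1 + dh/dz2] with dh/dz1 = 0 and dh/dz2 = 2 z2.
mc = 2 * np.mean(z2)
exact = 2 * lam                          # d/dlam (lam^2 + 2)
print(mc, exact)
```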