Variational Inference as Iterative Projection in a Bayesian Hilbert Space
Timothy D. Barfoot, Institute for Aerospace Studies, University of Toronto, [email protected]
Gabriele M. T. D'Eleuterio, Institute for Aerospace Studies, University of Toronto, [email protected]

Abstract
Variational Bayesian inference is an important machine-learning tool that finds application from statistics to robotics. The goal is to find an approximate probability density function (PDF) from a chosen family that is in some sense 'closest' to the full Bayesian posterior. Closeness is typically defined through the selection of an appropriate loss functional such as the Kullback-Leibler (KL) divergence. In this paper, we explore a new formulation of variational inference by exploiting the fact that the set of PDFs constitutes a Bayesian Hilbert space under careful definitions of vector addition, scalar multiplication and an inner product. We show that variational inference based on KL divergence then amounts to an iterative projection of the Bayesian posterior onto a subspace corresponding to the selected approximation family. In fact, the inner product chosen for the Bayesian Hilbert space suggests the definition of a new measure of the information contained in a PDF and in turn a new divergence is introduced. Each step in the iterative projection is equivalent to a local minimization of this divergence. We present an example Bayesian subspace based on exponentiated Hermite polynomials as well as work through the details of this general framework for the specific case of the multivariate Gaussian approximation family and show the equivalence to another Gaussian variational inference approach. We furthermore discuss the implications for systems that exhibit sparsity, which is handled naturally in Bayesian space.

Keywords: Aitchison geometry · Bayesian Hilbert spaces · variational inference · Bayesian inference · compositional data · stochastic algebra

In 1763, Richard Price published on behalf of his recently deceased friend, the Reverend Thomas Bayes, a paper that introduced what would become the atomic element of probabilistic inference: Bayes' rule (Bayes, 1763). The paper though was widely ignored.
About a decade later, the same rule was discovered by Pierre-Simon Laplace and, while Laplace laid its foundations, the theory of inference based on this rule became known as Bayesian inference. So confident was Laplace in the theory that he famously calculated the odds at 11,000 to 1 that the mass of Saturn as determined by a former student was correct to within 1%, and 1,000,000-to-1 odds on the mass of Jupiter (Laplace, 1995, translated from 1825 French edition, pp. 46-47). (Based on the most recent available data, he would have collected on the bet on Saturn.) Bayesian inference has been used in a great variety of applications from Henri Poincaré's defense of Captain Dreyfus to Alan Turing's breaking of the Enigma code (McGrayne, 2011). In modern day, it provides the crucial framework for inference in such fields as statistics, decision theory, computational neuroscience, machine learning, computer vision, state estimation and robotics.

The objective common to all these applications is the determination of a posterior probability to test some hypothesis or to calculate some estimate based on prior information and observed measurements. However, it is not always possible to find the posterior exactly. Indeed, we must often resort to approximate techniques. One such technique, which will occupy us here, is that of variational inference or variational Bayes (Bishop, 2006). In this variational approach, the goal is to find the probability density function that comes closest to the posterior as determined by minimizing a loss functional subject to the constraint that the distribution sought be drawn from a tractable class of densities, for example, where the posterior has to take the form of a Gaussian distribution.
A common choice for the loss functional is the Kullback-Leibler divergence (Csiszár, 1975; Hinton and van Camp, 1993; Jordan et al., 1999; Bishop, 2006; Barber, 2012; Amari, 2016; Blei et al., 2017), although others such as Bregman (Adamčík, 2014; Painsky and Wornell, 2020), Wasserstein (Ambrogioni et al., 2018) and Rényi divergences (Li and Turner, 2016) have been used.

The field of variational inference based on the KL divergence is already well trodden, although the research is hardly exhausted. The chosen class of densities from which the approximate posterior is to be shaped is key to variational inference. In the mean-field approximation, for example, the solution to the minimization of the divergence is constructed as a product of densities from a chosen family of admissible functions such as a Bayesian mixture of Gaussians (Bishop, 2006). Another possibility is using Bayesian mixtures of exponential families (Wainwright and Jordan, 2008; Amari, 2016). A number of algorithms by which to execute the minimization exist, including the variational EM algorithm, natural gradient descent and Gaussian variational inference.

Jordan et al. (1999) observed that "there is not as yet a systematic algebra that allows particular variational transformations to be matched optimally to particular graphical models." While this was written two decades ago and specifically about graphical models, the remark finds resonance in the present work.

Our aim is to introduce a kind of information algebra to variational inference that not only provides a convenient and effective framework for analysis but also reveals key relationships to past work. This algebra has its origins in the work of Aitchison (1982, 1986) on compositional data in statistics. Compositional data can be represented on a simplex as with probability distributions for a finite set of discrete events.
The resulting Aitchison geometry or Aitchison simplex establishes a vector space, in which vector addition is a normalized multiplication (perturbation) and scalar multiplication is a normalized exponentiation (powering). With an appropriate inner product, the set of PDFs over a finite discrete space was formalized as a Hilbert space by Pawlowsky-Glahn and Egozcue (2001) and independently investigated by Barfoot (2002) and Barfoot and D'Eleuterio (2002) in their stochastic algebra. The extension to continuous variables was first published by Egozcue et al. (2006) and also studied by Barfoot and D'Eleuterio (2003) for the case of finite domains. The generalization to include probabilities and measures on the whole real line was made by van den Boogaart et al. (2010, 2014). These spaces are called
Bayesian Hilbert spaces.

In such a space, Bayes' venerated rule becomes

p(x|z) = p(z|x) ⊕ p(x),    (1)

where ⊕ indicates vector addition. (The normalization inherent in the operation accounts for the marginal p(z) automatically.) Each new measurement made to refine the posterior becomes one more term added to the sum. It is this linear feature of a Bayesian Hilbert space that makes the structure ideally suited to variational inference.

The set of Gaussians, in an appropriately extended sense, constitutes a subspace of the space of PDFs, as do exponential families. An arbitrary PDF in one of these subspaces can be expressed in the simple and usual manner as a linear combination of a basis for the subspace. The problem of variational inference can thus be expressed as the minimization of a divergence over a set of Fourier coefficients.

The linear-algebraic structure of these spaces affords us a new perspective and provides new insight. We show, for example, that the solution to variational inference based on the KL divergence can be viewed as an iterative projection, in the Euclidean sense, onto a given subspace of densities. Indeed, this algorithm is essentially a Newton-like iteration scheme to solve for the minimum of the divergence, having a form identical to the natural-gradient-descent technique of Amari (1998). Moreover, using a subspace of Gaussians reproduces the recent results of Barfoot et al. (2020).

We also introduce a new information measure using a norm for the space of PDFs. This allows for a metric to be defined on the space, which can be interpreted as the distance between two PDFs. A new (symmetric and quadratic) divergence between PDFs can be based on the distance metric. It is notable that each step in our iterative-projection scheme is a local minimization of this divergence.

We shall begin with an overview of Bayesian Hilbert spaces in the next section.
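As a concrete numerical sketch (ours, not code from the paper), the vector-space operations behind (1) can be realized on a discretized domain, with each PDF stored as a vector of density values on a grid. Bayes' rule then really is an addition, and the marginal p(z) never needs to be computed explicitly:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)  # truncated grid standing in for the domain X

def normalize(p):
    """The normalization operator: rescale p so it integrates to one on the grid."""
    return p / np.trapz(p, x)

def oplus(p1, p2):
    """Vector addition: normalized pointwise multiplication (perturbation)."""
    return normalize(p1 * p2)

def smul(alpha, p):
    """Scalar multiplication: normalized exponentiation (powering)."""
    return normalize(p ** alpha)

def ominus(p1, p2):
    """Vector subtraction: p1 combined with (-1) times p2."""
    return oplus(p1, smul(-1.0, p2))

def gauss(mu, var):
    return normalize(np.exp(-0.5 * (x - mu) ** 2 / var))

# Bayes' rule as vector addition: posterior = likelihood (+) prior
prior = gauss(0.0, 4.0)               # p(x)
likelihood = gauss(1.0, 1.0)          # p(z|x), viewed as a function of x
posterior = oplus(likelihood, prior)  # p(x|z), normalization automatic

# closed-form check for the Gaussian case: precisions add,
# precision-weighted means add
var_post = 1.0 / (1.0 / 4.0 + 1.0)
mu_post = var_post * (0.0 / 4.0 + 1.0)
assert np.allclose(posterior, gauss(mu_post, var_post), atol=1e-9)
```

Because Gaussians are closed under perturbation, the numerical posterior matches the closed-form precision-weighted combination; subtracting the prior back off the posterior recovers the (normalized) likelihood, illustrating the Radon-Nikodym interpretation of subtraction mentioned below.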
In §3, we discuss subspaces and bases, including exponentiated Hermite polynomials and Gaussian distributions. The variational inference problem for the KL divergence as viewed from the purchase of a Bayesian Hilbert space is considered in §4. The specific case of using a Gaussian subspace, that is, Gaussian variational inference, is treated in §5. Some numerical examples are provided in §6. We discuss the implications of sparsity in §7 and end with a few concluding remarks.

Let us consider some domain X for our PDFs, e.g., ℝᴺ; we shall refer to x ∈ X as the state. A PDF p(x) assigns a nonnegative, finite value to each element of X such that

∫_X p(x) dx = 1.    (2)

We define the set of all such bounded PDFs to be

P = { p(x) | (∀x ∈ X) 0 ≤ p(x) < C, ∫_X p(x) dx = 1 },    (3)

for some finite C. As a technicality, we must also compel each p only to take on the value zero on a set of measure zero, but this will not be of major practical concern.

We define vector addition, ⊕ : P × P → P, between two elements p₁, p₂ ∈ P to be

p₁ ⊕ p₂ = p₁(x)p₂(x) / ∫_X p₁(x)p₂(x) dx    (4)

and scalar multiplication, · : ℝ × P → P, of p ∈ P by α ∈ ℝ as

α · p = pᵅ(x) / ∫_X pᵅ(x) dx.    (5)

With these operations, P is established as a vector space, termed a Bayesian linear space, over the field ℝ (van den Boogaart et al., 2010).

Vector subtraction is defined in the usual way,

p₁ ⊖ p₂ = p₁ ⊕ ((−1) · p₂).    (6)

We note that subtraction, or the inverse additive operation, is equivalent to the Radon-Nikodym derivative (van den Boogaart et al., 2010).

It will sometimes be convenient to avoid being explicit about the normalization constant for a PDF and so we define the normalization operator,

↓p = p(x) / ∫_X p(x) dx ∈ P.
(7)

Vector addition and scalar multiplication can then be simply expressed as

p₁ ⊕ p₂ = ↓ p₁(x)p₂(x),   α · p = ↓ pᵅ(x).    (8)

As mentioned above, Bayes' rule can be rendered as p(x|z) = p(z|x) ⊕ p(x). The normalizing marginal p(z) is invisible to the operation.

Inner Product.
We endow the vector space with an inner product defined as

⟨p₁, p₂⟩ = (1/2) ∫_X ∫_X ln(p₁(x)/p₁(y)) ln(p₂(x)/p₂(y)) ν(x)ν(y) dx dy,    (9)

where ν(·) is a density function corresponding to an appropriate measure for X.

If we take the measure ν to be also an element of P, which we may think of as a reference PDF, then we can write the inner product in (9) as

⟨p₁, p₂⟩ = E_ν[ln p₁ ln p₂] − E_ν[ln p₁] E_ν[ln p₂],    (10)

where E_ν[·] is the expectation with respect to ν. In this work, we shall always take the measure to be a PDF; however, we shall refer to it as the measure to distinguish it from the other densities involved.

Let us further refine P to require the PDFs to be square-log integrable, that is,

∫_X |ln p|² ν(x) dx < ∞.    (11)

Following van den Boogaart et al. (2014), then, we can claim that P with inner product (9) forms a separable Hilbert space. This is called a Bayesian Hilbert space. We shall sometimes briefly refer to it as a
Bayesian space.
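A quick numerical check of (10) (our illustration, using Gauss-Hermite quadrature for the expectations): the inner product is simply the covariance of the log-densities under ν, so additive normalization constants drop out, and for two Gaussians with ν = N(0, 1) it reduces to a closed form one can derive by expanding the quadratic log-densities:

```python
import numpy as np

# Gauss-Hermite rule: E_nu[f] = (1/sqrt(pi)) * sum_k w_k f(sqrt(2) t_k), nu = N(0,1)
t, w = np.polynomial.hermite.hermgauss(60)
nodes, weights = np.sqrt(2.0) * t, w / np.sqrt(np.pi)

def expect(f):
    return np.sum(weights * f(nodes))

def inner(logp1, logp2):
    """<p1, p2> = E_nu[ln p1 ln p2] - E_nu[ln p1] E_nu[ln p2] = Cov_nu(ln p1, ln p2)."""
    return expect(lambda s: logp1(s) * logp2(s)) - expect(logp1) * expect(logp2)

def log_gauss(mu, var):
    # an unnormalized log-density suffices: covariance ignores additive constants
    return lambda s: -0.5 * (s - mu) ** 2 / var

# closed form for two Gaussians under nu = N(0,1): (1 + 2 mu1 mu2) / (2 v1 v2)
mu1, v1, mu2, v2 = 0.5, 2.0, 1.0, 1.5
val = inner(log_gauss(mu1, v1), log_gauss(mu2, v2))
assert np.isclose(val, (1.0 + 2.0 * mu1 * mu2) / (2.0 * v1 * v2))
```

Setting the two arguments equal gives the squared norm, so for p = N(µ, σ²) half of ⟨p, p⟩ works out to (1 + 2µ²)/(4σ⁴), the quantity that appears as the information I(p) in the Gaussian example below.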
Information and Divergence.
The norm of p ∈ P can be taken as ‖p‖ = ⟨p, p⟩^(1/2). Accordingly, we can define the distance between two PDFs, p and q, simply as d(p, q) = ‖p ⊖ q‖, which induces a metric on the Bayesian space.

The norm of p can be used to express the information content of the PDF. In fact, we shall define

I(p) = (1/2)‖p‖² = (1/2)⟨p, p⟩    (12)

as the information in p. (The reason for the factor of 1/2 will become evident.) As an example, consider p = N(µ, σ²) (over the domain ℝ) and measure ν = N(0, 1). The information is I(p) = (1 + 2µ²)/(4σ⁴). The smaller the variance the larger the information, indicating that the PDF concentrates the random variable more tightly about its mean; that is, we know better where to expect the random variable so we may say that we have more information about it.

We shall furthermore find it useful to define a new divergence between two PDFs p and q as

I(p ⊖ q) = (1/2)⟨p ⊖ q, p ⊖ q⟩.    (13)

This is the information contained in the difference of p and q. Unlike the Kullback-Leibler divergence, this divergence is symmetric in p and q and quadratic in Bayesian space. Clearly, p = q if and only if I(p ⊖ q) = 0.

Stochastic Derivative.
Accompanying this algebra is a functional calculus. Consider a PDF p(x|θ) ∈ P depending continuously on some parameter θ. We define the stochastic partial derivative of p with respect to θ as (Barfoot and D'Eleuterio, 2003; Egozcue et al., 2013)

ðp/ðθ = lim_{λ→0} (1/λ) · (p(x|θ + λ) ⊖ p(x|θ)).    (14)

Note that the result of this operation remains an element in P. We can also define directional derivatives and a gradient operator but these will not be required here.

While P is an infinite-dimensional space, it contains infinitely many finite-dimensional subspaces. We can in fact build a subspace Q by taking the span of a set of M vectors B = {b₁, ..., b_M}, namely,

Q = span{b₁, ..., b_M}.    (15)

If we choose B to be linearly independent, it will form a basis for Q. We can accordingly write every vector q in Q as a linear combination of B, i.e.,

q = ⊕_{m=1}^{M} α_m · b_m,    (16)

where the α_m ∈ ℝ are unique.

As a shorthand, we will denote b = [b₁ b₂ ··· b_M]ᵀ as the basis. The inner products between all pairs of basis vectors form the Gram matrix,

⟨b, b⟩ = [⟨b_m, b_n⟩],    (17)

where (m, n) are the indices of the matrix. We furthermore have an orthonormal basis if ⟨b_m, b_n⟩ = δ_mn, the Kronecker delta, in which case ⟨b, b⟩ = 1, the identity matrix.

Given a subspace Q of P and p ∈ P, the PDF q* ∈ Q that minimizes the distance to, as well as the divergence (13) from, p is the projection of p onto Q, that is,

q* = proj_Q p.    (18)

As in Euclidean geometry, we can view p as being decomposed into a component p∥ lying in Q and a component p⊥ perpendicular to it; therefore q* = p∥ (see Figure 1).

The coordinates of q* can be calculated as

α* = ⟨b, b⟩⁻¹ ⟨b, p⟩.    (19)
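A small numerical illustration of (19) (ours, with an arbitrarily chosen two-function basis): project a Gaussian p onto the subspace spanned by b₁ = ↓exp(−x) and b₂ = ↓exp(−x²). Since ln p lies in the span of the log-basis features, the projection recovers p exactly:

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 4001)
nu = np.exp(-0.5 * x**2)
nu /= np.trapz(nu, x)                 # measure nu = N(0, 1) on the grid

def E(f):                             # expectation under nu
    return np.trapz(f * nu, x)

def cov(f, g):                        # the inner product (10) on log-densities
    return E(f * g) - E(f) * E(g)

logb = [-x, -x**2]                    # ln b_1 = -x, ln b_2 = -x^2 (up to constants)
G = np.array([[cov(f, g) for g in logb] for f in logb])   # Gram matrix <b, b>

logp = -0.5 * (x - 1.0)**2 / 0.5      # target p = N(1, 0.5), unnormalized log
v = np.array([cov(f, logp) for f in logb])
alpha = np.linalg.solve(G, v)         # alpha* = <b, b>^-1 <b, p>

q = np.exp(sum(a * lb for a, lb in zip(alpha, logb)))
q /= np.trapz(q, x)                   # q* as a normalized linear combination
p = np.exp(logp)
p /= np.trapz(p, x)
assert np.allclose(q, p, atol=1e-8)   # ln p lies in span{-x, -x^2}, so q* = p
```

Here the expected coordinates are α = (−2, 1), since −(x − 1)²/(2 · 0.5) = −x² + 2x up to a constant; a target outside the span would instead return its closest element in Q.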
Figure 1: Projection onto a subspace, Q, of the Bayesian Hilbert space of PDFs, P.

We may also write the projection as an outer-product operation on p,

q*(x) = Q(x, x′) ⊛ p(x′),    (20)

where

Q(x, x′) = b(x)⟩⟨b, b⟩⁻¹⟨b(x′)    (21)

is the kernel of Q. (See also Appendix B.)

As an example, consider the domain over which the PDFs of P are defined to be ℝ. We can use the exponentiated Hermite polynomials as a basis for our infinite-dimensional P; in fact, they prove to be a natural choice. In one dimension, the first few probabilists' Hermite polynomials are

H₁(ξ) = ξ,  H₂(ξ) = ξ² − 1,  H₃(ξ) = ξ³ − 3ξ,  H₄(ξ) = ξ⁴ − 6ξ² + 3.    (22)

(We exclude H₀(ξ) = 1 as the resulting vector is the zero vector; however, it will need to be introduced when considering the domain ℝᴺ as explained in Appendix D.) Owing to the properties of the Hermite polynomials, namely, that

∫_{−∞}^{∞} H_n(ξ) ν(ξ) dξ = 0,  ∫_{−∞}^{∞} H_m(ξ) H_n(ξ) ν(ξ) dξ = n! δ_mn,  m, n = 1, 2, ...,    (23)

where ν(ξ) = N(0, 1) is the standard normal density, we can construct an orthonormal basis for P following Egozcue et al. (2006). Accordingly,

E_ν[H_n] = 0,  E_ν[H_m H_n] = n! δ_mn,  m, n = 1, 2, ...    (24)

Our basis functions are

h_n(ξ) = ↓exp(−η_n(ξ)),  η_n(ξ) = (1/√(n!)) H_n(ξ).    (25)

Orthogonality follows as

⟨h_m, h_n⟩ = E_ν[η_m η_n] − E_ν[η_m] E_ν[η_n] = (1/√(m! n!)) ∫_{−∞}^{∞} H_m(ξ) H_n(ξ) (1/√(2π)) exp(−ξ²/2) dξ = δ_mn.    (26)

An arbitrary PDF p ∈ P can be expanded elegantly in terms of this Hermite basis. However, we first need two lemmata, resting on the recursive definition of Hermite polynomials; these are

Lemma 1.
For the standard normal measure, ν ∼ N(0, 1),

E_ν[H_{n+1}(ξ) f(ξ)] = E_ν[H_n(ξ) ∂f(ξ)/∂ξ],    (27)

where f(ξ) is a differentiable function and is such that the expectations exist.

Lemma 2.
For the standard normal measure, ν ∼ N(0, 1),

E_ν[H_n(ξ) f(ξ)] = E_ν[∂ⁿf(ξ)/∂ξⁿ],    (28)

where f(ξ) is an n-fold differentiable function and is such that the expectations exist.

Now consider p ∈ P expressed as p(ξ) = ↓exp(−φ(ξ)). The coordinates are given by

α_n = ⟨h_n, p⟩ = (1/√(n!)) E_ν[∂ⁿφ(ξ)/∂ξⁿ]    (29)

and hence

p(ξ) = ⊕_{n=1}^{∞} α_n · h_n(ξ) = ↓exp(−Σ_{n=1}^{∞} (1/n!) E_ν[∂ⁿφ(ξ)/∂ξⁿ] H_n(ξ)).    (30)

We can account for measures other than the standard normal density, say ν ∼ N(µ, σ²), by the well-known reparameterization 'trick',

x = µ + σξ,    (31)

which leads to

p(x) = ⊕_{n=1}^{∞} α_n · h_n((x − µ)/σ) = ↓exp(−Σ_{n=1}^{∞} (σⁿ/n!) E_ν[∂ⁿφ(x)/∂xⁿ] H_n((x − µ)/σ)).    (32)

It is instructive to rewrite this expression by replacing −φ with ln p, giving

p(x) = ↓exp(Σ_{n=1}^{∞} (σⁿ/n!) E_ν[∂ⁿ ln p(x)/∂xⁿ] H_n((x − µ)/σ)).    (33)

This is a Taylor-like expansion of p pivoting on a given mean µ and standard deviation σ.

Any subset of the basis functions {h₁, h₂, ...} establishes a subspace of P; however, as far as such subspaces are concerned, it would be natural to choose an M-dimensional subspace H spanned by the first M basis functions. As the basis is orthonormal, the Gram matrix is ⟨h, h⟩ = 1.

The Hermite functions can also be used to generate a basis for P on the domain ℝᴺ (see Appendix D).

Let us consider a simple one-dimensional, nonlinear estimation problem as a numerical example motivated by the type of inverse-distance nonlinearity found in a stereo camera model. This same experiment (with the same parameter settings) was used as a running example by Barfoot (2017, §4). We assume that our true state is drawn from a Gaussian prior:

x ∼ N(µ_p, σ_p²).
(34)

We then generate a measurement according to

z = fb/x + n,  n ∼ N(0, σ_r²),    (35)

where n is measurement noise. The numerical values of the parameters used were

µ_p = 20 [m],  σ_p² = 9 [m²],  f = 400 [pixel],  b = 0.1 [m],  σ_r² = 0.09 [pixel²].    (36)

The true posterior is given by

p(x|z) = ↓exp(−φ(x)),  φ(x) = (1/2)(x − µ_p)²/σ_p² + (1/2)(z − fb/x)²/σ_r²,    (37)

where the first term is the prior and the second the measurement. We seek to find q(x), in a subspace spanned by a Hermite basis, closest to p(x|z). This problem can also be viewed as the correction step of the Bayes filter (Jazwinski, 1970): start from a prior and correct it based on the latest (nonlinear) measurement.

Figure 2 provides an example of projection using (29) with an increasing number of Hermite basis functions. The measure used was ν(x) = N(µ_p, σ_p²). We see that even with a fixed measure, the approximation gets better as the number of basis functions increases, to the point that it is difficult to distinguish visually between the posterior and the approximation. The bottom panel shows that the divergence, I(p ⊖ q), essentially decreases exponentially fast with more basis functions.
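To make the experiment concrete, the following sketch (our own; it assumes a true state of x = 22 m with a noiseless synthetic measurement, and takes the parameter values quoted above) projects the stereo posterior onto the first M exponentiated Hermite basis functions. It evaluates α_n = ⟨h_n, p⟩ directly as a covariance under ν rather than through the derivative formula (29), and reports the divergence I(p ⊖ q) = ½ Var_ν(ln p − ln q):

```python
import math
import numpy as np
from numpy.polynomial.hermite_e import HermiteE   # probabilists' Hermite He_n

mu_p, var_p = 20.0, 9.0             # prior N(20 m, 9 m^2)
f, b, var_r = 400.0, 0.1, 0.09      # stereo parameters (as in the running example)
x_true = 22.0                       # assumed true state [m] (our choice)
z = f * b / x_true                  # synthetic noiseless measurement (assumption)

x = np.linspace(2.0, 40.0, 20001)   # grid; nu-mass outside is negligible (>= 6 sigma)
xi = (x - mu_p) / math.sqrt(var_p)  # standardized coordinate for the basis
nu = np.exp(-0.5 * xi**2)
nu /= np.trapz(nu, x)               # measure nu = N(mu_p, var_p) on the grid

def E(g):                           # expectation under nu
    return np.trapz(g * nu, x)

phi = 0.5*(x - mu_p)**2/var_p + 0.5*(z - f*b/x)**2/var_r   # -ln p(x|z) + const

def divergence(M):
    """I(p - q) = 0.5 Var_nu(ln p - ln q) for the M-function Hermite projection."""
    logq = np.zeros_like(x)
    for n in range(1, M + 1):
        eta = HermiteE.basis(n)(xi) / math.sqrt(math.factorial(n))
        alpha = E(eta * phi) - E(eta) * E(phi)   # alpha_n = <h_n, p>, ln h_n = -eta_n
        logq -= alpha * eta                      # ln q = -sum_n alpha_n eta_n + const
    r = phi + logq                               # -(ln p - ln q), up to a constant
    return 0.5 * (E(r**2) - E(r)**2)

divs = [divergence(M) for M in (2, 4, 6, 8, 10)]
assert all(d1 > d2 for d1, d2 in zip(divs, divs[1:]))   # monotone decrease
```

The divergence values shrink steadily as M grows, mirroring the exponential trend described for Figure 2; with a noisy measurement or a different assumed true state the trend is the same, only the coefficients change.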
Figure 2: An example of projection of a posterior onto a finite basis with an increasing number of basis functions. The top panel qualitatively shows that adding more basis functions brings the approximation closer to the posterior. The bottom panel shows the same quantitatively, where I(p ⊖ q) decreases exponentially fast with more basis functions.

We now turn to a subspace involving Gaussian PDFs, which are quintessentially important to statistics and estimation theory. Gaussians, as traditionally defined with a positive-definite covariance matrix, do not in themselves form a subspace. We need to expand the set to include covariance matrices that are sign-indefinite. Let us accordingly define an N-dimensional indefinite-Gaussian distribution as

p(x) = N(µ, Σ) = (1/√((2π)ᴺ |Σ|)) exp(−(1/2)(x − µ)ᵀΣ⁻¹(x − µ)),    (38)

which has mean, µ, and symmetric covariance, Σ. The set of all indefinite Gaussians is

G = { p(x) = N(µ, Σ) | x, µ ∈ ℝᴺ, Σ ∈ ℝᴺ×ᴺ }.    (39)

It is easy to show that G is in fact a subspace of P, as the zero vector is contained therein (Σ⁻¹ = O, allowing that Σ → ∞) and the set is closed under vector addition and scalar multiplication.

To establish G as a Bayesian Hilbert space, we must have an appropriate measure, ν. In our case, we choose the measure to also be a Gaussian, ν = N(µ, Σ) ∈ G. We may thus declare G to be a Bayesian Hilbert space for a measure ν ∈ G and note that Gaussian PDFs are square-log integrable, satisfying (11).

Several possibilities exist to parameterize Gaussians (Barfoot, 2020). There are N(N + 3)/2 unique elements contained in the mean and the symmetric covariance matrix on ℝᴺ; hence the dimension of G is N(N + 3)/2. We shall construct our basis on a positive-definite choice of covariance Σ that we can decompose in Cholesky fashion, i.e., Σ = LLᵀ. Now consider

γ₁(x) = L⁻¹(x − µ),  γ₂(x) = √((1/2)DᵀD) vech(L⁻¹(x − µ)(x − µ)ᵀL⁻ᵀ),    (40)

wherein vech(·) is the half-vectorization of its matrix argument and D is the associated duplication matrix (see Appendix A). Note that γ₁ is an N × 1 column and γ₂ is an N(N + 1)/2 × 1 column. With a little abuse of notation, we set the basis functions as

g(x) = [g₁(x); g₂(x)] = ↓exp(−[γ₁(x); γ₂(x)]);    (41)

that is, the exponential is applied elementwise. We claim that g(x) is a basis for G.

It is instructive to show that g(x) spans G, as well as serving as the proof that it is a basis. Consider again the reparameterization 'trick' given by

x = µ + Lξ    (42)

with ξ ∼ N(0, 1).
This renders (41) as

g(ξ) = [g₁(ξ); g₂(ξ)] = ↓exp(−[γ₁(ξ); γ₂(ξ)]) = ↓exp(−[ξ; √((1/2)DᵀD) vech(ξξᵀ)]).    (43)

A linear combination of the basis functions can be written as

p(ξ) = ↓exp(−α₁ᵀγ₁(ξ) − α₂ᵀγ₂(ξ)).    (44)

Now

α₁ᵀγ₁ = α₁ᵀξ.    (45)

Also, we can in general express the second set of coordinates as

α₂ = √((1/2)DᵀD) vech S    (46)

for some symmetric S that can easily be reconstructed from α₂. Hence

α₂ᵀγ₂ = (1/2)(vech S)ᵀ DᵀD vech(ξξᵀ) = (1/2)(vec S)ᵀ vec(ξξᵀ),    (47)

given the identities vech A = D† vec A and DD† vec A = vec A, where D† is the Moore-Penrose inverse of D (Appendix A). Moreover, the identity (vec A)ᵀ vec B = tr(AB) leads to

α₂ᵀγ₂ = (1/2) tr(S ξξᵀ) = (1/2) ξᵀSξ.    (48)

Then

p(x) = ↓exp(−α₁ᵀξ − (1/2)ξᵀSξ) = ↓exp(−(1/2)(ξ + S⁻¹α₁)ᵀ S (ξ + S⁻¹α₁)) = ↓exp(−(1/2)(x − (µ − LS⁻¹α₁))ᵀ L⁻ᵀSL⁻¹ (x − (µ − LS⁻¹α₁))).    (49)

This can represent any Gaussian distribution, where the mean is µ − LS⁻¹α₁ and the covariance is LS⁻¹Lᵀ. Thus g spans G. Furthermore, as the dimension of G is N(N + 3)/2, the number of functions in g, g is a basis for G.

This basis is, in addition, orthonormal, as can be shown in a straightforward fashion by using the reparameterized form g(ξ) and recognizing that the entries in γ₁(ξ) are ξᵢ and those in γ₂(ξ) are either ξᵢξⱼ (i ≠ j) or ξᵢ²/√2. Hence, ⟨g, g⟩ = 1.

It can be shown that

α₁ = ⟨g₁, p⟩ = Lᵀ E_ν[∂φ(x)/∂xᵀ],  α₂ = ⟨g₂, p⟩ = √((1/2)DᵀD) vech(Lᵀ E_ν[∂²φ(x)/∂xᵀ∂x] L)    (50)

are the coordinates for p(x) = ↓exp(−φ(x)) ∈ G. Consult Appendix E, where the more general case that the projection of any PDF in P onto G is shown to be given also by (50).
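The orthonormality claim ⟨g, g⟩ = 1 can be checked numerically for N = 2. The snippet below is our illustration: it builds the duplication matrix by hand, evaluates the features [γ₁; γ₂] at tensor-product Gauss-Hermite nodes for ν = N(0, I), and verifies that their covariance, i.e., the Gram matrix, is the identity. The ½ inside the square-root scaling is the normalization that makes the diagonal entries (the ξᵢ²/√2 terms) come out with unit variance:

```python
import numpy as np

# duplication matrix D for N = 2 (vech order: (1,1), (2,1), (2,2))
D = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0., 1., 0.],
              [0., 0., 1.]])
# D^T D is diagonal here, so its matrix square root is elementwise
W = np.diag(np.sqrt(0.5 * np.diag(D.T @ D)))     # sqrt((1/2) D^T D)

def features(x1, x2):
    """[gamma1; gamma2] at xi = (x1, x2): the negative log basis functions."""
    vech = np.array([x1 * x1, x2 * x1, x2 * x2])
    return np.concatenate(([x1, x2], W @ vech))

# tensor-product Gauss-Hermite quadrature for xi ~ N(0, I)
t, w = np.polynomial.hermite.hermgauss(8)
nodes, wts = np.sqrt(2.0) * t, w / np.sqrt(np.pi)

m1 = np.zeros(5)
m2 = np.zeros((5, 5))
for i, wi in enumerate(wts):
    for j, wj in enumerate(wts):
        fv = features(nodes[i], nodes[j])
        m1 += wi * wj * fv
        m2 += wi * wj * np.outer(fv, fv)

gram = m2 - np.outer(m1, m1)       # <g, g> = covariance of the features under nu
assert np.allclose(gram, np.eye(5), atol=1e-10)
```

Because every feature is a polynomial of degree at most two, the eight-point rule integrates all the required moments exactly, so the check is exact up to floating-point error.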
Another rendering of (49) is

p(x) = ↓exp(−(x − µ)ᵀ E_ν[∂φ(x)/∂xᵀ] − (1/2)(x − µ)ᵀ E_ν[∂²φ(x)/∂xᵀ∂x] (x − µ)),    (51)

which also expresses the projection of a PDF in P onto G. We include the derivation of the information I of a Gaussian distribution in Appendix F.

We shall now address the problem of variational Bayesian inference using the algebraic tools of the Bayesian space. But we begin by reviewing the
Fisher information matrix and showing that, with respect to the coordinates used in a given subspace, it is simply the Gram matrix of the chosen basis.
Let q(x|θ) ∈ Q, a finite-dimensional subspace of P with basis B, depend on some parameter θ. The Fisher information on θ with respect to the measure ν is defined to be the covariance of the score (Fisher, 1922), i.e.,

I_θ = E_ν[(∂ln q/∂θ − E_ν[∂ln q/∂θ])²] = E_ν[(∂ln q/∂θ)²] − (E_ν[∂ln q/∂θ])².    (52)

While our Fisher information may appear slightly unfamiliar, by taking the measure to be the density ν = q then E_q[∂ln q/∂θ] = 0 and we have the traditional version. We purposely delay setting ν = q to show the connection to Bayesian space.

Take q to be expressed as a linear combination of the basis functions b_n, that is,

q(x|θ) = ⊕_n α_n(θ) · b_n.    (53)

The score is

∂ln q/∂θ = (1/q) ∂q/∂θ.    (54)

As q = ∏_n b_n^{α_n} / ∫ ∏_n b_n^{α_n} dx,

∂q/∂θ = ∂/∂θ (∏_n b_n^{α_n} / ∫ ∏_n b_n^{α_n} dx) = Σ_m (∂α_m/∂θ) ln b_m (∏_n b_n^{α_n} / ∫ ∏_n b_n^{α_n} dx) − (∏_n b_n^{α_n} / ∫ ∏_n b_n^{α_n} dx) Σ_m (∂α_m/∂θ) (∫ ln b_m ∏_n b_n^{α_n} dx / ∫ ∏_n b_n^{α_n} dx) = q Σ_m (∂α_m/∂θ)(ln b_m − E_q[ln b_m]).    (55)

Hence

∂ln q/∂θ = Σ_m (∂α_m/∂θ)(ln b_m − E_q[ln b_m]).    (56)

(We note in passing that ∂ln q/∂θ = Σ_m (∂α_m/∂θ) clr b_m, where clr b_m is the centered log-ratio transformation (van den Boogaart et al., 2014).)

Substituting (56) into (52) produces

I_θ = Σ_m Σ_n (∂α_m/∂θ)(∂α_n/∂θ)(E_ν[ln b_m ln b_n] − E_ν[ln b_m] E_ν[ln b_n]) = (∂α/∂θ)ᵀ ⟨b, b⟩ (∂α/∂θ).
(57)

The traditional Fisher information uses q as the measure and we will indicate that explicitly with a subscript on the inner product, e.g., ⟨b, b⟩_q.

We mention for interest that the stochastic derivative of q(x|θ) with respect to θ is

ðq/ðθ = ⊕_m (∂α_m/∂θ) · b_m(x)    (58)

and so

I_θ = (∂α/∂θ)ᵀ ⟨b, b⟩ (∂α/∂θ) = ⟨⊕_m (∂α_m/∂θ) · b_m, ⊕_n (∂α_n/∂θ) · b_n⟩ = ⟨ðq/ðθ, ðq/ðθ⟩,    (59)

which makes the inner-product expression of the Fisher information coordinate-free.

For multiple parameters, θ₁, θ₂, ..., θ_K, the (m, n) entry in the Fisher information matrix (FIM) is

I_θ,mn = E_ν[(∂ln q/∂θ_m − E_ν[∂ln q/∂θ_m])(∂ln q/∂θ_n − E_ν[∂ln q/∂θ_n])] = E_ν[(∂ln q/∂θ_m)(∂ln q/∂θ_n)] − E_ν[∂ln q/∂θ_m] E_ν[∂ln q/∂θ_n],    (60)

leading to

I_θ = (∂α/∂θ)ᵀ ⟨b, b⟩ (∂α/∂θ).    (61)

We shall be particularly interested in the FIM with respect to the coordinates for a given basis, that is, when θ = α. In this case, the FIM is simply the Gram matrix,

I_α = ⟨b, b⟩.    (62)

When q is used as the measure, we shall write I_α = ⟨b, b⟩_q.

In variational Bayesian inference, we seek to find an approximation, q, from some family of distributions constituting a subspace Q, to the true Bayesian posterior p ∈ P. In general,

Q ⊆ P,    (63)

where equality implies that q = p will match the posterior exactly.
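Before proceeding, the identity I_α = ⟨b, b⟩_q above can be checked numerically. The snippet (ours; the two log-linear basis functions are an illustrative choice) compares a finite-difference evaluation of the traditional FIM of q(x|α) against the Gram matrix of the basis computed with q itself as the measure:

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 4001)
logb = [-x, -0.5 * x**2]            # ln b_1 = -x, ln b_2 = -x^2/2 (illustrative)

def make_q(alpha):
    """q = normalized product of b_m^(alpha_m) on the grid."""
    lq = alpha[0] * logb[0] + alpha[1] * logb[1]
    q = np.exp(lq - lq.max())       # subtract max for numerical stability
    return q / np.trapz(q, x)

def E(q, g):
    return np.trapz(q * g, x)

alpha = np.array([0.3, 1.2])
q = make_q(alpha)

# Gram matrix of the basis with q as the measure: <b_m, b_n>_q
gram = np.array([[E(q, f * g) - E(q, f) * E(q, g) for g in logb] for f in logb])

# traditional FIM via central finite differences of the score d(ln q)/d(alpha_m)
eps = 1e-5
scores = []
for m in range(2):
    da = np.zeros(2)
    da[m] = eps
    lp, lm = np.log(make_q(alpha + da)), np.log(make_q(alpha - da))
    scores.append((lp - lm) / (2.0 * eps))
fim = np.array([[E(q, sm * sn) for sn in scores] for sm in scores])

assert np.allclose(fim, gram, atol=1e-6)
```

The finite-difference score automatically includes the derivative of the normalization constant, which is exactly the −E_q[ln b_m] term in (56), so no analytic bookkeeping is needed for the comparison.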
But P is infinite-dimensional and, in practice, Q ⊂ P is a finite-dimensional subspace.

There are many possible divergences that can be defined to characterize the 'closeness' of q to p, including the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), Bregman divergence (Bregman, 1967), Wasserstein divergence/earth mover's distance (Monge, 1781) and Rényi divergence (Rényi, 1961). We shall focus on the KL divergence, which is defined as

KL(q || p) = −∫_X q(x) ln(p(x|z)/q(x)) dx = −E_q[ln p − ln q].    (64)

Sometimes the reverse of this is used: KL(p || q). Note, we show the divergence with respect to the posterior, p(x|z), but in practice during the calculations we use that p(x|z) = p(x, z)/p(z) = ↓p(x, z), since the joint likelihood is easy to construct and then the p(z) can be dropped, for it does not result in a KL term that depends on x. We will generically use p in what follows to keep the notation clean.

KL Gradient.
We assume a basis B = {b₁, b₂, ..., b_M} for Q and we write q as

q = ⊕_{m=1}^{M} α_m · b_m.    (65)

We desire to minimize the KL divergence with respect to the coordinates α_m. The gradient of KL(q || p) can be computed as follows:

∂KL/∂α_n = −∫_X ((∂q/∂α_n)(ln p − ln q) − q ∂ln q/∂α_n) dx.    (66)

Recalling (55) and (56), this reduces to

∂KL/∂α_n = −E_q[ln b_n (ln p − ln q)] + E_q[ln b_n] E_q[ln p − ln q] = −⟨b_n, p ⊖ q⟩_q    (67)

or, collecting these in matrix form,

∂KL/∂αᵀ = −⟨b, p ⊖ q⟩_q.    (68)

The necessary condition for a minimum of the KL divergence is that the gradient is zero. Newton's method suggests the manner in which we might iteratively solve for the optimal distribution. Following the established procedure, the iteration for the coordinates is given by

α^(i+1) = α^(i) + H^(i)⁻¹ ⟨b, p ⊖ q^(i)⟩_{q^(i)},    (69)

where H is the Hessian of the KL divergence.

KL Hessian.
The (m, n) entry of the Hessian is

∂²KL/∂α_m∂α_n = −(∂/∂α_m)⟨b_n, p ⊖ q⟩_q.    (70)

This differentiation must take into account the effect of the 'measure' q. The product rule applies here and we can break down the differentiation as

∂²KL/∂α_m∂α_n = −((∂/∂α_m)⟨b_n, p ⊖ q⟩)_q − ⟨b_n, p ⊖ q⟩_{∂q/∂α_m},    (71)

the first term of which is to be read as the derivative of the inner product holding the measure fixed and the second of which deals with the derivative of the measure while holding the arguments of the inner product fixed. The first term is

((∂/∂α_m)⟨b_n, p ⊖ q⟩)_q = ((∂/∂α_m)⟨b_n, p⟩ − (∂/∂α_m) Σ_k α_k ⟨b_n, b_k⟩)_q = −⟨b_n, b_m⟩_q = −⟨b_m, b_n⟩_q.    (72)

As shown in Appendix G, the second becomes

⟨b_n, p ⊖ q⟩_{∂q/∂α_m} = ⟨b_n, (∂ln q/∂α_m) · (p ⊖ q)⟩_q − E_q[ln p − ln q] ⟨b_m, b_n⟩_q.    (73)

We advise that the coefficient ∂ln q/∂α_m of p ⊖ q is in fact a function of the state and as such cannot be transferred to the other argument of the inner product as would be possible for a scalar in the field ℝ. We also recognize the factor of the last term as KL(q || p). Therefore, substituting (72) and (73) into (71) yields

∂²KL/∂α_m∂α_n = (1 − KL(q || p)) ⟨b_m, b_n⟩_q − ⟨b_n, (∂ln q/∂α_m) · (p ⊖ q)⟩_q.    (74)

We observe that the second term on the right-hand side is symmetric in the indices, as the substitution of (56) will attest. In matrix form, the Hessian is

H = ∂²KL/∂αᵀ∂α = (1 − KL(q || p)) I_α − ⟨b, (∂ln q/∂αᵀ) · (p ⊖ q)⟩_q.    (75)

Newton's method (69) can now be implemented.
But the Hessian bears a closer look. It can also be written explicitly as
$$
\frac{\partial^2 \mathrm{KL}}{\partial \alpha_m \, \partial \alpha_n} = \langle b_m, b_n \rangle_q - E_q[\ln b_m \ln b_n (\ln p - \ln q)] + E_q[\ln b_m \ln b_n] \, E_q[\ln p - \ln q] + E_q[\ln b_n] \, E_q[\ln b_m (\ln p - \ln q)] + E_q[\ln b_m] \, E_q[\ln b_n (\ln p - \ln q)] - 2 \, E_q[\ln b_m] \, E_q[\ln b_n] \, E_q[\ln p - \ln q] , \tag{76}
$$
the terms of which can be collected as
$$
\frac{\partial^2 \mathrm{KL}}{\partial \alpha_m \, \partial \alpha_n} = \langle b_m, b_n \rangle_q + \big\langle -b_{mn} + E_q[\ln b_n] \cdot b_m + E_q[\ln b_m] \cdot b_n , \; p \ominus q \big\rangle_q , \tag{77}
$$
where $b_{mn} = \downarrow\exp(\ln b_m \ln b_n)$. The symmetry of the Hessian is plainly evident in this version.

Iterative Projection.
In the vicinity of the optimal distribution, with a sufficiently large subspace $\mathcal{Q}$, we may expect $p \ominus q$ to be small in the sense that $\ln p - \ln q$ is small almost everywhere. This makes all the terms in the Hessian of first order except $\mathbf{I}_{\boldsymbol{\alpha}}$, which is of zeroth order. The gradient (68) is also of first order. Thus, to keep Newton's descent to this order, we may approximate the Hessian as $\mathbf{H} \simeq \mathbf{I}_{\boldsymbol{\alpha}}$ and the iterative procedure (69) becomes simply
$$
\boldsymbol{\alpha}^{(i+1)} = \boldsymbol{\alpha}^{(i)} + \mathbf{I}_{\boldsymbol{\alpha}}^{(i)-1} \left\langle \mathbf{b}, p \ominus q^{(i)} \right\rangle_{q^{(i)}} . \tag{78}
$$
However, as $q^{(i)} = \bigoplus_m \alpha_m^{(i)} \cdot b_m$,
$$
\left\langle \mathbf{b}, p \ominus q^{(i)} \right\rangle_{q^{(i)}} = \langle \mathbf{b}, p \rangle_{q^{(i)}} - \langle \mathbf{b}, \mathbf{b} \rangle_{q^{(i)}} \boldsymbol{\alpha}^{(i)} = \langle \mathbf{b}, p \rangle_{q^{(i)}} - \mathbf{I}_{\boldsymbol{\alpha}}^{(i)} \boldsymbol{\alpha}^{(i)} . \tag{79}
$$

Figure 3: Iterative projection onto a sequence of Bayesian Hilbert spaces, $(\mathcal{Q}, q^{(i)})$.

Hence (78) becomes
$$
\boldsymbol{\alpha}^{(i+1)} = \mathbf{I}_{\boldsymbol{\alpha}}^{(i)-1} \langle \mathbf{b}, p \rangle_{q^{(i)}} . \tag{80}
$$
The iterative update to $q$ is $q^{(i+1)} = \bigoplus_m \alpha_m^{(i+1)} \cdot b_m$. We may then express (80) using the application of the kernel of $\mathcal{Q}$,
$$
q^{(i+1)} = \left( \mathbf{b} \big\rangle \, \mathbf{I}_{\boldsymbol{\alpha}}^{(i)-1} \big\langle \mathbf{b} \right) \circledast p , \tag{81}
$$
which describes a projection onto $\mathcal{Q}$ using measure $q^{(i)}$. That is, the procedure can be viewed as an iterative projection,
$$
q^{(i+1)} = \mathrm{proj}_{(\mathcal{Q}, q^{(i)})} \, p . \tag{82}
$$
Figure 3 depicts the scheme. The procedure is essentially the application of Newton's method to the KL divergence with the Hessian approximated as the FIM. This is precisely the approximation made in natural gradient descent (Amari, 1998). In our Bayesian space, the operating point of the Newton step becomes the measure for the inner product. This highlights a key aspect of using the algebra associated with a Bayesian space.
It recognizes the dual role of $q$: on the one hand it is the approximating PDF and on the other it serves as a measure that weights the difference between the approximation and the approximated.

Convergence of iterative projection is guaranteed if the Hessian is positive-definite. Provided that the subspace is large enough, we can expect convergence when we begin in a neighborhood of the optimal $q$ where the first-order terms in the Hessian are sufficiently small.

It is notable that each step of the iterative projection is equivalent to the local minimization of the divergence $I(p \ominus q)$ with the measure fixed at $q^{(i)}$ because
$$
I\!\left( p \ominus \big( q^{(i)} \oplus \delta q \big) \right) = I\big( p \ominus q^{(i)} \big) + \delta\boldsymbol{\alpha}^T \left( \frac{\partial I}{\partial \boldsymbol{\alpha}^T} \right)_{q^{(i)}} + \frac{1}{2} \, \delta\boldsymbol{\alpha}^T \left( \frac{\partial^2 I}{\partial \boldsymbol{\alpha}^T \partial \boldsymbol{\alpha}} \right)_{q^{(i)}} \delta\boldsymbol{\alpha} , \tag{83}
$$
where $\delta\boldsymbol{\alpha} = \boldsymbol{\alpha}^{(i+1)} - \boldsymbol{\alpha}^{(i)}$ and
$$
\left( \frac{\partial I}{\partial \boldsymbol{\alpha}^T} \right)_{q^{(i)}} = -\left\langle \mathbf{b}, p \ominus q^{(i)} \right\rangle_{q^{(i)}} , \qquad \left( \frac{\partial^2 I}{\partial \boldsymbol{\alpha}^T \partial \boldsymbol{\alpha}} \right)_{q^{(i)}} = \langle \mathbf{b}, \mathbf{b} \rangle_{q^{(i)}} \equiv \mathbf{I}_{\boldsymbol{\alpha}}^{(i)} , \tag{84}
$$
which are identical to the linearized forms for the KL divergence.

Throughout this section we have assumed that the basis $\mathcal{B}$ remains constant across iterations, but this need not be the case. We may also choose to update the basis along with the measure to maintain, for example, orthonormality. This is explored in the next section on Gaussian iterative projection.

Let us now investigate iterative projection using Gaussian PDFs a little more closely, given their importance in statistics and estimation theory.
As mentioned at the end of the last section, we do not have to maintain the same basis from step to step as long as each basis spans the same subspace. This is a particularly useful maneuver when using the subspace $\mathcal{G}$ of indefinite Gaussians. Denote the mean and variance of $q^{(i)} \in \mathcal{G}$ as $\boldsymbol{\mu}^{(i)}$ and $\boldsymbol{\Sigma}^{(i)}$ and let the basis $\mathbf{g}^{(i)}$ be defined as in (40) and (41). Note that this basis is orthonormal with respect to $q^{(i)}$. As such, $\mathbf{I}_{\boldsymbol{\alpha}}^{(i)} = \langle \mathbf{g}^{(i)}, \mathbf{g}^{(i)} \rangle_{q^{(i)}} = \mathbf{1}$. Imagine the PDF to be approximated is expressed as $p = \downarrow\exp(-\phi(\mathbf{x})) \in \mathcal{P}$. The coordinates resulting from the next projection are given by (50), namely,
$$
\boldsymbol{\alpha}_1^{(i+1)} = \big\langle \mathbf{g}_1^{(i)}, p \big\rangle = \mathbf{L}^{(i)T} E_{q^{(i)}}\!\left[ \frac{\partial \phi(\mathbf{x})}{\partial \mathbf{x}^T} \right] , \qquad \boldsymbol{\alpha}_2^{(i+1)} = \big\langle \mathbf{g}_2^{(i)}, p \big\rangle = \sqrt{\mathbf{D}^T \mathbf{D}} \; \mathrm{vech}\!\left( \mathbf{L}^{(i)T} E_{q^{(i)}}\!\left[ \frac{\partial^2 \phi(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}} \right] \mathbf{L}^{(i)} \right) , \tag{85}
$$
where $\mathbf{L}^{(i)}$ issues from the Cholesky decomposition of $\boldsymbol{\Sigma}^{(i)}$.

The new iteration is
$$
q^{(i+1)} = \mathrm{proj}_{(\mathcal{G}, q^{(i)})} \, p = \downarrow\exp\!\left( -\boldsymbol{\alpha}_1^{(i+1)T} \boldsymbol{\gamma}_1^{(i)} - \boldsymbol{\alpha}_2^{(i+1)T} \boldsymbol{\gamma}_2^{(i)} \right) . \tag{86}
$$
Using (51), this becomes
$$
q^{(i+1)} = \downarrow\exp\!\left( -(\mathbf{x} - \boldsymbol{\mu}^{(i)})^T E_{q^{(i)}}\!\left[ \frac{\partial \phi(\mathbf{x})}{\partial \mathbf{x}^T} \right] - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}^{(i)})^T E_{q^{(i)}}\!\left[ \frac{\partial^2 \phi(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}} \right] (\mathbf{x} - \boldsymbol{\mu}^{(i)}) \right) , \tag{87}
$$
which we may cast into the form
$$
q^{(i+1)} = \downarrow\exp\!\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}^{(i+1)})^T \boldsymbol{\Sigma}^{(i+1)-1} (\mathbf{x} - \boldsymbol{\mu}^{(i+1)}) \right) . \tag{88}
$$
Herein,
$$
\boldsymbol{\Sigma}^{(i+1)-1} = E_{q^{(i)}}\!\left[ \frac{\partial^2 \phi(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}} \right] , \tag{89a}
$$
$$
\boldsymbol{\Sigma}^{(i+1)-1} \delta\boldsymbol{\mu} = -E_{q^{(i)}}\!\left[ \frac{\partial \phi(\mathbf{x})}{\partial \mathbf{x}^T} \right] , \tag{89b}
$$
$$
\boldsymbol{\mu}^{(i+1)} = \boldsymbol{\mu}^{(i)} + \delta\boldsymbol{\mu} \tag{89c}
$$
give the updates from $q^{(i)} = \mathcal{N}(\boldsymbol{\mu}^{(i)}, \boldsymbol{\Sigma}^{(i)})$ to $q^{(i+1)} = \mathcal{N}(\boldsymbol{\mu}^{(i+1)}, \boldsymbol{\Sigma}^{(i+1)})$, and these are exactly the same as those used in the iterative Gaussian variational inference approach presented by Barfoot et al. (2020). We have arrived at the same variational updates but have done so from the framework of a Bayesian Hilbert space, where it becomes abundantly clear that the minimization algorithm is in fact a slightly simplified version of Newton's method. This also provides the connection back to the classic Gaussian filtering and smoothing algorithms as discussed by Barfoot et al. (2020).

The indefinite-Gaussian subspace does not guarantee that the covariance matrix of the approximation is positive-definite for every PDF, $p = \downarrow\exp(-\phi(\mathbf{x}))$. However, as this covariance, or more precisely its inverse, is given by
$$
\boldsymbol{\Sigma}^{-1} = E_q\!\left[ \frac{\partial^2 \phi(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}} \right] , \tag{90}
$$
positive-definiteness can be ensured if the Hessian of $\phi(\mathbf{x})$ is positive-definite. Equivalently, if $\phi(\mathbf{x})$ is a convex function in $\mathbf{x}$, that is, if $p(\mathbf{x})$ is logarithmically concave, then the covariance of the projection on $\mathcal{G}$ will be positive-definite. The same can in fact be said about projecting $p$ onto the Hermite subspace $\mathcal{H}$ provided that at least the first two Hermite polynomials are used in the basis (and combinations thereof in the multivariate case). The indefinite-Gaussian subspace is also a subspace of $\mathcal{H}$ so constructed.

This conjures the picture in Figure 4. Let the subset $\mathcal{P}^+ \subset \mathcal{P}$ include all logarithmically concave PDFs, i.e., all PDFs with a convex negative log-likelihood:
$$
\mathcal{P}^+ = \left\{ p(\mathbf{x}) = \downarrow\exp(-\phi(\mathbf{x})) \in \mathcal{P} \;\middle|\; \phi(\mathbf{x}) \text{ convex in } \mathbf{x} \in \mathcal{X} \right\} . \tag{91}
$$
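The Gaussian updates (89a)-(89c) are simple enough to sketch in code. The version below is our own 1-D illustration, not the authors' implementation: the expectations over $q^{(i)}$ are taken with Gauss-Hermite quadrature and the target is a hypothetical Gaussian so that the result can be verified by hand.

```python
import numpy as np

def gaussian_vi_step(dphi, d2phi, mu, sigma2, n_quad=30):
    """One iterative-projection update (89a)-(89c) for a 1-D Gaussian
    q = N(mu, sigma2) approximating p ~ exp(-phi(x)). Expectations over q
    are computed with Gauss-Hermite quadrature."""
    xj, wj = np.polynomial.hermite.hermgauss(n_quad)
    pts = mu + np.sqrt(2.0 * sigma2) * xj
    wj = wj / np.sqrt(np.pi)                       # E_q[f] ~ sum(wj * f(pts))
    sigma2_new = 1.0 / np.sum(wj * d2phi(pts))     # (89a): Sigma^{-1} = E[phi'']
    dmu = -sigma2_new * np.sum(wj * dphi(pts))     # (89b): Sigma^{-1} dmu = -E[phi']
    return mu + dmu, sigma2_new                    # (89c)

# Hypothetical target: p = N(3, 0.5^2), i.e. phi(x) = (x - 3)^2 / (2 * 0.25).
dphi = lambda x: (x - 3.0) / 0.25
d2phi = lambda x: np.full_like(x, 1.0 / 0.25)
mu, s2 = 0.0, 1.0
for _ in range(5):
    mu, s2 = gaussian_vi_step(dphi, d2phi, mu, s2)
```

For this log-concave target the recovered variance is positive, consistent with the discussion of $\mathcal{P}^+$ above; since $\phi$ is quadratic here, the iteration lands on the exact mean and variance in a single step.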
Figure 4: Relationship of subsets of PDFs with positive-definite variances. Specifically, $\mathcal{G}^+ \subset \mathcal{G}$, $\mathcal{H}^+ \subset \mathcal{H}$, and $\mathcal{P}^+ \subset \mathcal{P}$.

We can similarly define convex subsets for $\mathcal{H}$,
$$
\mathcal{H}^+ = \left\{ p(\mathbf{x}) = \downarrow\exp(-\phi(\mathbf{x})) \in \mathcal{H} \;\middle|\; \phi(\mathbf{x}) \text{ convex in } \mathbf{x} \in \mathcal{X} \right\} , \tag{92}
$$
and for $\mathcal{G}$,
$$
\mathcal{G}^+ = \left\{ p(\mathbf{x}) = \downarrow\exp(-\phi(\mathbf{x})) \in \mathcal{G} \;\middle|\; \phi(\mathbf{x}) \text{ convex in } \mathbf{x} \in \mathcal{X} \right\} . \tag{93}
$$
But we emphasize that these are subsets, not subspaces. However, $\mathcal{P}^+$ can be projected onto $\mathcal{H}^+ \subset \mathcal{H}$, which in turn can be projected onto $\mathcal{G}^+ \subset \mathcal{G}$.

Unfortunately, starting with $p \in \mathcal{P}^+$ is a very strict condition that can be difficult to meet in practice. Fortunately, this is not the only way that we can end up with $q \in \mathcal{G}^+$. One possibility is to start with $q$ in a local basin of convergence where the expected curvature remains positive. Another is to make an input correction by adding a regularizer term to $p \in \mathcal{P}$ to move it into $\mathcal{P}^+$ before projection. On the output end, if $q$ does not end up in $\mathcal{G}^+$, it can be rejected in favour of the closest $q \in \mathcal{G}^+$ to $p$; in other words, one can add a constraint to the variational objective. In the maximum a posteriori (MAP) setting, the last two ideas are tied to Levenberg-Marquardt regularization.

We return to the Hermite basis to provide examples of iterative projection on 2- and 4-dimensional subspaces using again the posterior of §3.3. In all examples, the expectations were computed with generic numerical integration although, as discussed by Barfoot et al. (2020), there are several other options including Gaussian quadrature.

We consider two different examples. In the first, shown in Figure 5, we hold the Hermite basis functions fixed and take the full estimate $q^{(i)}$ from the previous iteration to be the measure. With respect to the changing (and in the 4-basis-function case non-Gaussian) measure, the fixed basis is not orthogonal.
This then necessitates calculation of the Gram matrix at each iteration to implement the projection. We initialized both the 2- and 4-basis-function estimates to the prior, which corresponds to the first panel in Figure 5. The next three panels show subsequent iterations of the estimates. The last panel shows the KL divergence between the estimates and the true posterior for 10 iterations. We see that both estimates converged in a few iterations, with the 4-basis-function version performing better at each iteration and becoming visibly indistinguishable from the true posterior after four iterations.

In the second example, shown in Figure 6, we chose to use a Gaussian measure, even for the 4-basis-function estimate. This was motivated by the fact that the Hermite basis is orthogonal with respect to a Gaussian measure. Using an orthogonal basis brings with it a simpler form of the projection since the Gram matrix is the identity matrix. As the measure changes from one iteration to the next, we then have to update the basis to retain the desired orthogonality. This can be accomplished by again using the reparameterization 'trick' to adjust the basis to be orthogonal with respect to the current Gaussian measure. For the 4-basis-function estimate, the Gaussian measure was calculated by further projecting the current estimate onto the Gaussian subspace, a trivial matter of retaining only the first two coordinates, corresponding to $H_1$ and $H_2$. The first panel of Figure 6 again depicts both estimates initialized to the prior. The next three panels show the iterative scheme progressing. The last panel provides the KL divergence between the estimates and the true posterior. The results are similar to those of the previous example. Interestingly, the 4-basis-function estimate was only slightly worse in terms of its final KL divergence than in the first example, despite being constrained by its use of a Gaussian measure.
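The fixed-basis scheme of the first example can be sketched compactly. The following is our own illustration with a hypothetical 1-D log-basis and target (not the paper's §3.3 posterior): the inner product is the covariance of log-densities under the current measure, and the Gram matrix is recomputed against the changing measure on every pass.

```python
import numpy as np

def iterative_projection(ln_p, ln_b, xs, ln_q0, n_iter=5):
    """Iterative projection (82) on a grid with a fixed, non-orthogonal
    log-basis ln_b. The inner product is <u, v>_q = Cov_q(ln u, ln v) and
    the measure q^(i) is the previous estimate, so the Gram matrix must be
    rebuilt each iteration."""
    ln_q = np.array(ln_q0, dtype=float)
    for _ in range(n_iter):
        q = np.exp(ln_q - ln_q.max())
        w = q / q.sum()                         # E_q[f] ~ sum(w * f)
        def inner(lu, lv):                      # Cov_q(ln u, ln v)
            return np.sum(w * lu * lv) - np.sum(w * lu) * np.sum(w * lv)
        G = np.array([[inner(bi, bj) for bj in ln_b] for bi in ln_b])  # Gram
        c = np.array([inner(bi, ln_p) for bi in ln_b])
        alpha = np.linalg.solve(G, c)           # coordinates, cf. (80)
        ln_q = sum(a * b for a, b in zip(alpha, ln_b))
    return alpha, ln_q

# Hypothetical check: p ~ exp(-2x^2 + 4x) lies in span{x, x^2}, so the
# projection should recover its coordinates [4, -2].
xs = np.linspace(-10.0, 12.0, 4401)
alpha, _ = iterative_projection(-2.0 * (xs - 1.0) ** 2, [xs, xs**2],
                                xs, -0.5 * xs**2)
```

Because normalization constants have zero covariance with everything, the recovered coordinates are insensitive to how the target is normalized, which is exactly the property that lets Bayesian-space projection sidestep the evidence.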
[Figure: four panels plot $p(x)$ versus $x$ for the prior, the 2- and 4-basis-function estimates, and the posterior; the final panel plots $\mathrm{KL}(q \| p)$ over iterations.]

Figure 5: Example of iterative projection onto 2- and 4-dimensional subspaces spanned by Hermite polynomials, where the measure is taken to be the estimate $q^{(i)}$ at the previous iteration and the basis held fixed. Both the 2- and 4-basis-function estimates were initialized to the prior (first panel) and then iteratively updated (next three panels). The last panel shows the KL divergence between the estimates and the true posterior for 10 iterations. We see that the estimates converged in a few iterations with the 4-basis-function estimate matching the posterior better at every iteration.
[Figure: four panels plot $p(x)$ versus $x$ for the prior, the 2- and 4-basis-function estimates, and the posterior; the final panel plots $\mathrm{KL}(q \| p)$ over iterations.]

Figure 6: Same setup as Figure 5, except that the measure was Gaussian for both estimates. For the 4-basis-function estimate, this involved projecting the current estimate onto the Gaussian subspace. The basis functions were reorthogonalized at each iteration as described in §5, accounting for the small difference in the 2-basis-function case from the previous example. Again, we see both estimates converged well. The 4-basis-function estimate converged to a slightly worse value than when the full estimate was used as the measure rather than its Gaussian projection.

We note that the second example exhibited slightly slower convergence than the first. This is understandable in the 4-basis-function case, as the algorithm is slightly compromised by essentially truncating the measure to a Gaussian. As for the 2-basis-function case, while the two examples differ only in the reorthogonalization of the basis functions, it must be borne in mind that the use of different bases for even the same subspace can result in a different progression of the projections. At convergence, there is still a very slight difference in the KL-divergence values, but this is an artefact of the integration tolerance operating on different variables ($\xi$ instead of $\mathbf{x}$).

One of the major advantages of thinking of $\mathcal{P}$ as a vector space with the definition of vector addition $\oplus$ is that Bayesian inference in general can be viewed as the addition of vectors. Consider the posterior $p(\mathbf{x} | \mathbf{z})$ where $\mathbf{z}$ are some measurements.
Bayes' rule states that
$$
p(\mathbf{x} | \mathbf{z}) = \frac{p(\mathbf{z} | \mathbf{x}) \, p(\mathbf{x})}{p(\mathbf{z})} = p(\mathbf{z} | \mathbf{x}) \oplus p(\mathbf{x}) , \tag{94}
$$
where $p(\mathbf{x})$ is a prior, $p(\mathbf{z} | \mathbf{x})$ is a measurement factor and, as mentioned earlier, we needn't introduce the normalization constant $p(\mathbf{z})$ explicitly when writing the posterior as a vector addition in Bayesian space. If we have several measurements that are statistically independent, then this can be factored as
$$
p(\mathbf{x} | \mathbf{z}) = p(\mathbf{x}) \oplus \bigoplus_{k=1}^{K} p(\mathbf{z}_k | \mathbf{x}_k) , \tag{95}
$$
where $\mathbf{x}_k = \mathbf{P}_k \mathbf{x}$ is a subset of the variables in $\mathbf{x}$, $\mathbf{P}_k$ is a projection matrix, and $\mathbf{z}_k$ is the $k$th measurement. This expresses sparsity in the state description and in the measurements. To keep the notation economical, we shall simply write
$$
p = \bigoplus_{k=0}^{K} p_k , \tag{96}
$$
where $p$ is the posterior and the $p_k$ comprise the prior and the measurements, corresponding to statistically independent data. In other words, the factorization becomes a summation in the Bayesian space $\mathcal{P}$.

Now consider our projective approach to inference. As usual, given a subspace $\mathcal{Q} \subset \mathcal{P}$, the optimal estimate of (96) is given by
$$
q^\star = \mathrm{proj}_{\mathcal{Q}} \, p = \mathrm{proj}_{\mathcal{Q}} \bigoplus_{k=0}^{K} p_k = \bigoplus_{k=0}^{K} \mathrm{proj}_{\mathcal{Q}} \, p_k . \tag{97}
$$
That is, the projection of the sum is the sum of the projections. Each individual projection can be done separately because we are in a linear space. This is of enormous practical advantage because it means that we do not need all of $\mathcal{Q}$ to represent each projection.

We can see this more clearly by defining $\mathcal{P}_k \subset \mathcal{P}$ as the subspace corresponding to the variables $\mathbf{x}_k$. Then
$$
p = \bigoplus_{k=0}^{K} p_k \in \mathcal{P}_0 \oplus \mathcal{P}_1 \oplus \cdots \oplus \mathcal{P}_K \subseteq \mathcal{P} . \tag{98}
$$
In other words, $p$ is contained in the direct sum of the subspaces $\mathcal{P}_k$. Each constituent part $p_k$ may be confined to a smaller subspace of $\mathcal{P}$, depending on the variable dependencies in each term.

If we wish to project $p_k \in \mathcal{P}_k$ onto $\mathcal{Q}$, it will suffice to consider the projection on just $\mathcal{Q}_k = \mathcal{P}_k \cap \mathcal{Q}$, i.e.,
$$
\mathrm{proj}_{\mathcal{Q}} \, p_k = \mathrm{proj}_{\mathcal{Q}_k} \, p_k . \tag{99}
$$
The subspace $\mathcal{Q}_k$ may, and ideally would, be smaller than $\mathcal{Q}$.
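The vector-space operations themselves are easy to realize numerically on a grid; the sketch below is our own illustration of (94), with addition as a normalized pointwise product and scalar multiplication as a normalized powering.

```python
import numpy as np

# Bayes-space operations on a 1-D grid: (+) is a normalized product (Bayes'
# rule without the evidence constant), scalar (.) is a normalized powering.
xs = np.linspace(-10.0, 10.0, 2001)
dx = xs[1] - xs[0]

def normalize(u):
    return u / (np.sum(u) * dx)

def oplus(p, q):        # p (+) q
    return normalize(p * q)

def odot(alpha, p):     # alpha (.) p
    return normalize(p ** alpha)

prior = normalize(np.exp(-0.5 * xs**2))               # N(0, 1)
likelihood = normalize(np.exp(-0.5 * (xs - 2.0)**2))  # N(2, 1) in x
posterior = oplus(likelihood, prior)                  # should be N(1, 0.5)
mean = np.sum(xs * posterior) * dx
var = np.sum((xs - mean) ** 2 * posterior) * dx

half = odot(2.0, prior)                               # should be N(0, 0.5)
m2 = np.sum(xs * half) * dx
v2 = np.sum((xs - m2) ** 2 * half) * dx
```

The product of the two unit-variance Gaussians comes out as $\mathcal{N}(1, 0.5)$, and powering the prior by two halves its variance, matching the closed-form Gaussian results.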
We may refer to $\mathcal{Q}_k$ as the marginal subspace of $\mathcal{Q}$ with respect to the subset of variables $\mathbf{x}_k$. Therefore, the optimal estimate will be given by
$$
q^\star = \mathrm{proj}_{\mathcal{Q}} \, p = \bigoplus_{k=0}^{K} \mathrm{proj}_{\mathcal{Q}_k} \, p_k . \tag{100}
$$

Figure 7: Exploiting sparsity by projecting individual measurements onto marginal subspaces, $\mathcal{Q}_k$.

This means that we can project the PDF associated with each measurement onto a smaller subspace and simply add up the estimates, lifting the overall estimate up into a potentially much larger space. The decomposition and reconstitution is illustrated in Figure 7. Just as with the total posterior, we may describe $q^\star$ as being an element of a direct sum of the individual subspaces of $\mathcal{Q}$, i.e.,
$$
q^\star \in \mathcal{Q}_0 \oplus \mathcal{Q}_1 \oplus \cdots \oplus \mathcal{Q}_K \subseteq \mathcal{Q} . \tag{101}
$$
The subspace sum may be substantially smaller than $\mathcal{Q}$ but again it will depend on the variable dependencies of each term.

This is the key result that allows most practical inference frameworks to function in a tractable way. Depending on the chosen basis for $\mathcal{Q}$, many of the coordinates can potentially be zero and thus it will not be necessary to waste effort computing them or space storing them.

The effect of sparsity as it applies to iterative Gaussian inference is of particular interest. Let us consider the decomposition of a posterior $p$ in accordance with the foregoing; that is,
$$
p = \downarrow\exp(-\phi(\mathbf{x})) = \downarrow\exp\!\left( -\sum_{k=0}^{K} \phi_k(\mathbf{x}_k) \right) = \bigoplus_{k=0}^{K} \downarrow\exp(-\phi_k(\mathbf{x}_k)) = \bigoplus_{k=0}^{K} p_k , \tag{102}
$$
where $\phi_k(\mathbf{x}_k)$ is the $k$th (negative-log) factor expression and $\mathbf{x}_k = \mathbf{P}_k \mathbf{x}$.

As in (51), we may express the variational estimate as
$$
q^{(i+1)} = \mathrm{proj}_{(\mathcal{G}, q^{(i)})} \, p = \downarrow\exp\!\left( -(\mathbf{x} - \boldsymbol{\mu}^{(i)})^T E_{q^{(i)}}\!\left[ \frac{\partial \phi(\mathbf{x})}{\partial \mathbf{x}^T} \right] - \frac{1}{2} (\mathbf{x} - \boldsymbol{\mu}^{(i)})^T E_{q^{(i)}}\!\left[ \frac{\partial^2 \phi(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}} \right] (\mathbf{x} - \boldsymbol{\mu}^{(i)}) \right) , \tag{103}
$$
using the measure $q^{(i)} = \mathcal{N}(\boldsymbol{\mu}^{(i)}, \boldsymbol{\Sigma}^{(i)})$. To take advantage of sparsity, we need to have it reflected in the expectations herein. The first one leads to
$$
E_{q^{(i)}}\!\left[ \frac{\partial \phi(\mathbf{x})}{\partial \mathbf{x}^T} \right] = \sum_{k=0}^{K} E_{q^{(i)}}\!\left[ \frac{\partial \phi_k(\mathbf{x}_k)}{\partial \mathbf{x}^T} \right] = \sum_{k=0}^{K} \left( \frac{\partial \mathbf{x}_k}{\partial \mathbf{x}} \right)^{\!T} E_{q^{(i)}}\!\left[ \frac{\partial \phi_k(\mathbf{x}_k)}{\partial \mathbf{x}_k^T} \right] = \sum_{k=0}^{K} \mathbf{P}_k^T \, E_{q^{(i)}_k}\!\left[ \frac{\partial \phi_k(\mathbf{x}_k)}{\partial \mathbf{x}_k^T} \right] , \tag{104}
$$
given that $\mathbf{x}_k = \mathbf{P}_k \mathbf{x}$. For each factor $k$, then, we are able to shift the differentiation from $\mathbf{x}$ to $\mathbf{x}_k$. We draw attention to the last equality, where the expectation simplifies to using $q^{(i)}_k = q^{(i)}_k(\mathbf{x}_k)$, the marginal of the measure for just the variables in factor $k$. In a similar fashion,
$$
E_{q^{(i)}}\!\left[ \frac{\partial^2 \phi(\mathbf{x})}{\partial \mathbf{x}^T \partial \mathbf{x}} \right] = \sum_{k=0}^{K} \mathbf{P}_k^T \, E_{q^{(i)}_k}\!\left[ \frac{\partial^2 \phi_k(\mathbf{x}_k)}{\partial \mathbf{x}_k^T \partial \mathbf{x}_k} \right] \mathbf{P}_k \tag{105}
$$
accounts for the second expectation in (103).

The implication of the factorization is that each factor, identified by $\phi_k(\mathbf{x}_k)$, is projected onto $\mathcal{G}_k$, the marginal subspace associated with variables $\mathbf{x}_k$. The results can then be recombined for the full variational estimate as
$$
q^{(i+1)} = \mathrm{proj}_{(\mathcal{G}, q^{(i)})} \, p = \bigoplus_{k=0}^{K} \mathrm{proj}_{(\mathcal{G}_k, q^{(i)}_k)} \, p_k = \bigoplus_{k=0}^{K} q^{(i+1)}_k . \tag{106}
$$
The individual projections of $p_k = \downarrow\exp(-\phi_k(\mathbf{x}_k))$ onto $(\mathcal{G}_k, q^{(i)}_k)$ are
$$
q^{(i+1)}_k = \mathrm{proj}_{(\mathcal{G}_k, q^{(i)}_k)} \, p_k = \downarrow\exp\!\left( -\frac{1}{2} (\mathbf{x}_k - \boldsymbol{\mu}^{(i+1)}_k)^T \boldsymbol{\Sigma}^{(i+1)-1}_k (\mathbf{x}_k - \boldsymbol{\mu}^{(i+1)}_k) \right) = \downarrow\exp\!\left( -\frac{1}{2} (\mathbf{x} - \mathbf{P}_k^T \boldsymbol{\mu}^{(i+1)}_k)^T \left( \mathbf{P}_k^T \boldsymbol{\Sigma}^{(i+1)-1}_k \mathbf{P}_k \right) (\mathbf{x} - \mathbf{P}_k^T \boldsymbol{\mu}^{(i+1)}_k) \right) , \tag{107}
$$
where
$$
\boldsymbol{\mu}^{(i+1)}_k = \boldsymbol{\mu}^{(i)}_k - \boldsymbol{\Sigma}^{(i+1)}_k E_{q^{(i)}_k}\!\left[ \frac{\partial \phi_k(\mathbf{x}_k)}{\partial \mathbf{x}_k^T} \right] , \qquad \boldsymbol{\Sigma}^{(i+1)-1}_k = E_{q^{(i)}_k}\!\left[ \frac{\partial^2 \phi_k(\mathbf{x}_k)}{\partial \mathbf{x}_k^T \partial \mathbf{x}_k} \right] . \tag{108}
$$
It is straightforward to show that the vector sum of the $q_k$ from (107) reproduces (103). (Note that $\mathbf{P}_k \mathbf{P}_k^T = \mathbf{1}$ as $\mathbf{P}_k$ is a projection matrix and $\mathbf{P}_k^T$ the corresponding dilation.)

As explained in detail by Barfoot et al. (2020), it would be too expensive for practical problems to construct first $\boldsymbol{\Sigma}^{(i)}$ and then extract the required blocks for the marginals, $q^{(i)}_k = \mathcal{N}(\boldsymbol{\mu}^{(i)}_k, \boldsymbol{\Sigma}^{(i)}_k) = \mathcal{N}(\mathbf{P}_k \boldsymbol{\mu}^{(i)}, \mathbf{P}_k \boldsymbol{\Sigma}^{(i)} \mathbf{P}_k^T)$. We see from the above development that we actually only require the blocks of $\boldsymbol{\Sigma}^{(i)}$ corresponding to the nonzero blocks of its inverse, and the method of Takahashi et al. (1973) can be used to extract the required blocks efficiently. Barfoot et al. (2020) provide numerical experiments showing the efficacy of this approach.

Moreover, if we choose to project onto a larger Hermite basis, we can still continue to use a Gaussian measure since the Hermite basis functions we defined are orthonormal with respect to this measure; this is what was done in the second numerical example above. This means that even with a larger approximation family, we can continue to exploit the method of Takahashi et al. (1973) to extract only the required marginals.

Our principal goal in this work has been to provide a new perspective on the problem of variational inference. This new vantage point is afforded by considering probability density functions as elements in a Bayesian Hilbert space, where vector addition is a normalized multiplication (perturbation) that accounts for Bayes' rule and scalar multiplication is a normalized exponentiation (powering). Gaussians and, more generally, exponential families, which are often used in variational inference, constitute subspaces.
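The factor-wise assembly of (104)-(105) and (107)-(108) can be sketched for purely quadratic factors, where the expectations are exact. The helper and its numbers below are our own illustration (a Laplace-like special case, not the authors' general implementation): each factor touches only the variables in `idx`, and the full information matrix is built by scattering small factor blocks, never forming dense intermediates.

```python
import numpy as np

def sparse_gaussian_step(factors, mu):
    """One information-form update assembled factor-by-factor, in the spirit
    of (104)-(105), for quadratic factors phi_k = 0.5*(x_k - z_k)^T W_k (x_k - z_k)
    (for which the expectations over the marginal measure are exact)."""
    n = len(mu)
    info = np.zeros((n, n))    # Sigma^{-1}, cf. (105)
    rhs = np.zeros(n)          # stacked -E[d phi_k / d x_k], cf. (104)
    for idx, W, z in factors:
        e = mu[idx] - z
        info[np.ix_(idx, idx)] += W    # scatter P_k^T W_k P_k
        rhs[idx] -= W @ e              # scatter -P_k^T grad_k
    dmu = np.linalg.solve(info, rhs)   # (89b)-style solve on the full state
    return mu + dmu, info

# Two overlapping factors on a 3-variable state (hypothetical numbers).
factors = [([0, 1], np.array([[2.0, 0.5], [0.5, 1.0]]), np.array([1.0, 2.0])),
           ([1, 2], np.array([[1.0, 0.0], [0.0, 3.0]]), np.array([0.0, -1.0]))]
mu, info = sparse_gaussian_step(factors, np.zeros(3))
```

For general factors, the two expectations would instead be computed with quadrature against the marginal measure $q^{(i)}_k$, exactly as in the main development; the assembly pattern is unchanged.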
We thus have at our disposal all the familiar instruments of linear algebra.

The use of the Kullback-Leibler divergence $\mathrm{KL}(q \| p)$ in variational inference to find the best approximation $q$ to a given posterior $p$ is widespread. In most approaches, the canvas on which the minimization is carried out is a set, usually convex, or a manifold of admissible functions (Csiszár, 1975; Csiszár and Tusnády, 1984; Amari, 1995; Adamčík, 2014; Amari, 2016). 'Projections' of $p$ onto the set or manifold are ipso facto the PDF $q$ that minimizes the divergence. However, in Bayesian space, we may interpret projections as standard linear-algebraic projections, reminding us of a Euclidean world.

We take particular note of the information geometry of Csiszár and Amari. They, along with their colleagues (Csiszár and Tusnády, 1984; Amari et al., 1992), separately developed the em algorithm (not to be confused with the EM, or expectation-maximization, algorithm, although the two are in many cases equivalent) to solve the generalized variational problem, which involves a dual minimization of $q$ over its manifold and $p$ over its own. (The minimum is therefore the minimum 'distance' between manifolds.) The e-step of the algorithm is performed by making the manifold 'flat,' i.e., linear, as a result of using an exponential family of densities. This flattening is equivalent to thinking in terms of a Bayesian space as we have done here. Indeed, as we have shown, the natural-gradient-descent algorithm of Amari (1998) can be explained in this framework as a Newton-like iterative projection.

Based on the inner product of our Bayesian space, we have suggested a new information measure. It is proportional to the squared norm of a probability distribution, which can be used to establish a (symmetric and quadratic) divergence between two PDFs. The connection to the KL divergence is worth mentioning.
Each step in the iterative-projection algorithm presented here for variational inference based on the KL divergence amounts to a local minimization of our Bayesian-space divergence.

The linear structure of a Bayesian space furthermore allows us to treat sparsity in measurement data very neatly as the vector sum of the measurements, each of which can be expressed as an element in a subspace restricted to the local variables dictated by the sparsity of the problem, for example, as in the simultaneous-localization-and-mapping (SLAM) problem in robotics (Barfoot et al., 2020). The mean-field approximation in variational inference can be handled in much the same way in this framework. The factorization of a distribution with respect to a desired family of distributions would again be rendered as a vector sum of PDFs.

In his fictional autobiography, Zen and the Art of Motorcycle Maintenance, Robert M. Pirsig notes that "One geometry cannot be more true than another; it can only be more convenient." The same can be said of algebra. Whether one takes a geometric or algebraic tack in analyzing a problem, it can be agreed that different perspectives offer different views, and given a particular problem or even a particular class of problems, one tack may sometimes be more convenient than another. We hope the perspective presented here on variational inference using a Bayesian Hilbert space of probability density functions offers not only convenience in some respects but insight and a degree of elegance as well.
References
Adamčík, M., "The Information Geometry of Bregman Divergences and Some Applications in Multi-Expert Reasoning," Entropy, 16:6338–6381, 2014.
Aitchison, J., "The statistical analysis of compositional data (with discussion)," Journal of the Royal Statistical Society, Series B, 44:139–177, 1982.
Amari, S.-I., "The EM Algorithm and Information Geometry in Neural Network Learning," Neural Computation, 7:13–18, 1995.
Amari, S.-I., "Natural gradient works efficiently in learning," Neural Computation, 10(2):251–276, 1998.
Amari, S.-I., Information Geometry and Its Applications, Springer, Japan, 2016.
Amari, S.-I., Kurata, K., and Nagaoka, H., "Information Geometry of Boltzmann Machines," IEEE Transactions on Neural Networks, 3(2):260–271, 1992.
Ambrogioni, L., Güçlü, U., Güçlütürk, Y., Hinne, M., Maris, E., and van Gerven, M. A. J., "Wasserstein Variational Inference," in , 2018.
Barber, D., Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.
Barfoot, T. D., "Stochastic Decentralized Systems," Ph.D. Thesis, University of Toronto, 2002.
Barfoot, T. D., State Estimation for Robotics, Cambridge University Press, 2017.
Barfoot, T. D., "Multivariate Gaussian Variational Inference by Natural Gradient Descent," Technical report, Autonomous Space Robotics Lab, University of Toronto, 2020, arXiv:2001.10025 [stat.ML].
Barfoot, T. D. and D'Eleuterio, G. M. T., "An Algebra for the Control of Stochastic Systems: Exercises in Linear Algebra," in Proceedings of the 5th International Conference on Dynamics and Control of Structures in Space (DCSS), Cambridge, England, 2002.
Barfoot, T. D. and D'Eleuterio, G. M. T., "Stochastic Algebra for Continuous Variables," Technical report, University of Toronto Institute for Aerospace Studies, 2003.
Barfoot, T. D., Forbes, J. R., and Yoon, D. J., "Exactly Sparse Gaussian Variational Inference with Application to Derivative-Free Batch Nonlinear State Estimation," International Journal of Robotics Research, to appear, 2020, arXiv:2001.10025 [stat.ML].
Bayes, T., "Essay towards solving a problem in the doctrine of chances," Philosophical Transactions of the Royal Society of London, 53:370–418, 1763.
Bishop, C. M., Pattern Recognition and Machine Learning, Springer, 2006.
Blei, D. M., Kucukelbir, A., and McAuliffe, J. D., "Variational Inference: A Review for Statisticians," Journal of the American Statistical Association, 112(518):859–877, 2017.
Bregman, L. M., "The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming," USSR Computational Mathematics and Mathematical Physics, 7(3):200–217, 1967.
Csiszár, I., "I-Divergence Geometry of Probability Distributions and Minimization Problems," The Annals of Probability, 3(1):146–158, 1975.
Csiszár, I. and Tusnády, G., "Information Geometry and Alternating Minimization Procedures," in Statistics and Decisions, Supplement Issue No. 1, R. Oldenbourg, 1984.
Egozcue, J. J., Pawlowsky-Glahn, V., Tolosana-Delgado, R., Ortego, M. I., and van den Boogaart, K. G., "Bayes spaces: use of improper distributions and exponential families," RACSAM: Revista de la Real Academia de Ciencias Exactas, Físicas y Naturales. Serie A. Matemáticas, 107:475–486, 2013.
Egozcue, J. J., Diaz-Barrero, J. L., and Pawlowsky-Glahn, V., "Hilbert Space of Probability Density Functions Based on Aitchison Geometry," Acta Mathematica Sinica, 22:1175–1182, 2006.
Fisher, R. A., "On the mathematical foundations of theoretical statistics," Philosophical Transactions of the Royal Society of London, Series A, 222(594–604):309–368, 1922.
Hinton, G. E. and van Camp, D., "Keeping Neural Networks Simple by Minimizing the Description Length of the Weights," in Sixth ACM Conference on Computational Learning Theory, Santa Cruz, California, 1993.
Jazwinski, A. H., Stochastic Processes and Filtering Theory, Academic, New York, 1970.
Jordan, M. I., Ghahramani, Z., Jaakkola, T., and Saul, L. K., "An Introduction to Variational Methods for Graphical Models," Machine Learning, 37:183–233, 1999.
Kullback, S. and Leibler, R. A., "On information and sufficiency," The Annals of Mathematical Statistics, 22(1):79–86, 1951.
Laplace, P.-S., Philosophical Essay on Probabilities, Springer, 1995, translated by Andrew I. Dale from the fifth French edition, 1825.
Li, Y. and Turner, R. E., "Rényi Divergence Variational Inference," in , 2016.
Magnus, J. R. and Neudecker, H., "The elimination matrix: some lemmas and applications," SIAM Journal on Algebraic Discrete Methods, 1(4):422–449, 1980.
Magnus, J. R. and Neudecker, H., Matrix Differential Calculus with Applications in Statistics and Econometrics, John Wiley & Sons, 2019.
Manton, J. H. and Amblard, P.-O., "A Primer on Reproducing Kernel Hilbert Spaces," Technical report, The University of Melbourne and CNRS, 2015, arXiv:1408.0952v2 [math.HO].
McGrayne, S. B., The Theory That Would Not Die: How Bayes' Rule Cracked the Enigma Code, Hunted Down Russian Submarines, and Emerged Triumphant from Two Centuries of Controversy, Yale University Press, New Haven, Connecticut, 2011.
Monge, G., "Mémoire sur la théorie des déblais et des remblais," Histoire de l'Académie Royale des Sciences de Paris, 1781.
Painsky, A. and Wornell, G. G., "Bregman Divergence Bounds and Universality Properties of the Logarithmic Loss," Technical report, Department of Industrial Engineering, Tel Aviv University, 2020, arXiv:1810.07014v2 [cs.IT].
Pawlowsky-Glahn, V. and Egozcue, J. J., "Geometric Approach to Statistical Analysis on the Simplex," Stochastic Environmental Research and Risk Assessment, 15:384–398, 2001.
Rényi, A., "On Measures of Entropy and Information," in Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 547–561, 1961.
Stein, C. M., "Estimation of the Mean of a Multivariate Normal Distribution," Annals of Statistics, 9(6):1135–1151, 1981.
Takahashi, K., Fagan, J., and Chen, M.-S., "A Sparse Bus Impedance Matrix and its Application to Short Circuit Study," in Proceedings of the PICA Conference, 1973.
van den Boogaart, K. G., Egozcue, J. J., and Pawlowsky-Glahn, V., "Bayes Linear Spaces," Statistics and Operations Research Transactions, 34(2):201–222, 2010.
van den Boogaart, K. G., Egozcue, J. J., and Pawlowsky-Glahn, V., "Bayes Hilbert Spaces," Australian and New Zealand Journal of Statistics, 56(2):171–194, 2014.
Wainwright, M. J. and Jordan, M. I., "Graphical Models, Exponential Families, and Variational Inference," Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
A Kronecker Product, vec and vech Operators, and Duplication Matrices
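The identities collected in this appendix are easy to spot-check numerically. The following sketch (assuming NumPy; the helper names `vec`, `vech`, and `duplication_matrix` are ours, introduced only for illustration) verifies $\mathrm{vec}(\mathbf{ABC}) = (\mathbf{C}^T\otimes\mathbf{A})\,\mathrm{vec}(\mathbf{B})$ and the duplication-matrix relations for random matrices.

```python
import numpy as np

rng = np.random.default_rng(0)

def vec(A):
    # Stack the columns of A into a single column (Fortran order).
    return A.reshape(-1, 1, order="F")

def vech(A):
    # Stack the columns of A, keeping only entries on or below the diagonal.
    n = A.shape[0]
    return np.concatenate([A[j:, j] for j in range(n)])[:, None]

def duplication_matrix(n):
    # Columns are vec of the symmetric "unit" matrices, so D vech(A) = vec(A).
    cols = []
    for j in range(n):
        for i in range(j, n):
            T = np.zeros((n, n))
            T[i, j] = T[j, i] = 1.0
            cols.append(vec(T).ravel())
    return np.stack(cols, axis=1)

n = 3
A = rng.standard_normal((n, n)); A = A + A.T        # symmetric
B = rng.standard_normal((n, n)); C = rng.standard_normal((n, n))
M = rng.standard_normal((n, n))                     # arbitrary

# Two of the identities in (109).
assert np.allclose(vec(A @ B @ C), np.kron(C.T, A) @ vec(B))
assert np.allclose(vec(A).T @ vec(B), np.trace(A.T @ B))

D = duplication_matrix(n)
Dp = np.linalg.inv(D.T @ D) @ D.T                   # pseudoinverse, as in (112)
assert np.allclose(D @ vech(A), vec(A))             # (110), symmetric A
assert np.allclose(vech(A), Dp @ vec(A))            # (113), symmetric A
assert np.allclose(Dp @ D, np.eye(D.shape[1]))      # (115), first identity
assert np.allclose(D @ Dp @ np.kron(M, M) @ D,
                   np.kron(M, M) @ D)               # (115), last identity, any M
```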
There are several identities, of which we have made use, involving the Kronecker product $\otimes$ and the vectorization operator $\operatorname{vec}(\cdot)$ that stacks the columns of a matrix:
\begin{gather}
\operatorname{vec}(\mathbf{a}) \equiv \mathbf{a}, \qquad \operatorname{vec}(\mathbf{a}\mathbf{b}^T) \equiv \mathbf{b}\otimes\mathbf{a}, \qquad \operatorname{vec}(\mathbf{A}\mathbf{B}\mathbf{C}) \equiv (\mathbf{C}^T\otimes\mathbf{A})\operatorname{vec}(\mathbf{B}), \nonumber\\
\operatorname{vec}(\mathbf{A})^T\operatorname{vec}(\mathbf{B}) \equiv \operatorname{tr}(\mathbf{A}^T\mathbf{B}), \qquad (\mathbf{A}\otimes\mathbf{B})(\mathbf{C}\otimes\mathbf{D}) \equiv (\mathbf{A}\mathbf{C})\otimes(\mathbf{B}\mathbf{D}), \nonumber\\
(\mathbf{A}\otimes\mathbf{B})^{-1} \equiv \mathbf{A}^{-1}\otimes\mathbf{B}^{-1}, \qquad (\mathbf{A}\otimes\mathbf{B})^T \equiv \mathbf{A}^T\otimes\mathbf{B}^T. \tag{109}
\end{gather}
It is worth noting that $\otimes$ and $\operatorname{vec}(\cdot)$ are linear operators.

As we will be working with (symmetric) covariance matrices when discussing Gaussians, we would like to be able to represent them parsimoniously in terms of only their unique variables. Following Magnus and Neudecker (2019, §18), we introduce the half-vectorization operator $\operatorname{vech}(\cdot)$ that stacks up the elements in a column matrix, excluding all the elements above the main diagonal. The duplication matrix $\mathbf{D}$ allows us to recover a full symmetric matrix from its unique parts:
\[ \operatorname{vec}(\mathbf{A}) = \mathbf{D}\operatorname{vech}(\mathbf{A}) \quad (\text{symmetric } \mathbf{A}). \tag{110} \]
It is helpful to consider a simple $2\times 2$ example:
\[ \mathbf{A} = \begin{bmatrix} a & b \\ b & c \end{bmatrix}, \quad \operatorname{vec}(\mathbf{A}) = \begin{bmatrix} a \\ b \\ b \\ c \end{bmatrix}, \quad \mathbf{D} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \quad \operatorname{vech}(\mathbf{A}) = \begin{bmatrix} a \\ b \\ c \end{bmatrix}. \tag{111} \]
The Moore–Penrose pseudoinverse of $\mathbf{D}$ will be denoted $\mathbf{D}^\dagger$ and is given by
\[ \mathbf{D}^\dagger = \left(\mathbf{D}^T\mathbf{D}\right)^{-1}\mathbf{D}^T. \tag{112} \]
We can then use $\mathbf{D}^\dagger$ to convert the vectorization of a matrix into its half-vectorization:
\[ \operatorname{vech}(\mathbf{A}) = \mathbf{D}^\dagger\operatorname{vec}(\mathbf{A}) \quad (\text{symmetric } \mathbf{A}). \tag{113} \]
For our $2\times 2$ example we have
\[ \mathbf{D}^\dagger = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & \tfrac{1}{2} & \tfrac{1}{2} & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \tag{114} \]
Useful identities involving $\mathbf{D}$ are then
\begin{gather}
\mathbf{D}^\dagger\mathbf{D} \equiv \mathbf{1}, \qquad \mathbf{D}^{\dagger T}\mathbf{D}^T \equiv \mathbf{D}\mathbf{D}^\dagger, \nonumber\\
\mathbf{D}\mathbf{D}^\dagger\operatorname{vec}(\mathbf{A}) \equiv \operatorname{vec}(\mathbf{A}) \quad (\text{symmetric } \mathbf{A}), \qquad \mathbf{D}\mathbf{D}^\dagger(\mathbf{A}\otimes\mathbf{A})\mathbf{D} \equiv (\mathbf{A}\otimes\mathbf{A})\mathbf{D} \quad (\text{any } \mathbf{A}), \tag{115}
\end{gather}
which can be found in Magnus and Neudecker (1980).

B Outer Products
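Before the Bayesian-space version, it may help to see the finite-dimensional analogue of the constructions in this appendix: with an ordinary inner product, the kernel $Q = \mathbf{b}\rangle\langle\mathbf{b},\mathbf{b}\rangle^{-1}\langle\mathbf{b}$ of a subspace is simply the orthogonal projector onto the span of the basis. A sketch (assuming NumPy; the specific basis below is arbitrary and chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# A nonorthonormal basis b_1..b_M for a subspace S of R^5, stored as columns.
B = rng.standard_normal((5, 3))

# Finite-dimensional analogue of (124): Gram matrix <b,b>, then
# Q = B <b,b>^{-1} B^T, whose entries play the role of kappa_mn.
G = B.T @ B
Q = B @ np.linalg.inv(G) @ B.T

# For any s in S (a combination of the basis vectors), Q s = s.
s = B @ np.array([0.3, -1.2, 2.0])
assert np.allclose(Q @ s, s)

# Q is idempotent and symmetric, as a projection kernel should be.
assert np.allclose(Q @ Q, Q)
assert np.allclose(Q, Q.T)
```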
The outer product $\Phi : \mathcal{P} \to \mathcal{P}$ of two vectors $b = b(\mathbf{x})$, $c = c(\mathbf{x}') \in \mathcal{P}$, denoted $\Phi(\mathbf{x},\mathbf{x}') = b(\mathbf{x})\rangle\langle c(\mathbf{x}')$ or briefly $\Phi = b\rangle\langle c$, is defined by its operation on arbitrary $d = d(\mathbf{x}') \in \mathcal{P}$ as
\[ \Phi(\mathbf{x},\mathbf{x}') \circledast d(\mathbf{x}') = b(\mathbf{x})\rangle\langle c(\mathbf{x}') \circledast d(\mathbf{x}') = b(\mathbf{x}) \cdot \langle c,d\rangle = \langle c,d\rangle \cdot b(\mathbf{x}). \tag{116} \]
Thus, dropping the functional dependence,
\[ \langle a, \Phi \circledast d\rangle = \langle a,b\rangle\langle c,d\rangle \tag{117} \]
for arbitrary $a \in \mathcal{P}$. More generally,
\[ \Phi = \bigoplus_{i=1}^{M}\bigoplus_{j=1}^{N} \phi_{ij}\cdot b_i\rangle\langle c_j, \tag{118} \]
where $b_i, c_j \in \mathcal{P}$ and $\phi_{ij}\in\mathbb{R}$, so that
\[ \Phi \circledast d = \bigoplus_{i=1}^{M}\left(\sum_{j=1}^{N}\phi_{ij}\langle c_j,d\rangle\right)\cdot b_i \tag{119} \]
and
\[ \langle a, \Phi\circledast d\rangle = \sum_{i=1}^{M}\sum_{j=1}^{N}\phi_{ij}\langle a,b_i\rangle\langle c_j,d\rangle. \tag{120} \]
Defining the matrix $\boldsymbol{\Phi} = [\phi_{ij}]\in\mathbb{R}^{M\times N}$ and
\[ \mathbf{b}(\mathbf{x}) = \begin{bmatrix} b_1(\mathbf{x}) \\ b_2(\mathbf{x}) \\ \vdots \\ b_M(\mathbf{x}) \end{bmatrix}, \qquad \mathbf{c}(\mathbf{x}') = \begin{bmatrix} c_1(\mathbf{x}') \\ c_2(\mathbf{x}') \\ \vdots \\ c_N(\mathbf{x}') \end{bmatrix}, \tag{121} \]
we may abbreviate (118) to
\[ \Phi(\mathbf{x},\mathbf{x}') = \mathbf{b}(\mathbf{x})\rangle\,\boldsymbol{\Phi}\,\langle\mathbf{c}(\mathbf{x}') \tag{122} \]
and hence $\langle a,\Phi\circledast d\rangle = \langle a,\mathbf{b}\rangle\,\boldsymbol{\Phi}\,\langle\mathbf{c},d\rangle$, where $\langle a,\mathbf{b}\rangle$ is interpreted as a row and $\langle\mathbf{c},d\rangle$ as a column.

Given an orthonormal basis $\{b_1, b_2, \cdots, b_M\}$ for a subspace $\mathcal{S}\subset\mathcal{P}$,
\[ Q = \bigoplus_{m=1}^{M} b_m\rangle\langle b_m \equiv \mathbf{b}\rangle\langle\mathbf{b} \tag{123} \]
is the kernel of $\mathcal{S}$ and thus, for any $s\in\mathcal{S}$, $Q\circledast s = s$ (Manton and Amblard, 2015). For a nonorthonormal basis,
\[ Q = \bigoplus_{m=1}^{M}\bigoplus_{n=1}^{M} \kappa_{mn}\cdot b_m\rangle\langle b_n \equiv \mathbf{b}\rangle\langle\mathbf{b},\mathbf{b}\rangle^{-1}\langle\mathbf{b}, \tag{124} \]
where $\kappa_{mn}$ is the $(m,n)$ entry in $\langle\mathbf{b},\mathbf{b}\rangle^{-1}$.

C Proofs of Lemma 1 and Lemma 2
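The two lemmas proved below lend themselves to a quick numerical sanity check. The sketch below (assuming NumPy, whose `hermite_e` module implements the probabilists' Hermite polynomials $H_n$ used in the paper) verifies Lemma 2, $E_\nu[H_n f] = E_\nu[\partial^n f/\partial\xi^n]$, for a polynomial test function of our own choosing, using Gauss–Hermite quadrature under the standard normal measure.

```python
import numpy as np
from numpy.polynomial import hermite_e as herme

# Nodes/weights for weight exp(-x^2/2); normalize to the N(0,1) measure.
x, w = herme.hermegauss(30)
w = w / np.sqrt(2.0 * np.pi)

def E(vals):
    # Expectation under nu = N(0,1) by quadrature (exact for polynomials here).
    return np.sum(w * vals)

n = 2
Hn = herme.hermeval(x, [0, 0, 1])   # H_2(xi) = xi^2 - 1
f = x**4                            # test function f(xi) = xi^4
d2f = 12.0 * x**2                   # d^2 f / d xi^2

lhs = E(Hn * f)   # E[(xi^2 - 1) xi^4] = E[xi^6] - E[xi^4] = 15 - 3 = 12
rhs = E(d2f)      # E[12 xi^2] = 12
assert np.isclose(lhs, rhs)
assert np.isclose(lhs, 12.0)
```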
Proof of Lemma 1.
The $n=0$ case,
\[ E_\nu[\xi f] = E_\nu\!\left[\frac{\partial f}{\partial\xi}\right], \tag{125} \]
is immediately true by Stein's lemma (Stein, 1981). For the general case $n$,
\[ E_\nu[H_{n+1} f] = E_\nu\!\left[\left(\xi H_n - \frac{\partial H_n}{\partial\xi}\right) f\right] = E_\nu\!\left[\frac{\partial}{\partial\xi}(H_n f) - \frac{\partial H_n}{\partial\xi} f\right] = E_\nu\!\left[\frac{\partial H_n}{\partial\xi} f + H_n\frac{\partial f}{\partial\xi} - \frac{\partial H_n}{\partial\xi} f\right] = E_\nu\!\left[H_n\frac{\partial f}{\partial\xi}\right], \tag{126} \]
where the second equality follows from Stein's lemma applied to $\xi\,(H_n f)$ and we have used the recurrence relation,
\[ H_{n+1} = \xi H_n - \frac{\partial H_n}{\partial\xi}, \tag{127} \]
for the Hermite polynomials.

Proof of Lemma 2.
Repeatedly applying Lemma 1,
\[ E_\nu[H_n f] = E_\nu\!\left[H_{n-1}\frac{\partial f}{\partial\xi}\right] = E_\nu\!\left[H_{n-2}\frac{\partial^2 f}{\partial\xi^2}\right] = \cdots = E_\nu\!\left[H_1\frac{\partial^{n-1} f}{\partial\xi^{n-1}}\right] = E_\nu\!\left[H_0\frac{\partial^n f}{\partial\xi^n}\right] = E_\nu\!\left[\frac{\partial^n f}{\partial\xi^n}\right], \tag{128} \]
yields the desired result.

D Hermite Basis on $\mathbb{R}^N$

We can extend the results of §3.2 to create a Hermite basis for $\mathcal{P}$ on $\mathbb{R}^N$. Let
\[ \boldsymbol{\eta}(\xi) = \begin{bmatrix} \frac{1}{\sqrt{0!}}H_0(\xi) \\ \frac{1}{\sqrt{1!}}H_1(\xi) \\ \vdots \\ \frac{1}{\sqrt{M!}}H_M(\xi) \end{bmatrix}. \tag{129} \]
Note that we have reintroduced $H_0(\xi)$ because the basis will be created by all possible combinatorial $N$-products of these functions, one for each variable in $\boldsymbol{\xi}\in\mathbb{R}^N$. However, we will have to exclude the combination made up of only $H_0$ because once again this function gives the zero vector of $\mathcal{P}$. We may express this operation as a Kronecker product, i.e.,
\[ \boldsymbol{\eta}(\boldsymbol{\xi}) = \mathbf{C}\left(\boldsymbol{\eta}(\xi_1)\otimes\boldsymbol{\eta}(\xi_2)\otimes\cdots\otimes\boldsymbol{\eta}(\xi_N)\right), \tag{130} \]
where $\mathbf{C} = [\,\mathbf{0}\ \ \mathbf{1}\,]$ contains zero in the first column followed by the identity matrix; this removes the offending function. Observe that $\mathbf{C}\mathbf{C}^T = \mathbf{1}$. The basis is then
\[ \mathbf{h}(\boldsymbol{\xi}) = \downarrow\exp(-\boldsymbol{\eta}(\boldsymbol{\xi})). \tag{131} \]
The total number of basis functions is $(M+1)^N - 1$.

This set of basis functions retains its orthonormality because
\[ \langle\mathbf{h}(\boldsymbol{\xi}),\mathbf{h}(\boldsymbol{\xi})\rangle = E_\nu\!\left[\mathbf{C}\left(\boldsymbol{\eta}(\xi_1)\otimes\cdots\otimes\boldsymbol{\eta}(\xi_N)\right)\left(\mathbf{C}\left(\boldsymbol{\eta}(\xi_1)\otimes\cdots\otimes\boldsymbol{\eta}(\xi_N)\right)\right)^T\right] = \mathbf{C}\,E_\nu\!\left[\boldsymbol{\eta}(\xi_1)\boldsymbol{\eta}(\xi_1)^T\otimes\cdots\otimes\boldsymbol{\eta}(\xi_N)\boldsymbol{\eta}(\xi_N)^T\right]\mathbf{C}^T \tag{132} \]
by a property of the Kronecker product (Appendix A). Now
\[ E_\nu\!\left[\boldsymbol{\eta}(\xi_1)\boldsymbol{\eta}(\xi_1)^T\otimes\cdots\otimes\boldsymbol{\eta}(\xi_N)\boldsymbol{\eta}(\xi_N)^T\right] = \int_{-\infty}^{\infty}\boldsymbol{\eta}(\xi_1)\boldsymbol{\eta}(\xi_1)^T\nu(\xi_1)\,d\xi_1\otimes\cdots\otimes\int_{-\infty}^{\infty}\boldsymbol{\eta}(\xi_N)\boldsymbol{\eta}(\xi_N)^T\nu(\xi_N)\,d\xi_N = \mathbf{1}_{(M+1)^N\times(M+1)^N}, \tag{133} \]
wherein each of the integrals expresses the orthonormality of the Hermite functions and results in an $(M+1)\times(M+1)$ identity matrix. Hence
\[ \langle\mathbf{h}(\boldsymbol{\xi}),\mathbf{h}(\boldsymbol{\xi})\rangle = \mathbf{C}\,\mathbf{1}_{(M+1)^N\times(M+1)^N}\,\mathbf{C}^T = \mathbf{1}_{((M+1)^N-1)\times((M+1)^N-1)}. \tag{134} \]
To determine the coordinates of an arbitrary PDF $p\in\mathcal{H}$, we shall require the multivariate version of Lemma 2:

Lemma 3.
For the standard normal measure, $\nu\sim\mathcal{N}(\mathbf{0},\mathbf{1})$,
\[ E_\nu\!\left[H_{n_1}(\xi_1)H_{n_2}(\xi_2)\cdots H_{n_N}(\xi_N)\,f(\boldsymbol{\xi})\right] = E_\nu\!\left[\frac{\partial^{\,n_1+n_2+\cdots+n_N} f(\boldsymbol{\xi})}{\partial\xi_1^{n_1}\partial\xi_2^{n_2}\cdots\partial\xi_N^{n_N}}\right], \qquad n_k = 1,\cdots,M, \tag{135} \]
where $f : \mathbb{R}^N\to\mathbb{R}$ is $n_k$-fold differentiable in $\xi_k$ and is such that the expectations exist.

The proof relies on the use of Lemma 2 for each individual partial derivative; for example, with respect to the variable $\xi_1$,
\[ E_\nu\!\left[H_{n_1}(\xi_1)H_{n_2}(\xi_2)\cdots H_{n_N}(\xi_N)\,f(\boldsymbol{\xi})\right] = E_\nu\!\left[H_{n_1-1}(\xi_1)H_{n_2}(\xi_2)\cdots H_{n_N}(\xi_N)\,\frac{\partial f(\boldsymbol{\xi})}{\partial\xi_1}\right]. \tag{136} \]
The product $H_{n_2}(\xi_2)\cdots H_{n_N}(\xi_N)$ has no dependence on $\xi_1$ and can therefore be treated as a constant. Doing the same for all the other variables and for the indicated number of times leads to the stated result.

We can streamline the notation by defining
\[ \boldsymbol{\partial}_\xi = \begin{bmatrix} 1 & \dfrac{\partial}{\partial\xi} & \dfrac{\partial^2}{\partial\xi^2} & \cdots & \dfrac{\partial^M}{\partial\xi^M} \end{bmatrix} \tag{137} \]
and, as above,
\[ \boldsymbol{\partial}_{\boldsymbol{\xi}} = \mathbf{C}\left(\boldsymbol{\partial}_{\xi_1}\otimes\boldsymbol{\partial}_{\xi_2}\otimes\cdots\otimes\boldsymbol{\partial}_{\xi_N}\right). \tag{138} \]
Using the measure $\nu = \mathcal{N}(\mathbf{0},\mathbf{1})$, then,
\[ \boldsymbol{\alpha} = \langle\mathbf{h}(\boldsymbol{\xi}), p(\boldsymbol{\xi})\rangle = E_\nu\!\left[\boldsymbol{\partial}_{\boldsymbol{\xi}}\, p(\boldsymbol{\xi})\right] \tag{139} \]
are the coordinates of $p(\boldsymbol{\xi})\in\mathcal{H}$.

E Coordinates of Multivariate Gaussian Projection
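The key step in the derivation below is the multivariate form of Stein's lemma, $E_\nu[(\mathbf{x}-\boldsymbol{\mu})\phi(\mathbf{x})] = \boldsymbol{\Sigma}\,E_\nu[\partial\phi/\partial\mathbf{x}^T]$. A sketch of a numerical check (assuming NumPy; the quadratic $\phi$ and the specific $\boldsymbol{\mu}$, $\mathbf{L}$, $\mathbf{A}$ below are ours, chosen only for illustration), using tensor-product Gauss–Hermite quadrature so the comparison is exact up to round-off:

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

mu = np.array([0.5, -1.0])
L = np.array([[1.0, 0.0], [0.4, 0.8]])   # Sigma = L L^T
Sigma = L @ L.T

# phi(x) = x^T A x with symmetric A, so d phi / dx^T = 2 A x.
A = np.array([[2.0, 0.3], [0.3, 1.0]])

# 2-D expectation under N(mu, Sigma) via x = mu + L z, z ~ N(0, 1).
t, w = hermegauss(8)
w = w / np.sqrt(2.0 * np.pi)
Z = np.array([[z1, z2] for z1 in t for z2 in t])   # standard-normal nodes
W = np.array([w1 * w2 for w1 in w for w2 in w])    # product weights
X = mu + Z @ L.T

phi = np.einsum("ni,ij,nj->n", X, A, X)
lhs = (W[:, None] * (X - mu) * phi[:, None]).sum(axis=0)   # E[(x - mu) phi(x)]
rhs = Sigma @ (2.0 * A @ (W[:, None] * X).sum(axis=0))     # Sigma E[2 A x] = 2 Sigma A mu
assert np.allclose(lhs, rhs)
```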
Let $p(\mathbf{x}) = \downarrow\exp(-\phi(\mathbf{x}))\in\mathcal{P}$. Projecting onto $\mathcal{G}$, the coordinates associated with basis functions $\mathbf{g}_1$ are
\begin{align}
\boldsymbol{\alpha}_1 = \langle\mathbf{g}_1, p\rangle &= E_\nu[\boldsymbol{\gamma}_1(\mathbf{x})\phi(\mathbf{x})] - E_\nu[\boldsymbol{\gamma}_1(\mathbf{x})]\,E_\nu[\phi(\mathbf{x})] \nonumber\\
&= E_\nu\!\left[\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})\,\phi(\mathbf{x})\right] - \underbrace{E_\nu\!\left[\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right]}_{\mathbf{0}}\,E_\nu[\phi(\mathbf{x})] \tag{140}\\
&= \mathbf{L}^{-1}\boldsymbol{\Sigma}\,E_\nu\!\left[\frac{\partial\phi(\mathbf{x})}{\partial\mathbf{x}^T}\right] = \mathbf{L}^T E_\nu\!\left[\frac{\partial\phi(\mathbf{x})}{\partial\mathbf{x}^T}\right], \nonumber
\end{align}
where we have employed Stein's lemma (Stein, 1981) to obtain the last line. Taking the inner product of these coefficients with the associated basis functions we have
\[ \boldsymbol{\alpha}_1^T\boldsymbol{\gamma}_1(\mathbf{x}) = E_\nu\!\left[\frac{\partial\phi(\mathbf{x})}{\partial\mathbf{x}^T}\right]^T\mathbf{L}\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = E_\nu\!\left[\frac{\partial\phi(\mathbf{x})}{\partial\mathbf{x}^T}\right]^T(\mathbf{x}-\boldsymbol{\mu}). \tag{141} \]
The coordinates associated with basis functions $\mathbf{g}_2$ are
\begin{align}
\boldsymbol{\alpha}_2 = \langle\mathbf{g}_2, p\rangle &= E_\nu[\boldsymbol{\gamma}_2(\mathbf{x})\phi(\mathbf{x})] - E_\nu[\boldsymbol{\gamma}_2(\mathbf{x})]\,E_\nu[\phi(\mathbf{x})] \nonumber\\
&= E_\nu\!\left[\tfrac{1}{\sqrt{2}}\sqrt{\mathbf{D}^T\mathbf{D}}\operatorname{vech}\!\left(\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{L}^{-T}\right)\phi(\mathbf{x})\right] \nonumber\\
&\qquad - E_\nu\!\left[\tfrac{1}{\sqrt{2}}\sqrt{\mathbf{D}^T\mathbf{D}}\operatorname{vech}\!\left(\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{L}^{-T}\right)\right]E_\nu[\phi(\mathbf{x})] \tag{142}\\
&= \tfrac{1}{\sqrt{2}}\sqrt{\mathbf{D}^T\mathbf{D}}\operatorname{vech}\!\left(\mathbf{L}^{-1}\left(E_\nu\!\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\phi(\mathbf{x})\right] - \boldsymbol{\Sigma}\,E_\nu[\phi(\mathbf{x})]\right)\mathbf{L}^{-T}\right) \nonumber\\
&= \tfrac{1}{\sqrt{2}}\sqrt{\mathbf{D}^T\mathbf{D}}\operatorname{vech}\!\left(\mathbf{L}^{-1}\boldsymbol{\Sigma}\,E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right]\boldsymbol{\Sigma}\mathbf{L}^{-T}\right) = \tfrac{1}{\sqrt{2}}\sqrt{\mathbf{D}^T\mathbf{D}}\operatorname{vech}\!\left(\mathbf{L}^T E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right]\mathbf{L}\right), \nonumber
\end{align}
\begin{align}
\boldsymbol{\alpha}_2^T\boldsymbol{\gamma}_2(\mathbf{x}) &= \frac{1}{2}\operatorname{vech}\!\left(\mathbf{L}^T E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right]\mathbf{L}\right)^T\mathbf{D}^T\mathbf{D}\operatorname{vech}\!\left(\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{L}^{-T}\right) \nonumber\\
&= \frac{1}{2}\operatorname{vec}\!\left(\mathbf{L}^T E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right]\mathbf{L}\right)^T\underbrace{\mathbf{D}^{\dagger T}\mathbf{D}^T\mathbf{D}}_{\mathbf{D}}\,\mathbf{D}^\dagger\operatorname{vec}\!\left(\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{L}^{-T}\right) \nonumber\\
&= \frac{1}{2}\operatorname{vec}\!\left(\mathbf{L}^T E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right]\mathbf{L}\right)^T\underbrace{\mathbf{D}\mathbf{D}^\dagger\operatorname{vec}\!\left(\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{L}^{-T}\right)}_{\text{use (115), third identity}} \nonumber\\
&= \frac{1}{2}\operatorname{vec}\!\left(\mathbf{L}^T E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right]\mathbf{L}\right)^T\operatorname{vec}\!\left(\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{L}^{-T}\right) \tag{143}\\
&= \frac{1}{2}\operatorname{tr}\!\left(\mathbf{L}^T E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right]\mathbf{L}\,\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{L}^{-T}\right) \nonumber\\
&= \frac{1}{2}\operatorname{tr}\!\left((\mathbf{x}-\boldsymbol{\mu})^T\mathbf{L}^{-T}\mathbf{L}^T E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right]\mathbf{L}\mathbf{L}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) = \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right](\mathbf{x}-\boldsymbol{\mu}). \nonumber
\end{align}
Combining these we have
\[ \operatorname{proj}_{(\mathcal{G},\nu)}\, p = \downarrow\exp\!\left(-(\mathbf{x}-\boldsymbol{\mu})^T E_\nu\!\left[\frac{\partial\phi(\mathbf{x})}{\partial\mathbf{x}^T}\right] - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right](\mathbf{x}-\boldsymbol{\mu})\right) \tag{144} \]
for the projected PDF in terms of its Gaussian basis.

F Gaussian Information
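The calculation below leans on the identity $\operatorname{tr}\!\left(\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\right) = \left(\operatorname{vec}\boldsymbol{\Sigma}'^{-1}\right)^T(\boldsymbol{\Sigma}\otimes\boldsymbol{\Sigma})\operatorname{vec}\boldsymbol{\Sigma}'^{-1}$, which follows from the third and fourth identities in (109). A quick numerical confirmation (assuming NumPy; the random covariances are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_spd(n):
    # Random symmetric positive-definite matrix.
    M = rng.standard_normal((n, n))
    return M @ M.T + n * np.eye(n)

def vec(A):
    # Column-stacking vectorization.
    return A.reshape(-1, order="F")

Sigma = random_spd(3)        # measure covariance Sigma
Sigma_p = random_spd(3)      # Gaussian covariance Sigma'
P = np.linalg.inv(Sigma_p)   # Sigma'^{-1}

lhs = np.trace(P @ Sigma @ P @ Sigma)
rhs = vec(P) @ np.kron(Sigma, Sigma) @ vec(P)
assert np.isclose(lhs, rhs)
```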
We calculate here the information $I$ contained in a multivariate Gaussian distribution, $g(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}',\boldsymbol{\Sigma}')\in\mathcal{G}^+$. We have
\[ g(\mathbf{x}) = \downarrow\exp(-\phi(\mathbf{x})) \tag{145} \]
with
\[ \phi(\mathbf{x}) = \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}')^T\boldsymbol{\Sigma}'^{-1}(\mathbf{x}-\boldsymbol{\mu}'). \tag{146} \]
The measure is taken as $\nu = \mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$.

Using our orthonormal basis for $\mathcal{G}$, the information in $g$ is
\[ I(g) = \frac{1}{2}\|g\|^2 = \frac{1}{2}\langle g,g\rangle = \frac{1}{2}\left(\boldsymbol{\alpha}_1^T\boldsymbol{\alpha}_1 + \boldsymbol{\alpha}_2^T\boldsymbol{\alpha}_2\right), \tag{147} \]
where $\boldsymbol{\alpha}_1$ and $\boldsymbol{\alpha}_2$ are the coordinates. As
\[ E_\nu\!\left[\frac{\partial\phi(\mathbf{x})}{\partial\mathbf{x}^T}\right] = \boldsymbol{\Sigma}'^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}'), \qquad E_\nu\!\left[\frac{\partial^2\phi(\mathbf{x})}{\partial\mathbf{x}^T\partial\mathbf{x}}\right] = \boldsymbol{\Sigma}'^{-1}, \tag{148} \]
these coordinates are, by (50),
\[ \boldsymbol{\alpha}_1 = \mathbf{L}^T\boldsymbol{\Sigma}'^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}'), \qquad \boldsymbol{\alpha}_2 = \frac{1}{\sqrt{2}}\sqrt{\mathbf{D}^T\mathbf{D}}\,\mathbf{D}^\dagger\operatorname{vec}\!\left(\mathbf{L}^T\boldsymbol{\Sigma}'^{-1}\mathbf{L}\right). \tag{149} \]
Hence, from (147),
\[ I(g) = \frac{1}{2}\left((\boldsymbol{\mu}-\boldsymbol{\mu}')^T\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}'^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}') + \frac{1}{2}\operatorname{tr}\!\left(\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\right)\right), \tag{150} \]
where the second term is a result of the fourth identity in (109) and the third in (115). It will, however, be instructive to rewrite the terms as
\begin{align}
(\boldsymbol{\mu}-\boldsymbol{\mu}')^T\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}'^{-1}(\boldsymbol{\mu}-\boldsymbol{\mu}') &= \boldsymbol{\mu}'^T\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}'^{-1}\boldsymbol{\mu}' - 2\,\boldsymbol{\mu}'^T\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\left(\boldsymbol{\mu}^T\otimes\mathbf{1}\right)\operatorname{vec}\boldsymbol{\Sigma}'^{-1} \nonumber\\
&\qquad + \left(\operatorname{vec}\boldsymbol{\Sigma}'^{-1}\right)^T(\boldsymbol{\mu}\otimes\mathbf{1})\,\boldsymbol{\Sigma}\left(\boldsymbol{\mu}^T\otimes\mathbf{1}\right)\operatorname{vec}\boldsymbol{\Sigma}'^{-1}, \tag{151}\\
\operatorname{tr}\!\left(\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\boldsymbol{\Sigma}'^{-1}\boldsymbol{\Sigma}\right) &= \left(\operatorname{vec}\boldsymbol{\Sigma}'^{-1}\right)^T(\boldsymbol{\Sigma}\otimes\boldsymbol{\Sigma})\operatorname{vec}\boldsymbol{\Sigma}'^{-1}, \tag{152}
\end{align}
with the help of the third and fourth identities in (109). Now (150) can be neatly expressed as
\[ I(g) = \frac{1}{2}\begin{bmatrix} \boldsymbol{\Sigma}'^{-1}\boldsymbol{\mu}' \\ \operatorname{vec}\boldsymbol{\Sigma}'^{-1} \end{bmatrix}^T \begin{bmatrix} \boldsymbol{\Sigma} & -\boldsymbol{\Sigma}\left(\boldsymbol{\mu}^T\otimes\mathbf{1}\right) \\ -(\boldsymbol{\mu}\otimes\mathbf{1})\boldsymbol{\Sigma} & \frac{1}{2}(\boldsymbol{\Sigma}\otimes\boldsymbol{\Sigma}) + (\boldsymbol{\mu}\otimes\mathbf{1})\boldsymbol{\Sigma}\left(\boldsymbol{\mu}^T\otimes\mathbf{1}\right) \end{bmatrix} \begin{bmatrix} \boldsymbol{\Sigma}'^{-1}\boldsymbol{\mu}' \\ \operatorname{vec}\boldsymbol{\Sigma}'^{-1} \end{bmatrix}. \tag{153} \]
This is the information contained in the Gaussian $\mathcal{N}(\boldsymbol{\mu}',\boldsymbol{\Sigma}')$, although it is conditioned by the choice of measure $\mathcal{N}(\boldsymbol{\mu},\boldsymbol{\Sigma})$ used to define the inner product. Note that as $\boldsymbol{\Sigma}'^{-1}$ tends to zero, indicating a broadening of the distribution, the information also goes to zero; that is, we lose information about what we can expect in drawing from the distribution. The expression (153) can also be interpreted as simply writing the information using a different basis associated with the so-called natural parameters of a Gaussian (Barfoot, 2020).

G Derivation of Equation (73): Derivative of the Measure in the Inner Product
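The differentiation-under-the-integral step used below (see (157)) can be checked numerically for a simple one-parameter family. In this sketch (assuming NumPy), we take $\nu = \mathcal{N}(\alpha, 1)$, so that $\partial\ln\nu/\partial\alpha = x-\alpha$, and compare a central finite difference of $E_\nu[\ln r]$ against $E_\nu[(x-\alpha)\ln r]$; the function `ln_r` is an arbitrary smooth choice of ours.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

t, w = hermegauss(40)
w = w / np.sqrt(2.0 * np.pi)

def E(alpha, g):
    # Expectation of g under nu = N(alpha, 1) via Gauss-Hermite quadrature.
    x = alpha + t
    return np.sum(w * g(x))

ln_r = lambda x: -0.5 * x**2 + np.sin(x)   # any smooth log-density-like function

alpha, h = 0.7, 1e-5
fd = (E(alpha + h, ln_r) - E(alpha - h, ln_r)) / (2.0 * h)   # d/d alpha E[ln r]
score = E(alpha, lambda x: (x - alpha) * ln_r(x))            # E[(x - alpha) ln r]
assert np.isclose(fd, score, atol=1e-5)
```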
We consider the inner product
\[ \langle p,q\rangle_\nu = E_\nu[\ln p\,\ln q] - E_\nu[\ln p]\,E_\nu[\ln q] \tag{154} \]
with
\[ \nu = \bigoplus_m \alpha_m\cdot b_m. \tag{155} \]
We emphasize that here $p$ and $q$ are held fixed. The partial derivative with respect to $\alpha_n$ is
\[ \frac{\partial}{\partial\alpha_n}\langle p,q\rangle_\nu = \frac{\partial}{\partial\alpha_n}E_\nu[\ln p\,\ln q] - \left(\frac{\partial}{\partial\alpha_n}E_\nu[\ln p]\right)E_\nu[\ln q] - E_\nu[\ln p]\left(\frac{\partial}{\partial\alpha_n}E_\nu[\ln q]\right). \tag{156} \]
In general,
\[ \frac{\partial}{\partial\alpha_n}E_\nu[\ln r] = \int\frac{\partial\nu}{\partial\alpha_n}\ln r\,d\mathbf{x} = \int\nu\,\frac{\partial\ln\nu}{\partial\alpha_n}\ln r\,d\mathbf{x} = E_\nu\!\left[\frac{\partial\ln\nu}{\partial\alpha_n}\ln r\right]. \tag{157} \]
(This quantity may in fact alternatively be written as $\langle b_n, r\rangle_\nu$.) The last two derivatives in (156) are accounted for; as for the first, replacing $\ln r$ with $\ln p\,\ln q$ above gives
\[ \frac{\partial}{\partial\alpha_n}E_\nu[\ln p\,\ln q] = E_\nu\!\left[\frac{\partial\ln\nu}{\partial\alpha_n}\ln p\,\ln q\right]. \tag{158} \]
Thus
\[ \frac{\partial}{\partial\alpha_n}\langle p,q\rangle_\nu = E_\nu\!\left[\frac{\partial\ln\nu}{\partial\alpha_n}\ln p\,\ln q\right] - E_\nu\!\left[\frac{\partial\ln\nu}{\partial\alpha_n}\ln p\right]E_\nu[\ln q] - E_\nu[\ln p]\,E_\nu\!\left[\frac{\partial\ln\nu}{\partial\alpha_n}\ln q\right]. \tag{159} \]
Now we may rewrite this as
\[ \frac{\partial}{\partial\alpha_n}\langle p,q\rangle_\nu = E_\nu\!\left[\ln p\,\ln q^{\partial\ln\nu/\partial\alpha_n}\right] - E_\nu[\ln p]\,E_\nu\!\left[\ln q^{\partial\ln\nu/\partial\alpha_n}\right] - E_\nu\!\left[\frac{\partial\ln\nu}{\partial\alpha_n}\ln p\right]E_\nu[\ln q]. \tag{160} \]
We recognize that $q^{\partial\ln\nu/\partial\alpha_n}$ is not a PDF; however, the self-normalizing feature of the inner product allows us to write
\[ E_\nu\!\left[\ln p\,\ln q^{\partial\ln\nu/\partial\alpha_n}\right] - E_\nu[\ln p]\,E_\nu\!\left[\ln q^{\partial\ln\nu/\partial\alpha_n}\right] = \left\langle p, q^{\partial\ln\nu/\partial\alpha_n}\right\rangle_\nu = \left\langle p, \frac{\partial\ln\nu}{\partial\alpha_n}\cdot q\right\rangle_\nu. \tag{161} \]
For the last term in (160), we use (56), yielding
\[ E_\nu\!\left[\frac{\partial\ln\nu}{\partial\alpha_n}\ln p\right]E_\nu[\ln q] = \left(E_\nu[\ln b_n\,\ln p] - E_\nu[\ln b_n]\,E_\nu[\ln p]\right)E_\nu[\ln q] = E_\nu[\ln q]\,\langle b_n,p\rangle_\nu. \tag{162} \]
Finally, then,
\[ \frac{\partial}{\partial\alpha_n}\langle p,q\rangle_\nu = \left\langle p,\frac{\partial\ln\nu}{\partial\alpha_n}\cdot q\right\rangle_\nu - E_\nu[\ln q]\,\langle b_n,p\rangle_\nu. \tag{163} \]
As the inner product is symmetric in its arguments, this is also
\[ \frac{\partial}{\partial\alpha_n}\langle p,q\rangle_\nu = \frac{\partial}{\partial\alpha_n}\langle q,p\rangle_\nu = \left\langle q,\frac{\partial\ln\nu}{\partial\alpha_n}\cdot p\right\rangle_\nu - E_\nu[\ln p]\,\langle b_n,q\rangle_\nu. \tag{164} \]
There is a caveat, however, in that we cannot simply transfer $\partial\ln\nu/\partial\alpha_n$ as the coefficient of $p$ to that of $q$; this is because the coefficient is a function of the domain variables of the PDFs. That transformation, though, may be expressed as
\[ \left\langle q,\frac{\partial\ln\nu}{\partial\alpha_n}\cdot p\right\rangle_\nu = \left\langle p,\frac{\partial\ln\nu}{\partial\alpha_n}\cdot q\right\rangle_\nu + E_\nu[\ln p]\,\langle b_n,q\rangle_\nu - E_\nu[\ln q]\,\langle b_n,p\rangle_\nu. \tag{165} \]
We have used the shorthand $\langle p, q\rangle^{\partial\nu/\partial\alpha_n}$