Factor graph fragmentization of expectation propagation
FFactor Graph Fragmentization of Expectation Propagation B Y W ILSON
Y. C
HEN AND M ATT
P. W
AND
University of Technology Sydney
Abstract
Expectation propagation is a general approach to fast approximate inference for graph-ical models. The existing literature treats models separately when it comes to derivingand coding expectation propagation inference algorithms. This comes at the cost of sim-ilar, long-winded algebraic steps being repeated and slowing down algorithmic develop-ment. We demonstrate how factor graph fragmentization can overcome this impediment.This involves adoption of the message passing on a factor graph approach to expectationpropagation and identification of factor graph sub-graphs, which we call fragments , thatare common to wide classes of models. Key fragments and their corresponding messagesare catalogued which means that their algebra does not need to be repeated. This allowscompartmentalization of coding and efficient software development.
Keywords:
Approximate Bayesian inference; Generalized linear mixed models; Graphicalmodels; Kullback-Leibler projection; Message passing.
Expectation propagation (e.g. Minka, 2005) is gaining popularity as a general approach tofitting and inference for large graphical models, including those that arise in statisticalcontexts such as Bayesian generalized linear mixed models (e.g. Gelman et al. , 2014; Kim& Wand, 2017). Compared with Markov chain Monte Carlo approaches, expectation prop-agation has the attractions of speed and parallelizability of the computing across multipleprocessors making it more amenable to high volume/velocity data applications. One priceto be paid is inferential accuracy since expectation propagation uses product density sim-plifications of joint posterior density functions. Another is algebraic overhead: as demon-strated by Kim & Wand (2016) several pages of algebra are required to derive explicitprogrammable expectation propagation algorithms for even very simple Bayesian mod-els. This article alleviates the latter cost. Using the notions of message passing and factorgraph fragments we demonstrate the compartmentalization of expectation propagation al-gebra and coding. The resultant infrastructure and updating formulae lead to much moreefficient expectation propagation fitting and inference and allows extension to arbitrarilylarge Bayesian models.Expectation propagation and mean field variational Bayes are the two most commonparadigms for obtaining fast approximate inference algorithms for graphical models (e.g.Bishop, 2006; Wainwright & Jordan, 2008; Murphy, 2012). Each is driven by minimumKullback-Leibler divergence considerations. As explained in Minka (2005), they can bothbe expressed as message passing algorithms on factor graphs. The alternative appella-tion variational message passing is used for mean field variational Bayes when such an ap-proach is used. The software platform
Infer.NET (Minka et al. , 2014) uses both expec-tation propagation and variational message passing to perform fast approximate infer-ence for graphical models. Recently Wand (2017) introduced factor graph fragmentizationto streamline variational message passing for semiparametric regression analysis. Semi-parametric regression (e.g. Ruppert et al. , 2009) is a big class of flexible regression models a r X i v : . [ s t a t . M E ] J a n hat includes generalized linear mixed models, generalized additive models and varying-coefficient models as special cases. Nolan & Wand (2017) and McLean & Wand (2018) builton Wand (2017) for more elaborate likelihood fragments.The crux of this article is to show how the factor graph fragment idea also can be usedto streamline expectation propagation. We focus on semiparametric regression models.However, the approach is quite general and applies to other graphical models for whichexpectation propagation is feasible. The fragment updating algorithms presented and de-rived here cover a wide range of semiparametric models and pave the way for futurederivations of the same type.Section 2 provides the background material needed for factor graph fragmentizationof expectation propagation. This includes exponential family and Kullback-Leibler pro-jection theory, as well as the notions of factor graphs and their fragment sub-graphs. Thearticle’s centerpiece is Section 3 in which several key fragments are identified and havemessage updates derived and catalogued. Such cataloguing implies that updates for aparticular fragment never have to be derived again and only need to be implementedonce in an expectation propagation software suite. An illustration involving generalizedadditive mixed model analysis of data from a longitudinal public health study is providedin Section 4. Section 5 contains some commentary of fragmentization of expectation prop-agation for more elaborate models. Factor graph fragmentization of expectation propagation relies on definitions and resultsconcerning both distribution theory and graph theory, not all of which are commonplacein the statistics literature. We provide the necessary background material in this section.
A random variable x has an exponential family distribution if its probability mass functionor density function admits the form p ( x ) = exp { T ( x ) T η − A ( η ) } h ( x ) , x ∈ R , η ∈ H. The vectors T ( x ) and η are called, respectively, the sufficient statistic and natural parameter .The set H is the space of allowable natural parameter values. The function A ( η ) is calledthe log-partition function and h ( x ) is the base measure . A key exponential family distribu-tional result is that E { T ( x ) } = ∇ A ( η ) (1)where ∇ A ( η ) is the column vector of partial derivatives of A ( η ) with respect to each ofthe components of η .Table 1 lists each of the exponential families distributions arising in this article, alongwith their defining functions and parameter spaces. The Normal and Inverse Chi-Squaredexponential families are well known. The Moon Rock exponential family is less estab-lished, and is given this name in McLean & Wand (2018). In Table 1 and elsewhere weuse the following indicator function notation: I ( P ) = 1 if the proposition P is true and I ( P ) = 0 if P is false.Note also that the Inverse Chi-Squared exponential family is equivalent to the InverseGamma exponential family. The two families differ in their common parametrizations asexplained in, for example, Section S.1.3 of the online supplement of Wand (2017). TheInverse Chi-Squared distribution has the advantage of being the special case of the InverseWishart distribution for × random matrices. Throughout this article we write X ∼ Inverse-Wishart ( κ, Λ ) T ( x ) A ( η ) h ( x ) H Normal (cid:20) xx (cid:21) − ( η /η ) 1 { ( η , η ) : − log( − η ) η ∈ R , η < } Inverse Chi-Squared (cid:20) log( x )1 /x (cid:21) ( η + 1) log( − η ) I ( x > { ( η , η ) :+ log Γ( − η − η < − , η < } Moon Rock { x log( x ) − log Γ( x ) } x log (cid:104) (cid:82) ∞ { t t / Γ( t ) } η I ( x > { ( η , η ) : × exp( η t ) dt (cid:105) η > , η + η < } Table 1:
Sufficient statistics, log-partition functions, base measures and natural parameter spacesof three exponential families. to denote a d × d random matrix X having density function p ( X ) = C − d,κ | Λ | κ/ | X | − ( κ + d +1) / exp {− tr ( Λ X − ) } I ( X symmetric and positive definite ) where κ > d − , Λ is a d × d symmetric positive definite matrix and C d,κ ≡ dκ/ π d ( d − / d (cid:89) j =1 Γ (cid:18) κ + 1 − j (cid:19) . (2)For the special case of d = 1 we write x ∼ Inverse- χ ( κ, λ ) . If p and p are two univariate density functions then the Kullback-Leibler divergence of p from p is KL ( p (cid:107) p ) ≡ (cid:90) ∞−∞ p ( x ) log { p ( x ) /p ( x ) } dx. If Q is a family of univariate density functions then the projection of the univariate densityfunction p onto Q is proj Q [ p ] ≡ argmin q ∈Q KL ( p (cid:107) q ) . (3)A core aspect of expectation propagation is projection of an arbitrary input density functiononto a particular exponential family. This corresponds to (3) with Q = (cid:8) q ( · ; η ) : q ( x ; η ) = exp { T ( x ) T η − A ( η ) } h ( x ) , η ∈ H (cid:9) . As explained in Section 2.3 of Kim & Wand (2016), the exponential family Kullback-Leiblerproblem η ∗ = argmin η ∈ H KL (cid:0) p (cid:107) q ( · ; η ) (cid:1)
3s equivalent to the sufficient statistic moment matching problem (cid:90) ∞−∞ T ( x ) exp { T ( x ) T η ∗ − A ( η ∗ ) } h ( x ) dx = (cid:90) ∞−∞ T ( x ) p ( x ) dx. (4)Because of (1) we can re-write (4) as ( ∇ A )( η ∗ ) = (cid:90) ∞−∞ T ( x ) p ( x ) dx. Then, assuming that the inverse of ∇ A is well-defined, η ∗ = ( ∇ A ) − (cid:18)(cid:90) ∞−∞ T ( x ) p ( x ) dx (cid:19) . (5)Hence, given the T moments, Kullback-Leibler projection of a density function p ontoan exponential family boils down to inversion of ∇ A . Section 3 of Wainwright & Jordan(2008) provides a detailed study of exponential families including properties of A and ∇ A .An exponential family distribution with the sufficient statistic T ( x ) being a d × vectoris said to be regular if H is an open set in R d and minimal if there is no d × vector a andconstant b ∈ R such that a T T ( x ) = b almost surely. Each of the exponential families inTable 1 are regular and minimal. Result 1 provides a summary of results from Section 3 ofWainwright & Jordan (2008) that is relevant to (5). It depends on: Definition 1.
Consider a function f : R → R d . Then the set of realizable expectations of f isthe set of points [ τ , . . . , τ d ] T ∈ R d such that there exists a univariate random variable x for which E { f ( x ) } = [ τ , . . . , τ d ] T . To illustrate the notion of the set of realizable expectations, consider the functions f : R → R and f : R → R given by f ( x ) = (cid:34) xx (cid:35) and f ( x ) = (cid:34) xx (cid:35) . The sets of all realizable expectations of f and f are, respectively M ≡ (cid:40)(cid:34) x x (cid:35) : x ≥ x (cid:41) and M ≡ (cid:40)(cid:34) x x (cid:35) : sign ( x ) x ≥ | x | (cid:41) . To show that M is the set of all realizable expectations of f note that M = M ∪ M where M ≡ { [ x x ] T : x = x } and M ≡ { [ x x ] T : x > x } . Then for any [ x x ] ∈ M we can take x to be the degenerate random variable with probability massfunction p ( x ) = I ( x = x ) . For such x , E { f ( x ) } = [ x x ] T = [ x x ] T ∈ M which showsthat all elements of M are realizable expectations of f . For any [ x x ] ∈ M taking x ∼ N ( x , x − x ) leads to E { f ( x ) } = [ x x ] T verifying that all elements of M arerealizable by E { f ( x ) } for some x . Hence, all entries of M are realizable by E { f ( x ) } forsome x .. Values [ x x ] T / ∈ M are not realizable because Jensen’s inequality implies that E ( x ) ≥ { E ( x ) } for any random variable x . Similar arguments can be used to establishthat M is the set of all realizable expectations of f . Figure 1 shows the sets M and M .We are now ready to give the pivotal: Result 1 (Wainwright & Jordan, 2008).
Consider a regular and minimal exponential familywith d -dimensional sufficient statistic T ( x ) and corresponding natural parameter vector η . Then (a) H is a strictly convex subset of R d . (b) A is a strictly convex and infinitely differentiable function on H . − − − x x − − − − − − x x Figure 1:
Left panel: The shaded region is M , the set of realizable expectations of f . Right panel:The shaded region is M , the set of realizable expectations of f . (c) ∇ A is a one-to-one function. (d) The image of ∇ A , which we denote by T , is the interior of the set of all realizable expectationsof T . Result 1 guarantees that ∇ A : H → T is a bijective map and that ( ∇ A ) − : T → H iswell-defined. The Normal distribution is the one of simplest exponential families since ∇ A and ( ∇ A ) − admit simple closed forms. Firstly, we have ∇ A ( η ) = (cid:34) − η / (2 η )( η − η ) / (4 η ) (cid:35) . It is straightforward to show that the image of H under ∇ A is T = { ( τ , τ ) : τ > τ } and the inverse of ∇ A is ( ∇ A ) − ( τ ) = (cid:20) τ / ( τ − τ ) − / { τ − τ ) } (cid:21) . For the Inverse Chi-Squared distribution we have ∇ A ( η ) = (cid:34) log( − η ) − digamma ( − η − η + 1) /η (cid:35) where digamma ( x ) ≡ ddx log Γ( x ) . Determination of the image of H under ∇ A is morechallenging for the Inverse Chi-Squared distribution. It is aided by Theorem 1 of Kim &5and (2016) which establishes that log − digamma is a bijective map between R + and R + .This leads to T = { ( τ , τ ) : τ > e − τ } . The inverse or ∇ A is ( ∇ A ) − ( τ ) = (cid:34) − (log − digamma ) − (cid:0) τ + log( τ ) (cid:1) − − (log − digamma ) − (cid:0) τ + log( τ ) (cid:1) /τ (cid:35) . Theorem 1 of Kim & Wand (2016) implies that (log − digamma ) − is well-defined. H −3 −2 −1 1 2 3−3−2−1 h h T −3 −2 −1 1 2 3123456789 t t H −5 −4 −3 −2 −1 −3−2−1 h h T −2 −1 1 2 3123456789 t t Figure 2:
Upper panel: Illustration of the bijective maps between H and T for the Normal expo-nential family. The crosses and dotted lines depict five example η ∈ H and τ = ∇ A ( η ) ∈ T pairs.Since ∇ A is a bijective map, the crosses and dotted lines equivalently depict five example τ ∈ T and η = ( ∇ A ) − ( τ ) ∈ H pairs. Lower panel: Similar illustration for the Inverse Chi-Squaredexponential family. Figure 2 depicts the ∇ A and ( ∇ A ) − bijective maps between H and T for both theNormal and Inverse Chi-Squared exponential family distributions. A factor graph is a graphical representation of the factor/argument dependencies of a mul-tivariate function. Even though the concept applies to functions in general, the relevant6unctions are joint density functions in the context of expectation propagation. As an illus-tration, consider the Bayesian linear model y | β , σ ∼ N ( Xβ , σ I ) , where y is an n × vector of responses, with the following prior distributions on theregression coefficients and error standard deviation: β ∼ N ( , σ β I ) and σ ∼ Half-Cauchy ( A ) , The second prior specification means that σ has prior density function p ( σ ) = 2 / [ Aπ { σ/A ) } ] for σ > . An equivalent representation of the model, involving the auxiliaryvariable a , is y | β , σ ∼ N ( Xβ , σ I ) , β ∼ N ( , σ β I ) ,σ | a ∼ Inverse- χ (1 , /a ) , a ∼ Inverse- χ (1 , /A ) . (6)We work with this auxiliary variable representation since it aids tractability of expectationpropagation. The joint density function of the random variables and random vectors in (6)admits the following factorized form: p ( y , β , σ , a ) = p ( β ) p ( y | β , σ ) p ( σ | a ) p ( a ) . Now let x Ti be the i th row of X for ≤ i ≤ n . Then a further breakdown of p ( y , β , σ , a ) is p ( y , β , σ , a ) = p ( β ) (cid:40) n (cid:89) i =1 (cid:90) ∞−∞ δ ( α i − x Ti β ) p ( y i | α i , σ ) dα i (cid:41) p ( σ | a ) p ( a ) (7)where δ is the Dirac delta function and p ( y i | α i , σ ) ≡ (2 πσ ) − / exp {− ( y i − α i ) / (2 σ ) } .Figure 3 is a factor graph representation of p ( y , β , σ , a ) according to the factors that ap-pear in (7). At this point we note that we are not using the conventional factor graphdefinition here since some of the factors appear inside the integrals in (7). Kim & Wand(2017) introduced the term derived variable factor graph to make this distinction. We willsimply call it a factor graph from now onwards. The circles are called stochastic nodes andthe rectangles are called factors . Both circles and rectangles are nodes of the factor graph.We say that a two nodes are neighbors of each other if they are joined by an edge. β σ a δ ( α − x β ) α p(y | α , σ ) δ ( α n − x nT β ) α n p(y n | α n , σ ) ●●● ●●● ●●● p( β ) p( σ |a) p(a) β σ a δ ( α − x β ) α p(y | α , σ ) δ ( α n − x nT β ) α n p(y n | α n , σ ) ●●● ●●● ●●● p( β ) p( σ |a) p(a) Figure 3:
Factor graph representation of (7).
Figure 4 is a representation of Figure 3 with factor graph fragments of the same typeidentified via color-coding and numbering of the factors. As defined in Wand (2017), afragment is a sub-graph of a factor graph consisting of a factor and each of its neighboringstochastic nodes.The five different colors in Figure 4 correspond to five different fragment types. Someof the fragment types, such as that corresponding to the p ( β ) factor, only appear once inthis factor graph. Other types, such as those corresponding to δ ( α i − x Ti β ) , ≤ i ≤ n ,7 σ a δ ( α − x β ) α p(y | α , σ ) δ ( α n − x nT β ) α n p(y n | α n , σ ) ●●● ●●● ●●● p( β ) p( σ |a) p(a) Figure 4:
Fragmentization of the Figure 3 factor graph. Different colors signify fragments of thesame type, and are included in Table 2. appear multiple times. Recognition of the recurrence of fragments of the same type inthis factor graph and factor graphs for other models is at the core of extension to arbitrar-ily large models. Wand (2017) demonstrated factor graph fragmentization of variationalmessage passing. Our goal here is to do the same for expectation propagation.
Recent summaries of expectation propagation are provided in Kim & Wand (2016, 2017).We briefly cover the main points here. The function neighbors ( · ) plays an important rolein the algebraic description of the message updates. Consider the illustrative genericform factor graph shown in Figure 5, corresponding to the joint density function of ran-dom vectors θ , . . . , θ according to a particular Bayesian model. Then neighbours (1) = { , , } since the factor f is connected by edges to each of θ , θ and θ . Similarly,neighbours (2) = { , , } , neighbours (3) = { } , neighbours (4) = { , } and neighbours (5) = { , } . For general factor graphs with the θ i and f j labeling, neighbours ( j ) is the set of in-dices of the θ i that are connected to f j by an edge. Based on (54) of Minka (2005), the θ θ θ θ θ f f f f f Figure 5:
An illustrative generic form factor graph. stochastic node to factor messages are updated according to m θ i → f j ( θ i ) ←− (cid:89) j (cid:48) (cid:54) = j : i ∈ neighbours ( j (cid:48) ) m f j (cid:48) → θ i ( θ i ) (8)and, based on (83) of Minka (2005), the factor to stochastic node messages updates are m f j → θ i ( θ i ) ←− proj (cid:34) m θ i → f j ( θ i ) (cid:90) f j ( θ neighbours ( j ) ) (cid:89) i (cid:48) ∈ neighbours ( j ) \{ i } m θ i (cid:48) → f j ( θ i (cid:48) ) d θ neighbours ( j ) \{ i } /Z (cid:35) m θ i → f j ( θ i ) , (9)where Z is the normalizing factor that ensures that the function of θ i inside the proj [ · ] is a density function. The normalizing factor in (9) involves summation if some of the8 i (cid:48) have discrete components. The proj [ · ] in (9) denotes Kullback-Leibler projection ontoan appropriate exponential family of density functions. However, in Kim & Wand (2016)illustration was done only via a simple example in which all of the stochastic nodes wereunivariate. In the case of linear models, in which vector parameters are present, someadjustments are necessary to avoid intractable multivariate integrals. The first one is anintrinsically important convention and is now spelt out: Convention 1.
Derived variable factor graphs are treated as ordinary factor graphs whenit comes to applying the message passing expressions (8) and (9).
In practice, iteration involving (8) and (9) may require some tweaking to achieve con-vergence. Minka (2005) recommends the damping adjusment m f j → θ i ( θ i ) ←− m f j → θ i ( θ i ) ε × { right-hand side of (9) } − ε . (10)for some ≤ ε < . Kim and Wand (2017) noted that setting ε to a small positive numbersuch as ε = 0 . aided convergence for their expectation propagation algorithms for fittinglinear models. Therefore, we build this adjustment into the fragment updates in the nextsection.The full expectation propagation iterative algorithm is:Initialize all factor to stochastic node messages.Cycle until all factor to stochastic node messages converge:For each factor:Compute the messages passed to the factor using (8).Compute the messages passed from the factor using (9) and (10).Upon convergence the expectation propagation-approximate posterior density function of θ i is obtained from q ∗ ( θ i ) ∝ (cid:89) j : i ∈ neighbours ( j ) m f j → θ i ( θ i ) . Each of the generalized, linear and mixed models dealt with in Kim & Wand (2017) canbe handled with nine distinct fragment types, which are listed in Table 2. The messageupdates for each fragment type only needs to be derived once. Each subsection deals withthe required derivation and summarizes the updates as an algorithm. For a software suitethat uses expectation propagation to fit generalized, linear and mixed models the fragmentonly needs to be implemented once. We now work through each of the Table 2 fragmentsin turn.The algorithms use the matrix functions vec and its inverse vec − which we definehere. If A is d × d matrix then vec ( A ) is the d × vector obtained by stacking the columnsof A underneath each other in order from left to right. if a is a d × vector then vec − ( a ) is the d × d matrix formed from listing the entries of a in a column-wise fashion in orderfrom left to right and is the usual function inverse when the domain of vec is restricted tosquare matrices.The following shorthand is used throughout this section: a ε ←− b denotes a ←− ε a + (1 − ε ) b. ragment name Diagram Distributional statement1. Gaussian prior fragment name diagram distributional statementGaussian prior p( θ ) ● θ θ ~ N( µ θ , Σ θ )Inverse Wishart prior p( Θ ) ● Θ Θ ~ Inverse − Wishart( κ Θ , Λ Θ )Iterated Inverse Wishart p( Θ | Θ ) ● Θ ● Θ Θ | Θ ~ Inverse − Wishart( κ ,2 Θ − )Gaussian penalization p( θ , … , θ L | Θ , … , Θ L ) ● ( θ , … , θ L ) ● Θ ● Θ L ● ● ● θ ~ N( µ θ , Σ θ ), θ ell | Θ l ~ N(0, I m l ⊗Θ l ), 1 ≤ l ≤ L, θ , … , θ L independentGaussian likelihood p( y | θ , Θ ) ● θ ● Θ y | θ , Θ ~ N( A θ , I m ⊗Θ ) θ ∼ N ( µ θ , Σ θ )
2. Inverse Wishart Θ p( Θ ) Θ ∼ Inverse-Wishart ( κ Θ , Λ Θ ) prior3. Iterated Inverse Chi-Squared fragment name diagram distributional statementGaussian prior p( θ ) ● θ θ ~ N( µ θ , Σ θ )Inverse Wishart prior p( Θ ) ● Θ Θ ~ Inverse − Wishart( κ Θ , Λ Θ )Iterated Inverse Wishart p( Θ | Θ ) ● Θ ● Θ Θ | Θ ~ Inverse − Wishart( κ ,2 Θ − )Gaussian penalization p( θ , … , θ L | Θ , … , Θ L ) ● ( θ , … , θ L ) ● Θ ● Θ L ● ● ● θ ~ N( µ θ , Σ θ ), θ ell | Θ l ~ N(0, I m l ⊗Θ l ), 1 ≤ l ≤ L, θ , … , θ L independentGaussian likelihood p( y | θ , θ ) ● θ ● θ y | θ , Θ ~ N( A θ , I m ⊗Θ ) σ | a ∼ Inverse- χ ( ν, /a )
4. Linear combination θ δ ( α − a T θ ) α α ≡ a T θ derived variable5. Multivariate linear combi- θ δ ( α − A T θ ) α α ≡ A T θ nation derived variable6. Gaussian α p(y| α , σ ) σ y | α, σ ∼ N ( α, σ )
7. Logistic likelihood α p(y| α ) y | α ∼ Bernoulli ( logit − ( α ))
8. Probit likelihood α p(y| α ) y | α ∼ Bernoulli (Φ( α ))
9. Poisson likelihood α p(y| α ) y | α ∼ Poisson (exp( α )) Table 2:
Fundamental factor graph fragments for expectation propagation fitting of generalized,linear and mixed models.
The Gaussian prior fragment arises from the following prior distribution specification: θ ∼ N ( µ θ , Σ θ ) for user-specified hyperparameters µ θ and Σ θ . The fragment factor is p ( θ ) = (2 π ) − d θ / | Σ θ | − / exp (cid:8) − ( θ − µ θ ) T Σ − θ ( θ − µ θ ) (cid:9) . We assume that all messages passed to θ from factors outside of thefragment are in the Multivariate Normal family. (11)The message from p ( θ ) to θ takes the form m p ( θ ) → θ ( θ ) = exp (cid:40)(cid:20) θ vec ( θθ T ) (cid:21) T η p ( θ ) → θ (cid:41) . Algorithm 1 provides the natural parameter update for this simple fragment. The deriva-tion of Algorithm 1 is given in Section S.2.1 of the online supplement.10 lgorithm 1
The input, update and output of the Gaussian prior fragment.
Hyperparameter Inputs: µ θ , Σ θ . Update: η p ( θ ) → θ ←− Σ − θ µ θ − vec ( Σ − θ ) Parameter Output: η p ( θ ) → θ . Let Θ be a d Θ × d Θ symmetric positive definite random matrix. The prior specification Θ ∼ Inverse-Wishart (cid:0) κ Θ , Λ Θ (cid:1) leads to a factor graph fragment with factor p ( Θ ) = C − d Θ ,κ Θ | Λ Θ | κ Θ / | Θ | − ( κ Θ + d Θ +1) / exp {− tr ( Λ Θ Θ − ) } I ( Θ symmetric and positive definite ) . where C d Θ ,κ Θ is defined via (2). The message from p ( Θ ) to Θ takes the form m p ( Θ ) → Θ ( Θ ) = exp (cid:34) log | Θ | vec ( Θ − ) (cid:35) T η p ( Θ ) → Θ . Algorithm 2 gives the η p ( Θ ) → Θ update based on hyperparameter inputs κ Θ and Λ Θ . Algorithm 2
The input, update and output of the Inverse Wishart prior fragment.
Hyperparameter Inputs: κ Θ , Λ Θ . Update: η p ( Θ ) → Θ ←− − ( κ Θ + d Θ + 1) − vec ( Λ Θ ) Parameter Output: η p ( Θ ) → Θ .A derivation of Algorithm 2 is given in Section S.2.2 of the online supplement. This fragment arises from the following distributional fact (e.g. Wand et al. (2011), Result5): σ | a ∼ Inverse- χ ( ν, ν/a ) and a ∼ Inverse- χ (1 , /A ) implies σ ∼ Half-t ( A, ν ) (12)where x ∼ Half-t ( A, ν ) if and only if p ( x ) = 2Γ (cid:0) ν +12 (cid:1) I ( x > √ πν Γ( ν/ A { x/A ) /ν } ( ν +1) / . t family can be im-posed on standard deviation parameters using messages within the Inverse Chi-Squaredfamily.The fragment factor is p ( σ | a ) = { ν/ (2 a ) } ν/ Γ( ν/
2) ( σ ) − ( ν/ − exp {− ν/ (2 aσ ) } I ( σ > I ( a > and it is assumed that:all messages passed to σ from factors outside ofthe fragment are in the Inverse Chi-Squared family andall messages passed to a from factors outside ofthe fragment are also in the Inverse Chi-Squared family. (13)The messages from the factor to its neighboring stochastic nodes are m p ( σ | a ) → σ ( σ ) = exp (cid:34) log( σ )1 /σ (cid:35) T η p ( σ | a ) → σ and m p ( σ | a ) → a ( a ) = exp (cid:34) log( a )1 /a (cid:35) T η p ( σ | a ) → a . Algorithm 3 provides the updates of the natural parameters of these messages given mes-sages from outside the fragment. The function G IG3 is defined in Section S.1.3.
Algorithm 3
The inputs, updates and outputs of the iterated Inverse Chi-Squared fragment.
Data Input: ν > , ≤ ε < . Parameter Inputs: η p ( σ | a ) → σ , η p ( σ | a ) → a , η σ → p ( σ | a ) , η a → p ( σ | a ) . Updates: η p ( σ | a ) → σ ε ←− G IG3 (cid:16) η σ → p ( σ | a ) , η a → p ( σ | a ) ; ν + 2 , ν (cid:17) η p ( σ | a ) → a ε ←− G IG3 (cid:16) η a → p ( σ | a ) , η σ → p ( σ | a ) ; ν, ν (cid:17) Parameter Outputs: η p ( σ | a ) → σ , η p ( σ | a ) → a . The linear combination derived variable fragment corresponds to equating a scalar vari-able α with a linear combination a T θ . If g is a general function that depends on the linearcombination form a T θ and other variables, denoted by o , then the derived variable α arises from the equality: g ( a T θ ; o ) = (cid:90) ∞−∞ δ ( α − a T θ ) g ( α ; o ) dα. (14)12here δ is the Dirac delta function. Under Convention 1 given in Section 2.4, the integralsign is ignored when it comes to applying the expectation propagation updates (8) and (9).We assume that: all messages passed to α from factors outside ofthe fragment are in the Univariate Normal familyand all messages passed to θ from factors outsideof the fragment are in the Multivariate Normal family. (15)The function δ ( α − a T θ ) is the factor for this fragment. According to conjugacy restrictions,messages passed from δ ( α − a T θ ) to α and θ take the forms m δ ( α − a T θ ) → α ( α ) = exp (cid:40)(cid:20) αα (cid:21) T η δ ( α − a T θ ) → α (cid:41) (16)and m δ ( α − a T θ ) → θ ( θ ) = exp (cid:40)(cid:20) θ vec ( θθ T ) (cid:21) T η δ ( α − a T θ ) → θ (cid:41) . Algorithm 4 provides the updates to the natural parameter vectors η δ ( α − a T θ ) → α and η δ ( α − a T θ ) → θ given inputs η α → δ ( α − a T θ ) and η θ → δ ( α − a T θ ) . It uses the notation ( η θ → δ ( α − a T θ ) (cid:17) ≡ vector containing the first d θ entries of η θ → δ ( α − a T θ ) and ( η θ → δ ( α − a T θ ) (cid:17) ≡ vector containing the remaining ( d θ ) entries of η θ → δ ( α − a T θ ) where d θ is the number of entries in θ . The derivations of these updates are given inSection S.2.5 of the online supplement. Now consider the following bivariate extension of (14): g (cid:0) a T θ , a T θ ; o (cid:1) = (cid:90) ∞−∞ (cid:90) ∞−∞ δ ( α − a T θ ) δ ( α − a T θ ) g ( α , α ; o ) dα dα , (17)where the primary argument of the function g is now bivariate. The established result forthe Dirac delta function applied to bivariate arguments leads to the equivalent form forthe right-hand side of (17) taking the form: (cid:90) ∞−∞ (cid:90) ∞−∞ δ (cid:18)(cid:20) α α (cid:21) − A T θ (cid:19) g ( α , α ; o ) dα dα where A ≡ [ a a ] . It follows that α ≡ (cid:20) α α (cid:21) is a bivariate derived variable corresponding to the multivariate linear combination A T θ .13 lgorithm 4 The inputs, updates and outputs of the linear combination derived variable fragment.
Data Input: a (vector having the same dimension as θ ), ≤ ε < . Parameter Inputs: η δ ( α − a T θ ) → α , η δ ( α − a T θ ) → θ , η α → δ ( α − a T θ ) , η θ → δ ( α − a T θ ) . Updates: ω ←− (cid:110) vec − (cid:16)(cid:16) η θ → δ ( α − a T θ ) (cid:17) (cid:17)(cid:111) − a η δ ( α − a T θ ) → α ε ←− ω T a ω T (cid:16) η θ → δ ( α − a T θ ) (cid:17) η δ ( α − a T θ ) → θ ε ←− a (cid:16) η α → δ ( α − a T θ ) (cid:17) vec ( aa T ) (cid:16) η α → δ ( α − a T θ ) (cid:17) Parameter Outputs: η δ ( α − a T θ ) → α , η δ ( α − a T θ ) → θ .In the most general case, θ and α are, respectively, d θ × and d α × vectors and A is a d θ × d α matrix. The fragment factor is δ ( α − A T θ ) and the message given in (16)generalizes to η δ ( α − A T θ ) → α ( α ) = exp (cid:40)(cid:20) α vec ( αα T ) (cid:21) T η δ ( α − A T θ ) → α (cid:41) . Algorithm 5 lists the natural parameter updates. Their derivations are given in SectionS.2.5 of the online supplement.Note that Algorithm 5 is a generalization of Algorithm 4. Therefore, from a strict math-ematical standpoint, Algorithm 4. However, since ordinary linear combinations are com-mon in expectation propagation fitting of linear models we feel that it is worth having aseparate fragment and algorithm for this special case.
The Gaussian fragment corresponds to the specification y | α, σ ∼ N ( α, σ ) . The fragment’s factor is p ( y | α, σ ) = (2 πσ ) − / exp {− ( y − α ) / (2 σ ) } which, as a function of α , is in the Normal family and, as a function of σ , is in the In-verse Chi-Squared family. Exponential family constraint considerations then lead to thefollowing assumption for the Gaussian fragment:all messages passed to α from factors outside ofthe fragment are in the Univariate Normal familyand all messages passed to σ from factors outsideof the fragment are in the Inverse Chi-Squared family. (18)14 lgorithm 5 The inputs, updates and outputs of the multivariate linear combination derived vari-able fragment.
Data Input: A (matrix with number of columns matching the dimension of θ ), ≤ ε < . Parameter Inputs: η δ ( α − A T θ ) → α , η δ ( α − A T θ ) → θ , η α → δ ( α − A T θ ) , η θ → δ ( α − A T θ ) . Updates: Ω ←− (cid:110) vec − (cid:16)(cid:16) η θ → δ ( α − A T θ ) (cid:17) (cid:17)(cid:111) − A η δ ( α − A T θ ) → α ε ←− ( Ω T A ) − Ω T (cid:16) η θ → δ ( α − A T θ ) (cid:17) vec (cid:0) ( Ω T A ) − (cid:1) η δ ( α − A T θ ) → θ ε ←− A (cid:16) η α → δ ( α − A T θ ) (cid:17) ( A ⊗ A ) (cid:16) η α → δ ( α − A T θ ) (cid:17) Parameter Outputs: η δ ( α − A T θ ) → α , η δ ( α − A T θ ) → θ .The messages from p ( y | α, σ ) take the forms m p ( y | α, σ ) → α ( α ) = exp (cid:34) αα (cid:35) T η p ( y | α, σ ) → α and m p ( by | α, σ ) → σ ( σ ) = exp (cid:34) log( σ )1 /σ (cid:35) T η p ( y | α, σ ) → σ with natural parameters updated according to Algorithm 6. The functions G N and G IG3 aredefined in Section S.1.3. Algorithm 6’s derivation is given in Section S.2.6.
The logistic likelihood fragment corresponds to the specification y | α ∼ Bernoulli (cid:8) logit − ( α ) (cid:9) . The factor of the fragment is p ( y | α ) = exp { yα − log(1 + e α ) } . We assume that: all messages passed to α from other factors arewithin the Univariate Normal exponential family. (19)Conjugacy then dictates that m p ( y | α ) → α ( α ) = exp (cid:40)(cid:20) αα (cid:21) T η p ( y | α ) → α (cid:41) . (20)15 lgorithm 6 The inputs, updates and outputs of the Gaussian fragment.
Data Input: y ∈ R , ≤ ε < . Parameter Inputs: η p ( y | α, σ ) → α , η p ( y | α, σ ) → σ , η α → p ( y | α, σ ) , η σ → p ( y | α, σ ) . Update: η p ( y | α, σ ) → α ε ←− G N η α → p ( y | α, σ ) , η σ → p ( y | α, σ ) ; yy η p ( y | α, σ ) → σ ε ←− G IG1 η σ → p ( y | α, σ ) , η α → p ( y | α, σ ) ; yy Parameter Outputs: η p ( y | α, σ ) → α , η p ( y | α, σ ) → σ .Algorithm 7 provides the update to the natural parameter vector η p ( y | α ) → α based on input η α → p ( y | α ) and depends on the function H logistic defined at (S.3) in the online supplement.Its derivation is given in Section S.2.7 of the online supplement. Algorithm 7
The inputs, updates and outputs of the logistic likelihood fragment.
Data Input: y ∈ { , } , ≤ ε < . Parameter Inputs: η p ( y | α ) → α , η α → p ( y | α ) . Update: η p ( y | α ) → α ε ←− H logistic ( η α → p ( y | α ) ; y ) Parameter Output: η p ( y | α ) → α . The probit likelihood fragment corresponds to the specification y | α ∼ Bernoulli (cid:0) Φ( α ) (cid:1) . The factor of the fragment is p ( y | α ) = exp (cid:2) y log { Φ( α ) } + (1 − y ) log { − Φ( α ) } (cid:3) . As for the logistic likelihood fragment, we also assume (19) which implies that m p ( y | α ) → α ( α ) also takes the form (20). The fragment update is given in Algorithm 8, with justificationdeferred to Section S.2.8 of the online supplement. The function H probit is defined in SectionS.1.3 of the online supplement. Note that H probit has the advantage of admitting a closedform expression. This is not the case for H logistic and numerical integration is required forits evaluation. 16 lgorithm 8 The inputs, updates and outputs of the probit likelihood fragment.
Data Input: y ∈ { , } , ≤ ε < . Parameter Inputs: η p ( y | α ) → α , η α → p ( y | α ) . Update: η p ( y | α ) → α ε ←− H probit ( η α → p ( y | α ) ; y ) Parameter Outputs: η p ( y | α ) → α . The Poisson likelihood fragment matches y | α ∼ Poisson (cid:8) exp( α ) (cid:9) and the factor of the fragment is p ( y | α ) = exp { yα − e α − log( y !) } . As for the logistic and Poisson likelihood fragments, we also assume (19) which impliesthat m p ( y | α ) → α ( α ) also takes the form (20). The fragment update is given in Algorithm 9with the H Poisson function defined at (S.3)Section S.2.9 of the online supplement contains justification of Algorithm 9.
Algorithm 9
The inputs, updates and outputs of the Poisson likelihood fragment.
Data Input: y ∈ { , , , . . . } , ≤ ε < . Parameter Inputs: η p ( y | α ) → α , η α → p ( y | α ) . Update: η p ( y | α ) → α ε ←− H Poisson ( η α → p ( y | α ) ; y ) Parameter Outputs: η p ( y | α ) → α . We now provide illustration via a generalized additive mixed model analysis. The data arefrom the Indonesian Children’s Health Study (Sommer, 1982), corresponding to a cohortof
Indonesian children who are repeatedly examined. The response variable is y ij = (cid:26) , if respiratory infection present in the i th child at the j th examination, , otherwise (21)for ≤ i ≤ m and ≤ j ≤ n i . For these data note that m = 275 and the n i ∈ { , . . . , } .Potential predictor variables are age, indicator of vitamin A deficiency, indicator of beingfemale, height, indicator of being stunted and indicators for the number of clinic visits17or each child. We let a ij denote the age in years of the i th child at the j th examination.Consider the following Bayesian generalized additive mixed model: y ij | β , β x , β spl , u grp , u spl ind. ∼ Bernoulli (cid:16) logit − (cid:0) β + u grp ,i + β T x x ij + f ( a ij ) (cid:1)(cid:17) ,f ( a ij ) ≡ β spl a ij + K (cid:88) k =1 u spl ,k z k ( a ij ) is a low-rank smoothing spline in a ij ,where { z k ( · ) : 1 ≤ k ≤ K } is a suitable spline basis , β ≡ β β x β spl ∼ N ( µ β , Σ β ) , u ≡ (cid:20) u grp u spl (cid:21) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) σ grp , σ spl ∼ N (cid:18)(cid:20) (cid:21) , (cid:20) σ grp I m σ spl I K (cid:21)(cid:19) ,σ grp | A grp ∼ Inverse- χ (1 , /A grp ) , σ spl | a spl ∼ Inverse- χ (1 , /a spl ) ,A grp ∼ Inverse- χ (1 , /A grp ) , a spl ∼ Inverse- χ (1 , /A spl ) . (22)The ‘grp’ and ‘spl’ subscripting indicates whether the random effect vector and corre-sponding variance parameter is for the random subject intercept or for spline coefficientsin the non-linear function of age. Let y denote the N × vector containing the y ij , where N ≡ (cid:80) mi =1 n i . Despite the common use of double subscript notation as in (21), it is moreconvenient to label the entries of y with a single subscript when it comes to fitting viaexpectation propagation. To avoid a notational clash we use y s(cid:96) , ≤ (cid:96) ≤ N , to denote the (cid:96) th entry of y . Let d β be the number of rows in β . For the Indonesian Children’s HealthStudy Data application d β = 11 . Then let X by the N × d β matrix containing the predictordata. The random effects design matrix is Z = [ Z grp Z spl ] where Z grp ≡ blockdiag ≤ i ≤ m ( n i ) and Z spl ≡ (cid:104) z k ( a ij ) ≤ k ≤ K (cid:105) ≤ j ≤ n i , ≤ i ≤ m . Then the likelihood can be written as y s(cid:96) | β , u ind. ∼ Bernoulli (cid:16) logit − (cid:0) ( Xβ + Zu ) (cid:96) (cid:1)(cid:17) , ≤ (cid:96) ≤ N. Next, let C ≡ [ X Z ] so that Xβ + Zu = C (cid:20) βu (cid:21) and let c T(cid:96) be the (cid:96) th row of C . Let e r be the ( m + K ) × vector with r th entry equal to and zeroes elsewhere for ≤ r ≤ m + K . Lastly, let E d β be the ( d β + m + K ) × d β matrixwith the d β × d β identity matrix at the top and all other entries equal to zero. The joint18ensity function of all random variables in the model is p ( y , β , u , σ grp , σ spl , A grp , a spl )= p ( y | β , u ) p ( β ) p ( u | σ grp , σ spl ) p ( σ grp | A grp ) p ( σ spl | a spl ) p ( A grp ) p ( a spl )= p ( β ) (cid:40) N (cid:89) (cid:96) =1 p ( y s(cid:96) | β , u ) (cid:41) (cid:40) m (cid:89) i =1 p ( u grp ,i | σ grp ) (cid:41) (cid:40) K (cid:89) k =1 p ( u spl ,k | σ spl ) (cid:41) p ( σ grp | A grp ) × p ( σ spl | a spl ) p ( A grp ) p ( a spl )= (cid:26)(cid:90) R dβ p ( (cid:101) β ) δ (cid:18)(cid:101) β − E Td β (cid:20) βu (cid:21)(cid:19) d (cid:101) β (cid:27) (cid:40) N (cid:89) (cid:96) =1 (cid:90) ∞−∞ p ( y s(cid:96) | α (cid:96) ) δ (cid:18) α (cid:96) − c T(cid:96) (cid:20) βu (cid:21)(cid:19) dα (cid:96) (cid:41) × (cid:40) m (cid:89) i =1 (cid:90) ∞−∞ p ( (cid:101) u grp ,i | σ grp ) δ (cid:18)(cid:101) u grp ,i − e Td β + i (cid:20) βu (cid:21)(cid:19) d (cid:101) u grp ,i (cid:41) p ( σ grp | A grp ) p ( A grp ) × (cid:40) K (cid:89) k =1 (cid:90) ∞−∞ p ( (cid:101) u spl ,k | σ grp ) δ (cid:18)(cid:101) u spl ,k − e Td β + m + k (cid:20) βu (cid:21)(cid:19) d (cid:101) u spl ,k (cid:41) p ( σ spl | a spl ) p ( a spl ) . (23)Figure 6 is the derived variable factor graph corresponding to the representation of thejoint density function given in (23). All of the fragments in Figure 6 are versions of funda-mental fragments listed in Table 2 and are color-coded and numbered accordingly. Expec-tation propagation inference for this model and data involves iteratively passing messagesbetween neighboring nodes on the Figure 6 factor graph. The parameter updates for thefactor to stochastic node messages are given by the relevant algorithms in Section 3. Thestochastic node to factor message parameter updates are a simple consequence of (8). p(a sbj ) ● a sbj p( σ sbj2 |a sbj ) ● σ sbj2 p(u~ sbj,1 | σ sbj2 ) p(u~ sbj,m | σ sbj2 ) ● u~ sbj,1 ● u~ sbj,m δ u~ sbj,1 − e d β + β u δ u~ sbj,m − e d β + mT β u δ u~ spl,1 − e d β + m + β u ● u~ spl,1 p(u~ spl,1 | σ spl2 ) δβ ~ − E d β T β u ● β u ● σ spl2 δ u~ spl,K − e d β + m + KT β u ● u~ spl,K p(u~ spl,K | σ spl2 ) ● β ~ δα − c β u δα N − c NT β u p( β ~) ● α ● α N p(y | α ) p(y N | α N ) p(a spl ) p( σ spl2 |a spl ) ● a spl ●●● ●●● ●●● ●●●●●●●●●● ● ●● ● ●● ● ● Figure 6:
Derived variable factor graph corresponding to the representation of the joint densityfunction of random variables in the generalized additive mixed model (22) given by (23).
We fit (22) using 1,000 iterations of expectation propagation message passing on thefactor graph of Figure 6. We also conducted Markov chain Monte Carlo fitting via the func-19 . . . . . vitamin A deficiency app r o x . po s t e r i o r accuracy EPMCMC . . . . male app r o x . po s t e r i o r accuracy −0.10 −0.05 0.00 0.05 height app r o x . po s t e r i o r accuracy −1.0 0.0 1.0 2.0 . . . . . stunted app r o x . po s t e r i o r accuracy −2.5 −1.5 −0.5 . . . . . . app r o x . po s t e r i o r accuracy −2.0 −1.0 0.0 0.5 . . . . . . app r o x . po s t e r i o r accuracy −3.0 −2.0 −1.0 0.0 . . . . . app r o x . po s t e r i o r accuracy −0.5 0.0 0.5 1.0 1.5 . . . . app r o x . po s t e r i o r accuracy −1.0 −0.5 0.0 0.5 1.0 . . . app r o x . po s t e r i o r accuracy . . . . s grp2 app r o x . po s t e r i o r accuracy . . . . s spl2 app r o x . po s t e r i o r accuracy . . . non−linear age effect e s t. p r obab . o f r e s p i r . i n f e c . age in years Figure 7:
Comparison of two approximate Bayesian inference methods, expectation propagationand Markov chain Monte Carlo, for model (22) applied to the Indonesian Children’s Health StudyData. The first three rows compares approximate posterior density functions for the fixed effectsparameters. The heading at the top of the panel is the corresponding predictor. The first two panelsin the fourth row compares approximate posterior density functions for the two variance parameters.The accuracy percentages correspond to the definition at (24). The bottom right panel compares thelow-rank smoothing spline fits for the non-linear age effect on the probability of respiratory infectionwith all other predictors set to their averages. In this panel, the dashed curves indicate pointwise95% credible intervals and the tick marks show the age data. stan() in the R package rstan (Guo, Gabry & Goodrich, 2017), which interfaces the Stan language (Carpenter et al. , 2017), with a warmup size of 50,000 and a retained samplesize of 1,000,000. The hyperparameters were set to µ β = , Σ β = 10 I , σ grp = σ spl = 10 with continuous variables standardized for the analyses and then results transformed tocorrespond to the original units. Figure 7 compares the Bayesian inference arising from thetwo approaches. The first three rows compare the expectation and Markov chain MonteCarlo approximate posterior density functions for the fixed effects parameters. The lastrow contains similar comparisons for the variance parameters and the low-rank smooth-ing spline fits for the non-linear age effect. The estimated probability functions are suchthat all other predictors are set at their average values, and are accompanied by pointwise95% credible intervals.The posterior density function comparisons are accompanied by accuracy percentages.For a generic parameter θ , the accuracy of the approximation q ( θ ) to the posterior densityfunction p ( θ | y ) is given byaccuracy ≡ (cid:26) − (cid:90) ∞−∞ (cid:12)(cid:12) q ( θ ) − p ( θ | y ) (cid:12)(cid:12) dθ (cid:27) % . (24)The Markov chain Monte Carlo-based posterior density functions, as well as the accuracypercentages on which they depend, are binned kernel density estimates obtained usingthe R function bkde() in the package KernSmooth (Wand & Ripley, 2015) with directplug-in bandwidth selection via the function dpik() . The density estimates should bevery close to the actual posterior density functions since they are based on one millionposterior draws.We see from Figure 7 that expectation propagation achieves excellent accuracy for thefixed effect parameters, in keeping with the simulation studies of Kim & Wand (2017). Thevariance parameter posterior density estimates are not as good for this particular examplewith accuracy scores of about 72% and 83%. Such mediocre accuracy was not apparentin the Kim & Wand (2017) simulations although their Figures 9 and 11 show accuraciesfor variance parameters being substantially lower than that those for fixed effect param-eters. We ran the code that produced Figure 7 on some simulated data and got accuracyscores in the 85%-95% range for the variance parameters. Further research is needed togain a fuller understanding of the accuracy of expectation propagation in the generalizedadditive mixed model context corresponding to this example.
The fragments listed in Table 2 and covered in Section 3 are the most fundamental onesfor generalized, linear and mixed models. Whilst these fragments support expectationpropagation fitting of a wide range of models, additional fragments are needed for variouselaborations. We now illustrate this fact by investigating fragments needed for (a) theextension to multivariate random effects, and (b) models where the response variable ismodeled according to the t distribution. As we will see, expectation propagation is quitenumerically challenging for such extensions. The fragments in Table 2 can handle the univariate random effects structure u | σ ∼ N (0 , σ ) but they do not cover the multivariate random effects extension: u | Σ ∼ N ( , Σ ) Σ is a unstructured d u × d u matrix.The fragment corresponding to the factor p ( u | Σ ) = (2 π ) − d u / | Σ | − / exp( − u T Σ − u ) is shown in Figure 8. u p( u | Σ ) Σ Figure 8:
The factor graph fragment corresponding the factor p ( u | Σ ) . Under the usual conjugacy constraints, the message from p ( u | Σ ) to Σ is m p ( u | Σ ) → Σ ( Σ ) = proj IW (cid:2) m Σ → p ( u | Σ ) ( Σ ) (cid:82) R d u p ( u | Σ ) m u → p ( u | Σ ) ( u ) d u /Z (cid:3) m Σ → p ( u | Σ ) ( Σ ) (25)where proj IW denotes projection onto the d u -dimensional Inverse Wishart family. The mes-sages on the right-hand side of (25) have the form m Σ → p ( u | Σ ) ( Σ ) = exp (cid:34) log | Σ | vec ( Σ − ) (cid:35) T η Σ → p ( u | Σ ) and m u → p ( u | Σ ) ( u ) = exp (cid:34) u vec ( uu T ) (cid:35) T η u → p ( u | Σ ) Introducing the shorthand η ♥ ≡ η Σ → p ( u | Σ ) and η ♦ ≡ η u → p ( u | Σ ) , arguments analogous to those given in Appendix 6 lead to the function of Σ inside theproj IW [ · ] in (25) being proportional to | Σ | η ♥ | Σ vec − ( η ♦ ) − I | − / tr { Σ − vec − ( η ♥ ) }× exp (cid:104) − ( η ♦ ) T { Σ vec − ( η ♦ ) − I }{ vec − ( η ♦ ) } − η ♦ (cid:105) . (26)The next step is to compute E { log | Σ |} and E ( Σ − ) with expectation with respect to thedensity function obtained by normalizing (26). This is a particularly challenging numericalproblem since it involves numerical integration of the cone of d u × d u symmetric positivedefinite matrices. Then η p ( u | Σ ) → Σ = ( ∇ A ) − IW E { log | Σ |} E { vech ( Σ − ) } . (27)Note that the function ( ∇ A ) IW admits the explicit form ( ∇ A ) IW (cid:18)(cid:20) η η (cid:21)(cid:19) = log (cid:12)(cid:12)(cid:12) − vech − ( η ) (cid:12)(cid:12)(cid:12) − d u (cid:88) j =1 digamma (cid:0) − η − ( d u + j ) (cid:1) { η + ( d u + 1) } vech [ { vech − ( η ) } − ] where [ η η T ] T is the partition of the natural parameter vector into the first entry ( η ) andthe remaining d u ( d u + 1) entries ( η ). However, evaluation of (27) involves numericalinversion of ( ∇ A ) IW in { d u ( d u + 1) } -dimensional space.In conclusion, literal application of expectation propagation for multivariate randomeffects is quite daunting and effective implementation for even ≤ d u ≤ is a verychallenging numerical problem. 22 .2 t Likelihood
The Gaussian fragment, treated in Section 3.6, corresponds to the specification y | α, σ ∼ N ( α, σ ) . Now consider the extension to the t distribution: y | α, σ , ν ∼ t ( α, σ , ν ) (28)where ν > is the degrees of freedom parameter. Low values of ν correspond to heavy-tailed distributions. The Gaussian likelihood is the ν → ∞ limiting case. The densityfunction corresponding to (28) is p ( y | α, σ , ν ) = Γ (cid:0) ν +12 (cid:1) √ πν Γ( ν/ { y − α ) / ( νσ ) } ν +12 . One could work with this density function in the expectation propagation message equa-tions (8) and (9), but trivariate numerical integration is required. In other Bayesian com-putation contexts such as Markov chain Monte Carlo (e.g. Verdinelli & Wasserman, 1991)and variational message passing (e.g. McLean & Wand, 2018) it is common to replace (28)by the auxiliary variable representation y | α, σ , a ∼ N ( α, aσ ) , a | ν ∼ Inverse- χ ( ν, ν ) (29)to aid tractability. Expectation propagation also benefits from this representation of the t -likelihood specification. The fragments corresponding to the factor product p ( y | α, σ , a ) p ( a | ν ) p ( ν ) (30)are shown in Figure 9. σ α p(y| α , σ ,a) a p(a| ν ) ν p( ν ) Figure 9:
Color-coded fragments corresponding to the factors p ( y | α, σ , a ) , p ( a | ν ) and p ( ν ) ap-pearing in (30). None of these fragments are among those treated in Section 3. Therefore, extensionto t likelihood models requires expectation propagation updates for these three new frag-ments. Unfortunately, as we will see, difficult numerical challenges arise for these updates.We now focus on each one in turn. p ( y | α, σ , a ) Fragment
The factor for this fragment is p ( y | α, σ , a ) = (2 π aσ ) − / exp (cid:26) − ( y − α ) aσ (cid:27) . Conjugacy considerations dictate the assumption:all messages passed to α from factors outside of thefragment are in the Univariate Normal family and allmessages passed to either a of σ from factors outsideof the fragment are in the Inverse Chi-Squared family. (31)23his leads to the factor to stochastic node messages taking the forms: m p ( y | α, σ , a ) → α ( α ) = exp (cid:40)(cid:20) αα (cid:21) T η p ( y | α, σ , a ) → α (cid:41) , m p ( y | α, σ , a ) → σ ( σ ) = exp (cid:40)(cid:20) log( σ )1 /σ (cid:21) T η p ( y | α, σ , a ) → σ (cid:41) and m p ( y | α, σ , a ) → a ( a ) = exp (cid:40)(cid:20) log( a )1 /a (cid:21) T η p ( y | α, σ , a ) → a (cid:41) . The derivations of the natural parameter updates are similar in nature to those given inAppendix S.2.6 for the Gaussian fragment. However, the form aσ (rather than σ ) inthe variance means that the natural parameter updates require evaluation of the bivariateintegral-defined function B ( p, q , q , r , r , s, t, u ) ≡ (cid:90) ∞−∞ (cid:90) ∞−∞ x p exp { q x + q x − r e x − r e x − se x + x / ( t + e x + x ) } ( t + e x + x ) u dx dx for p ≥ , q , q ∈ R , r , r > , s ≥ , t > , u > rather than the univariate integral-defined function B ( p, q, r, s, t, u ) given by (S.1). p ( a | ν ) Fragment
The relevant factor is p ( a | ν ) = ( ν/ ν/ Γ( ν/ a − ( ν/ − exp {− ( ν/ /a } , a, ν > . Let υ ≡ ν/ be a simple linear transformation of ν . For the remainder of this section wework with υ , rather than ν , since it leads to a simpler exposition. Now note that p ( a | υ ) ∝ exp log( a )1 /a T − υ − υ − as a function of a, exp υ log( υ ) − log { Γ( υ ) } υ T − /a − log( a ) as a function of υ .To ensure conjugacy we should then impose the restriction:all messages passed to a from factors outside of thefragment are in the Inverse Chi-Squared family and allmessages passed to either υ from factors outsideof the fragment are in the Moon Rock family. (32)The definition of the Moon Rock family is given in Table 1. The messages passed from p ( a | υ ) are then of the form m p ( a | υ ) → a ( a ) = exp log( a )1 /a T η p ( a | υ ) → a m p ( a | υ ) → υ ( υ ) = exp υ log( υ ) − log { Γ( υ ) } υ T η p ( a | υ ) → υ . The message m p ( a | υ ) → a ( a ) has a treatment similar to that for m p ( σ | a ) → σ ( σ ) and m p ( σ | a ) → a ( a ) in Appendix S.2.3 and m p ( by | α, σ ) → σ ( σ ) in Appendix S.2.6 with projection onto the In-verse Chi-Squared family, although bivariate numerical integration is required. On theother hand, m p ( a | υ ) → υ ( υ ) = proj MR (cid:104) m υ → p ( a | υ ) ( υ ) (cid:82) ∞ p ( a | υ ) m p ( a | υ ) → a ( a ) da (cid:14) Z (cid:105) m υ → p ( a | υ ) ( υ ) . where proj MR denotes projection onto the Moon Rock family. The function of υ inside theproj MR [ ] is proportional to h ( υ ) ≡ { υ υ (cid:14) Γ( υ ) } η (cid:91) +1 e η (cid:91) υ Γ( υ − η (cid:93) ) / ( υ + η (cid:93) ) υ − η (cid:93) where η (cid:93) ≡ η p ( a | υ ) → a and η (cid:91) ≡ η υ → p ( a | υ ) . Then η p ( a | υ ) → υ = ( ∇ A MR ) − (cid:32)(cid:34) (cid:82) ∞ { υ log( υ ) − log Γ( υ ) } h ( υ ) dυ (cid:14) (cid:82) ∞ h ( υ ) dυ (cid:82) ∞ υ h ( υ ) dυ (cid:14) (cid:82) ∞ h ( υ ) dυ (cid:35)(cid:33) where A MR ( η ) ≡ log (cid:20)(cid:90) ∞ { t t / Γ( t ) } η exp( η t ) dt (cid:21) is the log-partition function of the Moon Rock exponential family. This implies that ( ∇ A MR )( η ) = (cid:82) ∞ { t log( t ) − log Γ( t ) }{ t t / Γ( t ) } η exp( η t ) dt (cid:46) (cid:82) ∞ { t t / Γ( t ) } η exp( η t ) dt (cid:82) ∞ t { t t / Γ( t ) } η exp( η t ) dt (cid:46) (cid:82) ∞ { t t / Γ( t ) } η exp( η t ) dt . This particular exponential family is not well-studied and we are not aware of any pub-lished theory concerning the properties of ∇ A MR and ( ∇ A MR ) − . Standard analytic argu-ments can be used to show that the domain of ∇ A MR is H = (cid:26)(cid:20) η η (cid:21) : η ≥ , η + η < (cid:27) . It is conjectured that the image of H under ∇ A MR is T = (cid:26)(cid:20) τ τ (cid:21) : τ < τ log( τ ) − log Γ( τ ) (cid:27) . Figure 10 shows the domain of ∇ A MR and the conjectured domain of ( ∇ A MR ) − as well assome example mappings between the two spaces.Evaluation of ( ∇ A MR ) − is a non-trivial problem. It requires numerical inversion tech-niques such as Newton-Raphson iteration. Moreover, each of the iterative updates in-volves evaluation of ∇ A MR and, possibly, its first partial derivatives. None of these func-tions are available in closed form and require numerical integration.25 − − − η η T − τ τ Figure 10:
Illustration of the bijective maps between H and T for the Moon Rock exponentialfamily. The crosses and dotted lines depict five example η ∈ H and τ = ∇ A MR ( η ) ∈ T pairs.Since ∇ A MR is a bijective map, the crosses and dotted lines equivalently depict five example τ ∈ T and η = ( ∇ A MR ) − ( τ ) ∈ H pairs. p ( ν ) Fragment
This simple fragment has the factor to stochastic node message m p ( ν ) → ν ( ν ) ∝ p ( ν ) corresponding to the prior distribution on ν . The conjugate family of prior density func-tions is p ( ν ) ∝ { ( ν/ ν/ (cid:14) Γ( ν/ } A ν exp( − B ν ν ) , ν > . for hyperparameters A ν ≥ and B ν > A ν .In terms of υ = ν/ , the relevant message is m p ( υ ) → υ ( υ ) = exp υ log( υ ) − log { Γ( υ ) } υ T A ν − B ν . The previous two subsections make it clear that elaborations such as multivariate randomeffects and fancier likelihoods involve profound numerical challenges for the expectationpropagation paradigm. Table 3 summarizes the numerical challenges of all of the non-trivial fragments treated in this article.The first ten fragments in Table 3 have the attraction of requiring only numerical eval-uation of univariate integral within the families given by (S.1) and (S.2). The probit like-lihood fragment stands out as a special case of a likelihood that does not require any nu-merical methods for expectation propagation message passing.The last three fragments of Table 3 are considerably more demanding in terms of nu-merical analysis. In a recent article, Gelman et al. (2017) discuss the possibility of adoptingMonte Carlo methods to deal with difficult computational problems in expectation prop-agation, but we are not aware of any existing methodology of this type.26ragment name numeric. integrat. demands Kull.-Leib. projec. demandsGaussian prior none noneInverse Wishart prior none noneMoon Rock prior none noneIterated Inverse Chi Squared univariate quadrature inversion of log − digammaLinear comb. deriv. var. none noneMultiv. lin. comb. deriv. var. none noneGaussian univariate quadrature inversion of log − digammaLogistic likelihood univariate quadrature trivialProbit likelihood none trivialPoisson likelihood univariate quadrature trivialMultiple random effects multivariate quadrature inversion of a multivariatefunction t likelihood direct trivariate quadrature inversion of log − digammaand a non-explicit bivariatefunction t likelihood aux. var. bivariate quadrature inversion of log − digammaand a non-explicit bivariatefunctionTable 3: The numerical integration and Kullback-Leibler projection demands of the non-trivialfragments discussed in this article.
Acknowledgments
This research was supported by Australian Research Council Discovery Project DP180100597.
References
Azzalini, A. (2017). The R package sn : The skew-normal and skew-t distributions (version1.5). http://azzalini.stat.unipd.it/SN Carpenter, B., Gelman, A., Hoffman, M.D., Lee, D., Goodrich, B., Betancourt, M., Brubaker,M., Guo, J., Li, P. and Riddell, A. (2017). Stan: A probabilistic programming lan-guage.
Journal of Statistical Software , , Issue 1, 1–32.Gelman, A., Vehtari, A., Jyl¨anki, P., Sivula, T., Tran, D., Suhai, S., Blomstedt, P. Cun-ningham, J.P., Schiminovic, D. and Robert, C. (2017). Expectation propagation asa way of life: A framework for Bayesian inference on partitioned data. Unpublishedmanuscript ( arXiv:1412.4869v2 ).Guo, B.-N. and Qi, F. (2013). Refinements of lower bounds for polygamma functions. Proceedings of the American Mathematical Society , , 1007–1015.Guo, J., Gabry, J. and Goodrich, B. (2017). The R package rstan : R interface to Stan . R package (version 2.17.2). http://mc-stan.org .27arville, D.A. (2008). Matrix Algebra from a Statistician’s Perspective . New York: Springer.Kim, A.S.I. and Wand, M.P. (2016). The explicit form of expectation propagation for asimple statistical model.
Electronic Journal of Statistics , , 550–581.Kim, A.S.I. and Wand, M.P. (2017). On expectation propagation for generalized, linear andmixed models. Australian and New Zealand Journal of Statistics , , in press.Minka, T. (2005). Divergence measures and message passing. Microsoft Research TechnicalReport Series , MSR-TR-2005-173 , 1–17.McLean, M.W. and Wand, M.P. (2018). Variational message passing for elaborate responseregression models.
Bayesian Analysis , under revision.Nolan, T.H. and Wand, M.P. (2017). Accurate logistic variational message passing: alge-braic and numerical details.
Stat , , 102–112.Smyth, G. (2015). The R package statmod : Statistical modeling (version 1.4). http://http://cran.r-project.org Sommer, A. (1982).
Nutritional Blindness . New York: Oxford University Press.Wainwright, M.J. and Jordan, M.I. (2008). Graphical models, exponential families, andvariational inference.
Foundations and Trends in Machine Learning , , 1–305.Wand, M.P. (2017). Fast approximate inference for arbitrarily large semiparametric regres-sion models via message passing (with discussion). Journal of the American StatisticalAssociation , , 137–168.Wand, M.P. and Jones, M.C. (1993). Comparison of smoothing parameterizations in bivari-ate density estimation. Journal of the American Statistical Association , , 520–528.Wand, M.P. and Ripley, B.D. (2015). The R package KernSmooth . Functions for kernelsmoothing supporting Wand & Jones (1995) (version 2.23). https://cran.R-project.org . 28 upplement for:
Factor Graph Fragmentization of Expectation Propagation B Y W ILSON
Y. C
HEN AND M ATT
P. W
AND
University of Technology Sydney
S.1 Function Definitions
Expectation propagation algorithms that are based on the fragments in Section 3 havestraightforward implementation once a few key functions are identified. Many of the func-tions are simple but long-winded. However, they only have to be implemented once andafter that all fragment updates are simple.The functions can be divided into three types:1. functions defined via non-analytic integral families.2. a function defined via inversion of an established function3. functions that are explicit given function types 1. and 2.We now give details of each of these types in turn.
S.1.1 Functions Defined via Non-Analytic Integral Families
Two fundamental families of integrals for expectation propagation in linear model con-texts are: A ( p, q, r, s, t, u ) ≡ (cid:90) ∞−∞ x p exp( qx − rx ) dx ( x + sx + t ) u ,p ≥ , q ∈ R , r > , s ∈ R , t > s , u > and B ( p, q, r, s, t, u ) ≡ (cid:90) ∞−∞ x p exp { qx − re x − se x / ( t + e x ) } dx ( t + e x ) u ,p ≥ , q ∈ R , r > , s ≥ , t > , u > . (S.1)An additional family of non-analytic functions that we need is: C b ( p, q, r ) ≡ (cid:90) ∞−∞ x p exp { qx − rx − b ( x ) } dx, (S.2)where q ∈ R , r > and b : R → R is any function for which C b ( p, q, r ) exists.To avoid underflow and overflow working with logarithms and suitably modified in-tegrands is recommended. For example, C (0 , q, r ) = e M (cid:90) ∞−∞ exp { qx − rx − b ( x ) − M } dx where M ≡ sup { x ∈ R : qx − rx − b ( x ) } implies that log {C (0 , q, r ) } = M + log (cid:20)(cid:90) ∞−∞ exp { qx − rx − b ( x ) − M } dx (cid:21) Sine the last-written integrand has a maximum of , its values are not overly large or small.1 .1.2 Function Defined via Inversion Theorem 1 of Kim & Wand (2016) asserts that the function log − digamma is a bijectivemapping from R + onto R + . Therefore its inverse (log − digamma ) − : R + → R + is well-defined. Bounds given in Guo & Qi (2013) imply that x < (log − digamma ) − ( x ) < x for all x > . Therefore, (log − digamma ) − ( x ) ≈ geometric mean of x and x = 1 x √ and provides useful starting values for iterative inversion of log − digamma.In addition, care is required for evaluation of (log − digamma )( x ) for large x since directcomputation round-off error can lead to an erroneous answer of zero. Software such as thefunction logmdigamma() in the R package statmod (Smyth, 2015). S.1.3 Explicit Functions
The N (0 , density function and cumulative distribution functions are denoted by φ ( x ) ≡ (2 π ) − e − x / and Φ( x ) ≡ (cid:90) x −∞ φ ( t ) dt. We also define ζ ( x ) ≡ log { x ) } so that ζ (cid:48) ( x ) ≡ φ ( x ) / Φ( x ) . Stable computation of ζ (cid:48) is available from, for example, the function zeta() in the R package sn (Azzalini, 2017).The functions G N and G IG1 defined in Kim & Wand (2016) are also needed here. Thefunction G IG2 from Kim & Wand (2016) requires generalization to handle Half- t priors onstandard deviation parameters with arbitrary degrees of freedom. The generalization isdenoted by G IG3 . Each of G N , G IG1 and G IG3 depend on the functions defined in AppendicesS.1.1 and S.1.2 but otherwise ther are simple, albeit long-winded, functions with multiplevector arguments. Their definitions are given in Section A.4 of Kim & Wand (2016) andare repeated here for convenience. We also define the explicit functions H probit , H logistic and H Poisson . First set: α k, (cid:20) a a (cid:21) , (cid:20) b b (cid:21) , c c c = A (cid:18) k, a , − a , − c c , c − b c , c − b − (cid:19) and β k, (cid:96), v, w, (cid:20) a a (cid:21) , (cid:20) b b (cid:21) , c c c = B (cid:32) k, (cid:96) + c − − a , c c − c c − a , − b (cid:18) c c + b b (cid:19) , v, w (cid:33) . Next, define g ( (cid:96), v, w, a , b , c ) = (log − digamma ) − (cid:32) log (cid:40) β (0 , (cid:96) + 1 , v, w, a , b , c ) β (0 , (cid:96) − , v, w, a , b , c ) (cid:41) − β (1 , (cid:96) − , v, w, a , b , c ) β (0 , (cid:96) − , v, w, a , b , c ) (cid:33) . G N , G IG1 , G IG2 , and G IG3 : G N ( a , b ; c ) = (cid:34) α (2 , a , b , c ) α (0 , a , b , c ) − (cid:26) α (1 , a , b , c ) α (0 , a , b , c ) (cid:27) (cid:35) − (cid:20) α (1 , a , b , c ) /α (0 , a , b , c ) − / (cid:21) − a ,G IG1 a , (cid:20) b b (cid:21) ; c c c = − − g (0 , − b /c , , a , b , c ) − g (0 , − b /c , , a , b , c ) β (0 , − , − b /c , , a , b , c ) β (0 , , − b /c , , a , b , c ) − a and G IG3 (cid:18) a , (cid:20) b b (cid:21) ; k, (cid:96) (cid:19) = − − g k − , − b /(cid:96), (cid:96) − k/ − b , a , (cid:20) b (cid:21) , − g k − , − b /(cid:96), (cid:96) − k/ − b , a , (cid:20) b (cid:21) , × β , k − , − b /(cid:96), (cid:96) − k/ − b , a , (cid:20) b (cid:21) , β , k − , − b /(cid:96), (cid:96) − k/ − b , a , (cid:20) b (cid:21) , − a . This definition is a generalization of the function G IG2 given in Kim & Wand (2016, 2017)and is such that G IG3 (cid:18) a , (cid:20) b b (cid:21) ; k, (cid:19) = G IG2 (cid:18) a , (cid:20) b b (cid:21) ; k (cid:19) . Put H b (cid:18)(cid:20) a a (cid:21) ; y (cid:19) = C b (1 , a + y, − a ) / C b (0 , a + y, − a ) C b (2 , a + y, − a ) C b (0 , a + y, − a ) − C b (1 , a + y, − a ) C b (0 , a + y, − a ) − / C b (2 , a + y, − a ) C b (0 , a + y, − a ) − C b (1 , a + y, − a ) C b (0 , a + y, − a ) − (cid:20) a a (cid:21) for any a ∈ R , a < and y ∈ R and then let H logistic (cid:18)(cid:20) a a (cid:21) ; y (cid:19) ≡ H b (cid:18)(cid:20) a a (cid:21) ; y (cid:19) for b ( x ) = log(1 + e x ) and H Poisson (cid:18)(cid:20) a a (cid:21) ; y (cid:19) ≡ H b (cid:18)(cid:20) a a (cid:21) ; y (cid:19) for b ( x ) = e x . (S.3)Lastly, H probit (cid:18)(cid:20) a a (cid:21) ; y (cid:19) ≡ − a − ζ (cid:48) ( r ) (cid:8) r + ζ (cid:48) ( r ) (cid:9) a (1 − a )+(2 y − ζ (cid:48) ( r ) (cid:112) a (2 a − a (1 − a ) − a a where r ≡ (2 y − a / (cid:112) a (2 a − . 3 .2 Derivations We now provide derivations of each of algorithms given in Section 3. Definitions used inthe derivations are: A N ( η ) ≡ − ( η /η ) − log( − η ) and A I χ ( η ) ≡ ( η + 1) log( − η ) + log Γ( − η − for the log-partition functions of the Normal and Inverse Chi-Squared families respec-tively. In a similar vein, proj N [ · ] denotes Kullback-Leibler projection onto the (possiblyMultivariate) Normal family of density functions and proj I χ [ · ] denotes Kullback-Leiblerprojection onto the Inverse Chi-Squared family of density functions. We use Z to denotethe normalizing factor of the function inside a Kullback-Leibler projection operator. S.2.1 Derivation of Algorithm 1
Plugging into (9) we get m p ( θ ) → θ ( θ ) = proj N (cid:104) m θ → p ( θ ) ( θ ) p ( θ ) (cid:14) Z (cid:105) m θ → p ( θ ) ( θ ) = m θ → p ( θ ) ( θ ) p ( θ ) m θ → p ( θ ) ( θ ) = p ( θ ) . The second equality follows from the fact that m θ → p ( θ ) ( θ ) is a Multivariate Normal den-sity function, which is a consequence of (11). Since p ( θ ) ∝ exp (cid:8) − ( θ − µ θ ) T Σ − θ ( θ − µ θ ) (cid:9) ∝ exp (cid:34) θ vec ( θθ T ) (cid:35) T (cid:34) Σ − θ µ θ − vec ( Σ − θ ) (cid:35) we have m p ( θ ) → θ ( θ ) = exp (cid:34) θ vec ( θθ T ) (cid:35) T η p ( θ ) → θ where η p ( θ ) → θ = (cid:34) Σ − θ µ θ − vec ( Σ − θ ) (cid:35) . S.2.2 Derivation of Algorithm 2
Using arguments similar to those given in Section S.2.1 for the Gaussian prior fragmentlead to m p ( Θ ) → Θ ( Θ ) ∝ p ( Θ ) ∝ | Θ | − ( κ Θ + d Θ +1) / exp {− tr ( Λ Θ Θ − ) } = exp (cid:34) log | Θ | vec ( Θ − ) (cid:35) T (cid:34) − ( κ Θ + d Θ + 1) − vec ( Λ − Θ ) (cid:35) . Hence m p ( Θ ) → Θ ( Θ ) = exp (cid:34) log | Θ | vec ( Θ − ) (cid:35) T η p ( Θ ) → Θ where η p ( Θ ) → Θ = (cid:34) − ( κ Θ + d Θ + 1) − vec ( Λ − Θ ) (cid:35) . .2.3 Derivation of Algorithm 3 The message from p ( σ | a ) to σ is m p ( σ | a ) → σ ( σ ) ←− proj I χ (cid:2) m σ → p ( σ | a ) ( σ ) (cid:82) ∞ p ( σ | a ) m a → p ( σ | a ) ( a ) da (cid:14) Z (cid:3) m σ → p ( σ | a ) ( σ ) . (S.4)Assumption (13) implies that m a → p ( σ | a ) ( a ) = exp (cid:40)(cid:20) log( a )1 /a (cid:21) T η a → p ( σ | a ) (cid:41) = a η ♣ exp( η ♣ /a ) and m σ → p ( σ | a ) ( σ ) = exp (cid:40)(cid:20) log( σ )1 /σ (cid:21) T η σ → p ( σ | a ) (cid:41) = ( σ ) η ♠ exp( η ♠ /σ ) where η ♣ ≡ η a → p ( σ | a ) and η ♠ ≡ η σ → p ( σ | a ) . As a function of σ , the integral in (S.4) is (cid:90) ∞ ( ν/a ) ν/ Γ( ν/
2) ( σ ) − ( ν/ − exp {− ν/ ( σ a ) } a η ♣ exp( η ♣ /a ) da ∝ ( σ ) − ( ν/ − (cid:90) ∞ a −{− η ♣ +( ν/ − }− exp[ −{ ( ν/σ ) − η ♣ } /a ] da ∝ ( σ ) − ( ν/ − { ( ν/σ ) − η ♣ } η ♣ − ( ν/ . Therefore, the density function inside the proj I χ [ · ] is proportional to ( σ ) η ♠ − ( ν/ − { ( ν/σ ) − η ♣ } η ♣ − ( ν/ exp( η ♠ /σ ) and the numerator of (S.4) is proportional to exp (cid:40)(cid:20) log( σ )1 /σ (cid:21) T η numer (cid:41) where η numer ≡ ( ∇ A I χ ) − (cid:82) ∞ log( σ )( σ ) η ♠ − ( ν/ − { ( ν/σ ) − η ♣ } η ♣ − ( ν/ exp( η ♠ /σ ) dσ (cid:82) ∞ ( σ ) η ♠ − ( ν/ − { ( ν/σ ) − η ♣ } η ♣ − ( ν/ exp( η ♠ /σ ) dσ (cid:82) ∞ (1 /σ )( σ ) η ♠ − ( ν/ − { ( ν/σ ) − η ♣ } η ♣ − ( ν/ exp( η ♠ /σ ) dσ (cid:82) ∞ ( σ ) η ♠ − ( ν/ − { ( ν/σ ) − η ♣ } η ♣ − ( ν/ exp( η ♠ /σ ) dσ . Steps analogous to those given in Appendix A.5.4 of Kim & Wand (2016) can then beused to derive the η p ( σ | a ) → σ update. However, note that Algorithm 3 supports generalHalf-t ( A, ν ) prior distributions and Kim & Wand (2016) only deal with the ν = 1 (Half-Cauchy) special case. Hence, the function G IG2 in Kim & Wand (2016) has to be generalizedto the G IG3 function.Also, note that the η p ( σ | a ) → σ update has ε ←− rather than ←− to allow for the damp-ing adjustment defined by (10). The same adjustment applies to the remainder of thederivations given in Section S.2.The message from p ( σ | a ) to a has a similar form and arguments analogous to thosejust given for the message from p ( σ | a ) to σ lead to the update for η p ( σ | a ) → a given inAlgorithm 3. 5 .2.4 Derivation of Algorithm 4 Algorithm 4 is the d θ = 1 special case of Algorithm 5 and therefore its derivation is coveredby the general d θ case given next in Section S.2.5. S.2.5 Derivation of Algorithm 5
Algorithm 5 depends on Result 2 below, which extends Theorem 1 of Kim & Wand (2017)to multivariate linear combination derived variables.Let p Nd ( x ; µ , Σ ) = (2 π ) − d/ | Σ | − / exp {− ( x − µ ) T Σ − ( x − µ ) } denote the density function of a d -variate N ( µ , Σ ) random vector. Then we have: Result 2.
For all d × d (cid:48) matrices L such L T Σ L is positive definite and d (cid:48) × vectors v : (cid:90) R d p Nd ( x ; µ , Σ ) δ (cid:0) v − L T x (cid:1) d x = p Nd (cid:48) ( v ; L T µ , L T Σ L ) . (S.5) Derivation of Result 2.
Via the change of variable z = Σ − / ( x − µ ) , the left-hand side of (S.5) is (cid:90) R d p Nd ( z ; , I ) δ (cid:0) v − L T µ − L T Σ / z (cid:1) d z = (cid:90) R d p Nd ( z ; , I ) δ (cid:0) v † − ( L † ) T z (cid:1) d z where v † ≡ v − L T µ and L † ≡ Σ / L . Next note that (cid:90) R d p Nd ( x ; , I ) δ (cid:0) v † − ( L † ) T z (cid:1) d z = (cid:90) R d lim ε → p Nd ( z ; , I ) p Nd (cid:48) (cid:0) v † − ( L † ) T z ; , ε I (cid:1) d z = (2 π ) − d/ (cid:90) R d lim ε → (cid:2) (2 πε ) − d (cid:48) / exp { r ( z , v † , L † , ε ) } (cid:3) d z where r ( z , v † , L † , ε ) = − z T z − ε { v † − ( L † ) T z } T { v † − ( L † ) T z } = − ( z − m ) T { I + ε L † ( L † ) T } ( z − m ) − ε ( v † ) T v † + ε ( L † v † ) T { I + ε L † ( L † ) T } − L † v † with m = m ( v † , L † , ε ) ≡ ε ( I + ε LL T ) − ( L † ) T v † . We then have (cid:82) R d p Nd ( z ; , I ) δ (cid:0) v † − ( L † ) T z (cid:1) d z = lim ε → (cid:104) (2 πε ) − d (cid:48) / | I + ε L † ( L † ) T | − / × exp (cid:8) ε ( L † v † ) T { I + ε L † ( L † ) T } − L † v † − ε ( v † ) T v † (cid:9) (cid:105) (S.6)where we have used the fact (cid:90) R d (2 π ) − d/ | I + ε L † ( L † ) T | / exp (cid:110) − ( z − m ) T (cid:8) I + ε L † ( L † ) T (cid:9) ( z − m ) (cid:111) d z = 1 and assumed that the integrand possesses properties that are sufficient to justify inter-changing the limit as ε → and the integral over R d . Using Theorem 18.1.1 of Harville(2008), the determinant in (S.6) is | I + ε L † ( L † ) T | = ε − d (cid:48) | ε I + ( L † ) T L † | = ε − d (cid:48) | ε I + L T Σ L | . (S.7)6ext, application of Theorem 18.2.8 of Harville (2008) gives { I + ε L † ( L † ) T } − = I − L † (cid:8) ε I + ( L † ) T L † (cid:9) − ( L † ) T (S.8)which implies that ( L † v † ) T { I + ε L † ( L † ) T } − L † v † = ( v † ) T ( L † ) T L † v † − ( L † v † ) T L † (cid:8) ε I + ( L † ) T L † (cid:9) − ( L † ) T L † v † = ( v † ) T L T Σ L v † − ( v † ) T L T Σ L (cid:0) ε I + L T Σ L (cid:1) − L T Σ Lv † . The exponent in the expression on the right-hand side of (S.6) is then ε (cid:110) ( v † ) T L T Σ L v † − ( v † ) T L T Σ L (cid:0) ε I + L T Σ L (cid:1) − L T Σ Lv † − ε ( v † ) T v † (cid:111) = − ( v † ) T (cid:0) ε I + L T Σ L (cid:1) − v † = − ( v − L T µ ) T (cid:0) ε I + L T Σ L (cid:1) − ( v − L T µ ) . Substitution of this result and (S.7) into (S.6) then gives (cid:90) R d p Nd ( z ; , I ) δ (cid:0) v † − ( L † ) T z (cid:1) d z = lim ε → (cid:2) (2 π ) − d (cid:48) / | ε I + L T Σ L | − / exp {− ( v − L T µ ) − ( ε I + L T Σ L ) − ( v − L T µ ) } (cid:3) . = p Nd (cid:48) ( v ; L T µ , L T Σ L ) and the result follows.From (9) we have η δ ( α − A T θ ) → α ( α ) ←− proj N (cid:104) m α → δ ( α − A T θ ) ( α ) (cid:90) R d θ δ ( α − A T θ ) m θ → δ ( α − A T θ ) ( θ ) d θ (cid:14) Z (cid:105) m α → δ ( α − A T θ ) ( α ) (S.9)where d θ is the dimension of θ . It follows from assumption (15) and (8) that m α → δ ( α − A T θ ) ( α ) = exp (cid:40)(cid:20) α vec ( αα T ) (cid:21) T η α → δ ( α − A T θ ) (cid:41) (S.10)and m θ → δ ( α − A T θ ) ( θ ) = exp (cid:40)(cid:20) θ vec ( θθ T ) (cid:21) T η θ → δ ( α − A T θ ) (cid:41) ∝ (2 π ) − d θ / | Σ (cid:12) | − / exp {− ( θ − µ (cid:12) ) T Σ − (cid:12) ( θ − µ (cid:12) ) } where µ (cid:12) ≡ − (cid:110) vec − (cid:0) η θ → δ ( α − A T θ ) (cid:1) (cid:111) − (cid:16) η θ → δ ( α − A T θ ) (cid:17) and Σ (cid:12) ≡ − (cid:110) vec − (cid:0) η θ → δ ( α − A T θ ) (cid:1) (cid:111) − are the common parameters matching η θ → δ ( α − A T θ ) . From Result 2 in Appendix S.2.5, (cid:90) R d θ δ ( α − A T θ )(2 π ) − d θ / | Σ (cid:12) | − / exp {− ( θ − µ (cid:12) ) T Σ − (cid:12) ( θ − µ (cid:12) ) } d θ = | π A T Σ (cid:12) A | − / exp (cid:110) − ( α − A T µ (cid:12) ) T ( A T Σ (cid:12) A ) − ( α − A T µ (cid:12) ) (cid:111) . η δ ( α − A T θ ) → α ( α ) = proj N (cid:34) exp (cid:26)(cid:20) α vec ( αα T ) (cid:21) η α → δ ( α − A T θ ) (cid:27) exp (cid:40)(cid:20) α vec ( αα T ) (cid:21) T (cid:34) ( A T Σ (cid:12) A ) − A T µ (cid:12) − vec (cid:0) ( A T Σ (cid:12) A ) − (cid:1)(cid:35)(cid:41) (cid:44) Z (cid:35) exp (cid:26)(cid:20) α vec ( αα T ) (cid:21) η α → δ ( α − A T θ ) (cid:27) = exp (cid:40)(cid:20) α vec ( αα T ) (cid:21) T (cid:34) ( A T Σ (cid:12) A ) − A T µ (cid:12) − vec (cid:0) ( A T Σ (cid:12) A ) − (cid:1)(cid:35)(cid:41) . Therefore, setting Ω ←− − Σ (cid:12) A , we get η δ ( α − A T θ ) → α ←− ( Ω T A ) − Ω T (cid:16) η θ → δ ( α − A T θ ) (cid:17) vec (cid:0) ( Ω T A ) − (cid:1) which are the first two updates of Algorithm 4.For the third update, note that (9) gives m δ ( α − A T θ ) → θ ( θ ) ←− proj N (cid:104) m θ → δ ( α − A T θ ) ( θ ) (cid:90) R d α δ ( α − A T θ ) m α → δ ( α − A T θ ) ( α ) d α (cid:14) Z (cid:105) m θ → δ ( α − A T θ ) ( θ )= proj N (cid:104) m θ → δ ( α − a T θ ) ( θ ) m α → δ ( α − A T θ ) ( A T θ ) /Z (cid:105) m θ → δ ( α − a T θ ) ( θ )= proj N (cid:34) exp (cid:40)(cid:20) θ vec ( θθ T ) (cid:21) T η θ → δ ( α − A T θ ) (cid:41) exp (cid:40)(cid:20) A T θ vec (cid:0) ( A T θ )( A T θ ) T (cid:1)(cid:21) T η α → δ ( α − A T θ ) (cid:41) (cid:44) Z (cid:35) exp (cid:40)(cid:20) θ vec ( θθ T ) (cid:21) T η θ → δ ( α − A T θ ) (cid:41) = exp (cid:20) θ vec ( θθ T ) (cid:21) T A (cid:16) η α → δ ( α − a T θ ) (cid:17) ( A ⊗ A ) (cid:16) η α → δ ( α − a T θ ) (cid:17) . The last step uses the identity vec ( BCD ) = ( D T ⊗ B ) vec ( C ) for any compatible matrices B , C and D . The third update follows immediately. S.2.6 Derivation of Algorithm 6
The message from p ( y | α, σ ) to α is, from (9), m p ( y | α, σ ) → α ( α ) = proj N (cid:2) m α → p ( y | α, σ ) ( α ) (cid:82) ∞ p ( y | α, σ ) m σ → p ( y | α, σ ) ( σ ) dσ (cid:3) m α → p ( y | α, σ ) ( α ) . (S.11)It follows from (8) and (18) that m σ → p ( y | α, σ ) ( σ ) = exp (cid:40)(cid:20) log( σ )1 /σ (cid:21) T η σ → p ( y | α, σ ) (cid:41) = ( σ ) η (cid:1) exp( η (cid:1) /σ ) . η (cid:1) ≡ η σ → p ( y | α, σ ) . Hence (cid:90) ∞ p ( y | α, σ ) m σ → p ( y | α, σ ) ( σ ) dσ = (2 π ) − / (cid:90) ∞ ( σ ) η (cid:1) − / exp[ { η (cid:1) − ( y − α ) } /σ )] dσ ∝ (cid:8) ( y − α ) − η (cid:1) (cid:9) η (cid:1) + 12 . Therefore, letting η (cid:52)(cid:53) ≡ η α → p ( y | α, σ ) , the numerator of the right-hand side of (S.11) isproj N exp( αη (cid:52)(cid:53) + α η (cid:52)(cid:53) ) (cid:8) ( y − α ) − η (cid:1) (cid:9) − η (cid:1) − = exp (cid:34) αα (cid:35) T ( ∇ A N ) − (cid:82) ∞−∞ α exp( αη (cid:52)(cid:53) + α η (cid:52)(cid:53) ) (cid:14)(cid:8) ( y − α ) − η (cid:1) (cid:9) − η (cid:1) − dα (cid:82) ∞−∞ exp( αη (cid:52)(cid:53) + α η (cid:52)(cid:53) ) (cid:14)(cid:8) ( y − α ) − η (cid:1) (cid:9) − η (cid:1) − dα (cid:82) ∞−∞ α exp( αη (cid:52)(cid:53) + α η (cid:52)(cid:53) ) (cid:14)(cid:8) ( y − α ) − η (cid:1) (cid:9) − η (cid:1) − dα (cid:82) ∞−∞ exp( αη (cid:52)(cid:53) + α η (cid:52)(cid:53) ) (cid:14)(cid:8) ( y − α ) − η (cid:1) (cid:9) − η (cid:1) − dα = exp (cid:34) αα (cid:35) T ( ∇ A N ) − A (1 ,η (cid:52)(cid:53) , − η (cid:52)(cid:53) , − y,y − η (cid:1) , − η (cid:1) −
12 ) A (0 ,η (cid:52)(cid:53) , − η (cid:52)(cid:53) , − y,y − η (cid:1) , − η (cid:1) −
12 ) A (2 ,η (cid:52)(cid:53) , − η (cid:52)(cid:53) , − y,y − η (cid:1) , − η (cid:1) −
12 ) A (0 ,η (cid:52)(cid:53) , − η (cid:52)(cid:53) , − y,y − η (cid:1) , − η (cid:1) −
12 ) . Hence η p ( y | α, σ ) → α ←− ( ∇ A N ) − A (1 ,η (cid:52)(cid:53) , − η (cid:52)(cid:53) , − y,y − η (cid:1) , − η (cid:1) −
12 ) A (0 ,η (cid:52)(cid:53) , − η (cid:52)(cid:53) , − y,y − η (cid:1) , − η (cid:1) −
12 ) A (2 ,η (cid:52)(cid:53) , − η (cid:52)(cid:53) , − y,y − η (cid:1) , − η (cid:1) −
12 ) A (0 ,η (cid:52)(cid:53) , − η (cid:52)(cid:53) , − y,y − η (cid:1) , − η (cid:1) −
12 ) − η α → p ( y | α, σ ) and the first update in Algorithm 6 follows from the definition of G N given in Section S.1.3.The message p ( y | α, σ ) to σ is m p ( by | α, σ ) → σ ( σ ) = proj I χ (cid:2) m σ → p ( y | α, σ ) ( σ ) (cid:82) ∞−∞ p ( y | α, σ ) m α → p ( y | α, σ ) ( α ) dα (cid:3) m σ → p ( y | α, σ ) ( σ ) . (S.12)It follows from (8) and (18) that m α → p ( y | α, σ ) ( α ) = exp (cid:40)(cid:20) αα (cid:21) T η α → p ( y | α, σ ) (cid:41) = exp (cid:0) α η ⊕ + α η ⊕ (cid:1) where η ⊕ ≡ η α → p ( y | α, σ ) . The integral in (S.12) is (cid:90) ∞−∞ (2 πσ ) − / exp (cid:26) − ( y − α ) σ (cid:27) exp (cid:0) α η ⊕ + α η ⊕ (cid:1) dα ∝ (cid:90) ∞−∞ (2 πσ ) − / exp (cid:26) − ( y − α ) σ (cid:27) { π ( σ ⊕ ) } − / exp (cid:26) − ( y − µ ⊕ ) σ ⊕ ) (cid:27) dα where (cid:34) µ ⊕ ( σ ⊕ ) (cid:35) ≡ (cid:34) − η ⊕ / (2 η ⊕ ) − / (2 η ⊕ ) (cid:35)
9s the common parameter vector corresponding to η ⊕ . Using (A.2) of Wand and Jones(1993), the last-written integral is { π ( σ + ( σ ⊕ ) } − / exp (cid:26) − ( y − µ ⊕ ) { σ + ( σ ⊕ ) } (cid:27) . Therefore, the function inside the proj I χ (cid:2) · ] in (S.12) is ( σ ) η (cid:1) exp( η (cid:1) /σ )[2 π { σ − / (2 η ⊕ ) } ] − / exp (cid:20) −{ y + η ⊕ / (2 η ⊕ ) } { σ − / (2 η ⊕ ) } (cid:21) . Plugging into (S.12) we then have m p ( by | α, σ ) → σ ( σ ) = exp (cid:40)(cid:20) log( σ )1 /σ (cid:21) T η p ( y | α, σ ) → σ (cid:41) where η p ( y | α, σ ) → σ ←− ( ∇ A I χ ) − (cid:82) ∞ log( σ )( σ ) η (cid:1) { σ − / (2 η ⊕ ) } − / exp (cid:20) η (cid:1) σ − { y + η ⊕ / (2 η ⊕ } { σ − / (2 η ⊕ } (cid:21) dσ (cid:82) ∞ ( σ ) η (cid:1) { σ − / (2 η ⊕ ) } − / exp (cid:20) η (cid:1) σ − { y + η ⊕ / (2 η ⊕ } { σ − / (2 η ⊕ } (cid:21) dσ (cid:82) ∞ (1 /σ )( σ ) η (cid:1) { σ − / (2 η ⊕ ) } − / exp (cid:20) η (cid:1) σ − { y + η ⊕ / (2 η ⊕ } { σ − / (2 η ⊕ } (cid:21) dσ (cid:82) ∞ ( σ ) η (cid:1) { σ − / (2 η ⊕ ) } − / exp (cid:20) η (cid:1) σ − { y + η ⊕ / (2 η ⊕ } { σ − / (2 η ⊕ } (cid:21) dσ − η (cid:1) . The change of variable σ = e − x and some simple algebra then leads to η p ( y | α, σ ) → σ ←− ( ∇ A I χ ) − −B (1 , − η (cid:1) , − η (cid:1) , − η ⊕ { y + η ⊕ / (2 η ⊕ ) } , − η ⊕ ,
12 ) B (0 , − η (cid:1) , − η (cid:1) , − η ⊕ { y + η ⊕ / (2 η ⊕ ) } , − η ⊕ ,
12 ) B (0 , − η (cid:1) , − η (cid:1) , − η ⊕ { y + η ⊕ / (2 η ⊕ ) } , − η ⊕ ,
12 ) B (0 , − η (cid:1) , − η (cid:1) , − η ⊕ { y + η ⊕ / (2 η ⊕ ) } , − η ⊕ ,
12 ) − η σ → p ( y | α, σ ) . The second update in Algorithm 6 follows from the definition of G IG1 given in Section S.1.3.
S.2.7 Derivation of Algorithm 7
The only factor to stochastic node message for the logistic fragment is, from (9): m p ( y | α ) → α ( α ) = proj [ m α → p ( y | α ) ( α ) p ( y | α ) (cid:14) Z ] m α → p ( y | α ) ( α ) with projection onto an appropriate exponential family. Assumption (19) and conjugacyconsiderations implies projection onto the Univariate Normal family. Setting η ≡ η α → p ( y | α ) , we then have m p ( y | α ) → α ( α ) = proj N [exp( η α + η α ) exp { yα − log(1 + e α ) } ]exp( η α + η α ) . N (cid:2) exp { ( η + y ) α + η α − log(1 + e α ) } (cid:3) ∝ exp (cid:20) αα (cid:21) T ( ∇ A N ) − (cid:82) ∞−∞ α exp { ( η + y ) α + η α − log(1+ e α ) } dα (cid:82) ∞−∞ exp { ( η + y ) α + η α − log(1+ e α ) } dα (cid:82) ∞−∞ α exp { ( η + y ) α + η α − log(1+ e α ) } dα (cid:82) ∞−∞ exp { ( η + y ) α + η α − log(1+ e α ) } dα ∝ exp (cid:20) αα (cid:21) T ( ∇ A N ) − C (1 ,η + y,η ) C (0 ,η + y,η ) C (2 ,η + y,η ) C (0 ,η + y,η ) . Hence η p ( y | α ) → α ←− ( ∇ A N ) − C (1 ,η + y,η ) C (0 ,η + y,η ) C (2 ,η + y,η ) C (0 ,η + y,η ) − η α → p ( y | α ) and the update in Algorithm 7 follows from the definition of H logistic given in Section S.1.3. S.2.8 Derivation of Algorithm 8
Via arguments analogous to those given in Section S.2.7, the factor to stochastic node nat-ural parameter update is η p ( y | α ) → α ←− ( ∇ A N ) − (cid:82) ∞−∞ α Φ((2 y − α ) exp { η α + η α } dα (cid:82) ∞−∞ Φ((2 y − α ) exp { η α + η α } dα (cid:82) ∞−∞ α Φ((2 y − α ) exp { η α + η α } dα (cid:82) ∞−∞ Φ((2 y − α ) exp { η α + η α } dα − η α → p ( y | α ) where, as before, η ≡ η α → p ( y | α ) . The integral results (cid:90) ∞−∞ Φ( a + bx ) φ ( x ) dx = Φ (cid:18) a √ b + 1 (cid:19) , (cid:90) ∞−∞ x Φ( a + bx ) φ ( x ) dx = b √ b + 1 φ (cid:18) a √ b + 1 (cid:19) and (cid:90) ∞−∞ x Φ( a + bx ) φ ( x ) dx = Φ (cid:18) a √ b + 1 (cid:19) − ab (cid:112) ( b + 1) φ (cid:18) a √ b + 1 (cid:19) , (S.13)and standard algebraic manipulations lead to η p ( y | α ) → α ←− H probit (cid:16) η α → p ( y | α ) (cid:17) where H probit is defined in Section S.1.3. S.2.9 Derivation of Algorithm 9