Automatic variational inference with cascading flows
Luca Ambrogioni, Gianluigi Silvestri, Marcel van Gerven

Abstract
The automation of probabilistic reasoning is one of the primary aims of machine learning. Recently, the confluence of variational inference and deep learning has led to powerful and flexible automatic inference methods that can be trained by stochastic gradient descent. In particular, normalizing flows are highly parameterized deep models that can fit arbitrarily complex posterior densities. However, normalizing flows struggle in highly structured probabilistic programs as they need to relearn the forward-pass of the program. Automatic structured variational inference (ASVI) remedies this problem by constructing variational programs that embed the forward-pass. Here, we combine the flexibility of normalizing flows and the prior-embedding property of ASVI in a new family of variational programs, which we named cascading flows. A cascading flows program interposes a newly designed highway flow architecture in between the conditional distributions of the prior program so as to steer it toward the observed data. These programs can be constructed automatically from an input probabilistic program and can also be amortized automatically. We evaluate the performance of the new variational programs in a series of structured inference problems. We find that cascading flows have much higher performance than both normalizing flows and ASVI in a large set of structured inference problems.
1. Introduction
The aim of probabilistic programming is to provide a fully automated software system for statistical inference on arbitrary user-specified probabilistic models (also referred to as probabilistic programs) (Milch et al., 2007; Sato, 1997; Kersting and De Raedt, 2007; Pfeffer, 2001; Park et al., 2005).

*Equal contribution. Donders Centre for Cognition, Radboud University, Netherlands; One Planet, Netherlands. Correspondence to: Luca Ambrogioni.
2. Related work
Automatic algorithms for VI date back to VIBES (Bishop et al., 2002; Bishop and Winn, 2003). These older works are based on self-consistency equations and variational message passing (Winn and Bishop, 2005). The use of gradient-based automatic VI for probabilistic programming was pioneered in (Wingate and Weber, 2013) using a mean field approach. Automatic differentiation VI (ADVI) perfected this approach by exploiting the automatic-differentiation tools developed in the deep learning community (Kucukelbir et al., 2017). ADVI uses fixed invertible transformations to map each variable into an unbounded domain where the density can be modeled with either univariate or multivariate normal distributions. Structured stochastic VI (Hoffman and Blei, 2015) exploits the local conjugate update rules of locally conjugate models. This has the major benefit of preserving the local structure of the probabilistic program, but it severely limits the class of possible input programs since local conjugacy is a stringent condition. Automatic structured VI (ASVI) lifts this constraint as it applies a trainable convex-update rule to non-conjugate links (Ambrogioni et al., 2021). Several other papers introduced new techniques for constructing structured variational programs. Copula VI introduces the use of vine copulas to model the statistical coupling between the latent variables in the program (Tran et al., 2015). Similarly, hierarchical VI uses auxiliary latent variables to achieve the same goal (Ranganath et al., 2016). Neither of those approaches is truly automatic, however, as they do not prescribe a unique way to construct the coupling functions.

Normalizing flows are very expressive and highly parameterized models that use deep learning components to approximate complex multivariate densities (Rezende and Mohamed, 2015). Normalizing flows can approximate arbitrarily complex densities and can therefore be turned into powerful automatic VI methods (Rezende and Mohamed, 2015; Kingma et al., 2016). However, conventional normalizing flows do not exploit the input probabilistic program. Structured conditional continuous normalizing flows are a newly introduced class of flows constrained to have the same conditional independence structure as the true posterior (Weilbach et al., 2020). However, these flow architectures only inherit the graphical structure of the input program, without incorporating the forward pass (i.e. the form of the conditional distributions and link functions).

Modern deep probabilistic programming frameworks provide automatic implementations of many variational programs. For example, PyMC (Patil et al., 2010) provides an implementation of ADVI, Pyro (Bingham et al., 2019) implements the automatic construction of mean field, multivariate normal and normalizing flow programs (i.e. variational guides), and TensorFlow Probability also offers an automatic implementation of ASVI (Dillon et al., 2017; Ambrogioni et al., 2021).

Our new highway flow architecture is a key component of our new approach and it is inspired by highway networks (Srivastava et al., 2015). To the best of our knowledge, highway flows are the first form of residual-like flows with a tractable Jacobian determinant that can be expressed in closed form. Existing residual flows express the log determinant of the Jacobian as an infinite series that needs to be estimated using a "Russian roulette" estimator (Chen et al., 2019).
We achieve tractability by applying a highway operator to each single layer of the network. Our use of auxiliary variables is similar to the approach adopted in (Dupont et al., 2019) and (Weilbach et al., 2020). The bound we use for training the augmented model was initially derived in (Ranganath et al., 2016). To the best of our knowledge, we are the first to exploit the statistics of the auxiliary variables to implement inference amortization. Structured variational programs have the clearest benefits in timeseries analysis problems (Eddy, 1996; Foti et al., 2014; Johnson and Willsky, 2014; Karl et al., 2016; Fortunato et al., 2017).
3. Preliminaries
In this section we introduce the notation used in the rest of the paper and discuss in detail the methods that form the basis for our work.
We denote the values of (potentially multivariate) random variables using lowercase notation and arrays of random variables using boldface notation. We consider probabilistic programs constructed as a chain of conditional probabilities over an array of random variables $\mathbf{x} = (x_1, \dots, x_N)$. We denote the array of values of the parents of the $j$-th variable as $\pi_j \subseteq \{x_i\}_{i \neq j}$, which is a sub-array of parent variables such that the resulting graphical model is a directed acyclic graph. Using this notation, we can express the joint density of a probabilistic program as follows:

$$p(\mathbf{x}) = \prod_{j=1}^{N} \rho_j(x_j \mid \theta_j(\pi_j)), \qquad (1)$$

where $\theta_j(\pi_j)$ is a link function that maps the values of the parents of the $j$-th variable to the parameters of its density $\rho_j(\cdot \mid \cdot)$. For example, if the density is Gaussian the link function outputs the value of its mean and scale given the values of the parents. Note that $\theta_j$ is a constant when the array of parents is empty.

An automatic structured variational inference method provides an algorithm that takes a probabilistic program as input and outputs a variational program $q(\mathbf{x})$. Convex-update variational programs (Ambrogioni et al., 2021) are defined by the following transformation:

$$p(\mathbf{x}) \overset{\text{CU}}{\mapsto} q(\mathbf{x}) = \prod_{j}^{N} U_{\lambda_j}^{\alpha_j}\, \rho_j(x_j \mid \theta_j(\pi_j)), \qquad (2)$$

with convex-update operator

$$U_{\lambda_j}^{\alpha_j}\, \rho_j(x_j \mid \theta_j(\pi_j)) = \rho_j\!\left(x_j \mid \lambda_j \odot \theta_j(\pi_j) + (1 - \lambda_j) \odot \alpha_j\right), \qquad (3)$$

where $\lambda_j$ and $\alpha_j$ are (potentially vector-valued) learnable parameters and $\odot$ is the element-wise product. This reduces to the prior probabilistic program for $\lambda_j \to 1$ and to a mean field variational program for $\lambda_j \to 0$. Convex-update ASVI has the advantage of preserving the forward-pass of the probabilistic program, which is often a major source of statistical coupling in the posterior distribution. The variational program can be trained to fit the posterior by maximizing the evidence lower bound (ELBO) by stochastic gradient descent (SGD) on the differentiable parameters $\lambda_j$ and $\alpha_j$.

Normalizing flows express the density of arbitrary multivariate distributions as the push-forward of a known base distribution through a differentiable invertible function (i.e. a diffeomorphism). The base distribution is usually a spherical normal distribution, hence the "normalizing" in the name. Consider a $d$-variate probability density $p(z)$ and a diffeomorphism $\Psi: \mathbb{R}^d \to \mathbb{R}^d$. We can derive the density of the new random variable $x = \Psi(z)$ using the change of variables formula

$$p_X(x) = \left| \det J(\Psi^{-1}(x)) \right|^{-1} p(\Psi^{-1}(x)), \qquad (4)$$

where $\Psi^{-1}$ is the inverse transformation and $J(z)$ is the Jacobi matrix of $\Psi$. We can write this more compactly using the push-forward operator $T_\Psi$ associated to $\Psi$ that maps the density $p(z)$ to the density $p_X(x)$ given in Eq. 4: $p_X(x) = T_\Psi[p(\cdot)](x)$.

Now consider a flexible family of functions $\Psi_w(x)$ parameterized by the "weights" $w$. We can use Eq. 4 to approximate arbitrarily complicated target densities by training $w$ by SGD. Given a probabilistic program $p(\mathbf{x}, \mathbf{y})$ with $\mathbf{y}$ observed, we can approximate the posterior $p(\mathbf{x} \mid \mathbf{y})$ by maximizing the evidence lower bound:

$$\text{ELBO}(w) = \mathbb{E}_w\!\left[ \log \frac{p(\mathbf{x}, \mathbf{y})}{\left| \det J_w(\Psi_w^{-1}(\mathbf{x})) \right|^{-1} p(\Psi_w^{-1}(\mathbf{x}))} \right], \qquad (5)$$

where the expectation is taken with respect to the transformed distribution $T_{\Psi_w}[p(\cdot)](\mathbf{x})$.
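As a quick numerical illustration of Eq. 4, the following self-contained sketch (ours, not from the paper's released code) pushes a standard normal through an affine diffeomorphism and checks the change of variables formula against the analytic density of the transformed variable:

```python
import torch

# Diffeomorphism Psi(z) = a*z + b, so Psi^{-1}(x) = (x - b)/a and |det J| = |a|.
a, b = 2.0, 1.0
base = torch.distributions.Normal(0.0, 1.0)

def pushforward_log_prob(x):
    # log p_X(x) = log p(Psi^{-1}(x)) - log|det J(Psi^{-1}(x))|   (Eq. 4)
    z = (x - b) / a
    return base.log_prob(z) - torch.log(torch.tensor(abs(a)))

x = torch.tensor(3.0)
analytic = torch.distributions.Normal(b, abs(a)).log_prob(x)  # X ~ N(1, 2^2)
assert torch.allclose(pushforward_log_prob(x), analytic)
```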
Since $\Psi$ is differentiable, we can obtain the reparameterization gradient for the ELBO as follows:

$$\nabla_w \text{ELBO}(w) = \mathbb{E}\!\left[ \nabla_w \log \frac{p(\Psi_w(z), \mathbf{y}) \left| \det J_w(z) \right|}{p(z)} \right], \qquad (6)$$

where the expectation is now taken with respect to the fixed distribution $p(z)$. In the following we refer to this approach as global flow (GF), since the transformation is applied synchronously to all variables in the model.
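For concreteness, the sketch below estimates the ELBO of Eqs. 5–6 by Monte Carlo, sampling from the fixed base distribution and pushing the samples through the flow. The `flow` and `log_joint` interfaces are our own illustrative assumptions, not a specific library API:

```python
import torch

def elbo_reparameterized(flow, base_dist, log_joint, num_samples=64):
    # Monte Carlo estimate of Eq. 6: the expectation is over the fixed base
    # density p(z), so gradients flow through Psi_w(z) and log|det J_w(z)|.
    z = base_dist.sample((num_samples,))
    x, log_det = flow(z)                             # x = Psi_w(z), log|det J_w(z)|
    log_q = base_dist.log_prob(z).sum(-1) - log_det  # density of x under the flow
    return (log_joint(x) - log_q).mean()

# Training loop sketch: ascend the ELBO by SGD on the flow weights, e.g.
# optimizer.zero_grad(); loss = -elbo_reparameterized(...); loss.backward(); optimizer.step()
```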
4. Variational inference with cascading flows
Now that we have covered the relevant background, we will introduce our new approach, which combines ASVI with normalizing flows.
The convex-update parameterization poses strong constraints on the form of the conditional posteriors. On the other hand, normalizing flow programs do not embed the potentially very complex structure of the input program. In this paper we combine the two approaches by replacing the convex-update operator in Eq. 2 with the push-forward operator used in normalizing flows. More precisely, consider a family of diffeomorphisms $\Psi_w = (\Psi^1_{w_1}, \dots, \Psi^N_{w_N})$ depending on the array of parameters $w = (w_1, \dots, w_N)$. Using these transformations, we can define the cascading flows variational program associated with a program defined by Eq. 1:

$$p(\mathbf{x}) \overset{\text{CF}}{\mapsto} q_w(\mathbf{x}) = \prod_{j}^{N} T_{w_j}\!\left[ \rho_j(\cdot \mid \theta_j(\pi_j)) \right](x_j), \qquad (7)$$

where we denoted the push-forward operator induced by the diffeomorphism $\Psi^j_{w_j}$ (as defined by Eq. 4) as $T_{w_j}$ to simplify the notation. A cascading flows variational program can be trained by SGD using gradient estimators of the ELBO with the reparameterization given in Eq. 6.

The name cascading flows comes from the fact that, since the transformation is performed locally for each conditional distribution, its effects cascade down the variational model through the (transformed) conditional dependencies of the prior program. However, in order to preserve the forward-pass of the probabilistic program it is crucial to use a parameterized diffeomorphism that can be initialized around the identity function and whose deviation from the identity can be easily controlled. Specifically, we opt for transformations of the following form:

$$\Psi^j_{w_j}(x) = \gamma x + (1 - \gamma) f_j(x; w_j), \qquad (8)$$

where $\gamma \in (0, 1]$. The resulting architecture of a cascading flows variational program is visualized in Fig. 1. From the diagram it is clear that a cascading flow is constructed by inserting transformation layers within the architecture of the probabilistic program. Transformations of the form given in Eq. 8 with tractable Jacobians are not currently present in the normalizing flow literature. Therefore, in the next subsection we will introduce a new flow architecture of this form, which we named highway flow.

Here we introduce highway flow blocks gated by a learnable scalar gate parameter. As also visualized in Fig. 2, a highway flow block is comprised of three separate highway layers (see the code sketch following the equations below):
Figure 1. Diagram of a cascading flows architecture. A) Detail of the architecture of an input probabilistic program. B) Detail of the associated cascading flows architecture, with CF layers interposed between the program layers.
Figure 2. Diagram of a highway flow block (upper-triangular layer, lower-triangular layer and activation).
1. Upper triangular highway layer:

$$l_U(z; U, \lambda) = \lambda z + (1 - \lambda)(U z + b_U) \qquad (9)$$
$$\log \det J_U = \sum_k \log\!\left( \lambda + (1 - \lambda) U_{kk} \right) \qquad (10)$$

2. Lower triangular layer:

$$l_L(z; L, \lambda) = \lambda z + (1 - \lambda)(L z + b_L) \qquad (11)$$
$$\log \det J_L = \sum_k \log\!\left( \lambda + (1 - \lambda) L_{kk} \right) \qquad (12)$$

3. Highway activation functions:

$$f(z; \lambda) = \lambda z + (1 - \lambda) g(z) \qquad (13)$$
$$\log \det J_f = \sum_k \log\!\left( \lambda + (1 - \lambda) \frac{\mathrm{d}g(x_k)}{\mathrm{d}x} \right) \qquad (14)$$

where $U$ is an upper-triangular matrix with positive-valued diagonal, $L$ is a lower-triangular matrix with ones in the diagonal and $g(x)$ is a differentiable non-decreasing activation function. A highway flow layer is a composition of these three types of layers:

$$l_h(\cdot) = f \circ l_L \circ l_U(\cdot). \qquad (15)$$

A highway network is a sequence of $M$ highway layers with a common gate parameter and different weights and biases. Note that a highway flow can be expressed in the form given in Eq. 8 by defining $\gamma$ as follows:

$$\gamma = \lambda^M, \qquad (16)$$

which is clearly equal to one (zero) when $\lambda$ is equal to one (zero).

The expressiveness of a normalizing flow is limited by the invertible nature of the transformation. This is particularly problematic for low-dimensional variables since the expressivity of many flow architectures depends on the dimensionality of the input (Papamakarios et al., 2019). This limitation can be a major shortcoming in cascading flows as they are particularly useful in highly structured problems characterized by a large number of coupled low-dimensional variables. Fortunately, we can remedy this limitation by introducing a set of auxiliary variables. For each variable $x_j$, we create a $D$-dimensional variable $\epsilon_j$ following a base distribution $p_j(\epsilon_j)$. We can now use an augmented diffeomorphism $\hat{\Psi}^j_{w_j}(x_j, \epsilon_j)$ and define the joint posterior over both $x_j$ and $\epsilon_j$:

$$q(x_j, \epsilon_j \mid \pi_j) = \hat{T}_{w_j}\!\left[ \rho_j(\cdot \mid \theta_j(\pi_j)) \, p_j(\cdot) \right](x_j, \epsilon_j), \qquad (17)$$

where the push-forward operator now transforms the (independent) joint density of $x_j$ and $\epsilon_j$. The conditional variational posterior is then obtained by marginalizing out $\epsilon_j$:

$$q(x_j \mid \pi_j) = \int q(x_j, \epsilon_j \mid \pi_j) \, \mathrm{d}\epsilon_j. \qquad (18)$$

This infinite mixture of flows is much more capable of modeling complex and potentially multi-modal distributions. The use of latent mixture models for VI was originally introduced in (Ranganath et al., 2016). Given a set of observations $\mathbf{y}$, the ELBO of any mixture variational probabilistic program can be lower bounded using Jensen's inequality:

$$\mathbb{E}_{\mathbf{x}}\!\left[ \log \frac{p(\mathbf{x}, \mathbf{y})}{\int q(\mathbf{x}, \boldsymbol{\epsilon}) \, \mathrm{d}\boldsymbol{\epsilon}} \right] \geq \underbrace{\mathbb{E}_{\mathbf{x}, \boldsymbol{\epsilon}}\!\left[ \log \frac{p(\mathbf{x}, \mathbf{y}) \, r(\boldsymbol{\epsilon})}{q(\mathbf{x}, \boldsymbol{\epsilon})} \right]}_{\text{Augmented ELBO}}, \qquad (19)$$

where $q(\boldsymbol{\epsilon}) = \int q(\mathbf{x}, \boldsymbol{\epsilon}) \, \mathrm{d}\mathbf{x}$ is the marginal variational posterior and $r(\boldsymbol{\epsilon})$ is an arbitrarily chosen distribution over the auxiliary variables. This result justifies the use of the augmented ELBO for training the variational posterior. The gap between the mixture ELBO (Eq. 19) and the augmented ELBO can be reduced by parameterizing $r(\boldsymbol{\epsilon})$ and training its parameters by maximizing the augmented ELBO so as to match $q(\boldsymbol{\epsilon})$.

The highway flow architecture can be easily adapted to the augmented variable space. Since there is no need to gate the flow of the auxiliary variables, the scalar $\lambda$ in
Eq. 9, Eq. 11 and Eq. 13 should be replaced by a vector whose entries corresponding to the original variables are equal to $\lambda$ while the entries corresponding to the auxiliary variables are equal to zero.
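The following sketch implements one highway flow block (Eqs. 9–15) together with its closed-form log-Jacobian-determinant. Class, parameter names and initialization are our own illustrative assumptions, under a scalar gate shared by the three layers:

```python
import torch
import torch.nn.functional as F

class HighwayFlowBlock(torch.nn.Module):
    """One highway flow block (Eqs. 9-15) with a shared scalar gate."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.W_u = torch.nn.Parameter(0.01 * torch.randn(dim, dim))  # upper-tri weights
        self.b_u = torch.nn.Parameter(torch.zeros(dim))
        self.W_l = torch.nn.Parameter(0.01 * torch.randn(dim, dim))  # lower-tri weights
        self.b_l = torch.nn.Parameter(torch.zeros(dim))
        self.gate_logit = torch.nn.Parameter(torch.tensor(3.0))      # lambda = sigmoid(.)

    def forward(self, z):
        lam = torch.sigmoid(self.gate_logit)
        # 1. Upper-triangular layer with positive diagonal (Eqs. 9-10).
        diag_u = F.softplus(torch.diagonal(self.W_u))
        U = torch.triu(self.W_u, diagonal=1) + torch.diag(diag_u)
        z = lam * z + (1 - lam) * (z @ U.T + self.b_u)
        log_det = torch.log(lam + (1 - lam) * diag_u).sum()
        # 2. Lower-triangular layer with unit diagonal (Eqs. 11-12):
        #    each term is log(lam + (1 - lam) * 1) = 0, so no log-det contribution.
        L = torch.tril(self.W_l, diagonal=-1) + torch.eye(self.dim)
        z = lam * z + (1 - lam) * (z @ L.T + self.b_l)
        # 3. Gated monotone activation (Eqs. 13-14), here g = softplus, g' = sigmoid.
        out = lam * z + (1 - lam) * F.softplus(z)
        log_det = log_det + torch.log(lam + (1 - lam) * torch.sigmoid(z)).sum(-1)
        return out, log_det

block = HighwayFlowBlock(dim=3)
x, log_det = block(torch.randn(8, 3))  # batch of 8 samples with log|det J|
```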
Figure 3. Backward auxiliary coupling. A) Example graphical model. B) Cascading flows model coupled with a reversed local linear Gaussian model.
So far, we outlined a model that has the same conditional independence structure as the prior and is therefore incapable of modeling dependencies arising from colliders. Fortunately, the local auxiliary variables that we introduced in order to increase the flexibility of the local flows can assume a second role as "input ports" that can induce non-local couplings. In fact, by coupling their auxiliary variables we can induce statistical dependencies between any subset of variables in the model. Importantly, this coupling can take a very simple form since additional complexity can be modeled by the highway flows. We will now introduce a local auxiliary model with a non-trivial conditional independence structure inspired by the forward-backward algorithm. The graphical structure of the posterior probabilistic program is a sub-graph of the moralization of the prior graph (Bishop, 2006). Therefore, in order to be able to capture all the dependencies in the posterior, it suffices to couple the auxiliary variables by reversing the arrows of the original prior model. In fact, all the parents of every node get coupled by the reversed arrows. As an example, consider the following probabilistic model (Fig. 3A): $\rho(x_4 \mid x_2, x_3)\, \rho(x_3)\, \rho(x_2 \mid x_1)\, \rho(x_1)$. We couple the auxiliary variables by reversing the arrows, as formalized in the graphical model $p(\epsilon_4)\, p(\epsilon_3 \mid \epsilon_4)\, p(\epsilon_2 \mid \epsilon_4)\, p(\epsilon_1 \mid \epsilon_2)$. The graphical structure of the resulting augmented model is shown in Fig. 3B. The auxiliary variables can be coupled in a very simple manner since the diffeomorphisms can add arbitrary complexity to the final joint density. Here we use a simple linear normal model:

$$\epsilon_k \mid \upsilon_k = \sum_{j=1}^{K} a_j \odot \upsilon_j + a_0 \odot \xi_k, \qquad (20)$$

where $a_j \geq 0$ and $\sum_n a_n = 1$. In this expression, $\xi_k$ is a standard Gaussian vector and $\upsilon_k = (\upsilon_1, \dots, \upsilon_K)$ is an array containing all the parents' auxiliary variables (the auxiliaries of the children of $x_k$ in the original graph).

We will now discuss a strategy to implement automatic inference amortization in cascading flows models. The problem of inference amortization is closely connected to the problem of modeling collider dependencies, as the latter arise from the backward messages originating from observed nodes (Bishop, 2006). It is therefore not a coincidence that we can implement both collider dependencies and amortization by shifting the statistics of the local auxiliary variables. This results in a very parsimonious amortization scheme where data streams are fed locally to the various nodes and propagate backward through the linear network of auxiliary Gaussian variables. We denote the set of auxiliary variables associated to the children of $x_k$ in the input probabilistic program as $\upsilon_k$. Furthermore, we denote the observed value of $x_k$ as $y_k$ and the set of observed values of the children of $x_k$ as $\mathbf{y}_k = (y_{k1}, \dots, y_{kK})$. The amortized auxiliary model can then be obtained as a modification of Eq. 20: $p(\epsilon_k \mid \upsilon_k) = \mathcal{N}\!\left(m_k, \sigma^2 I\right)$ with

$$\epsilon_k \mid \upsilon_k = B^{(k)}[\mathbf{y}_k] + \sum_{j=1}^{K} a_j \odot \upsilon_j + a_0 \odot \xi_k,$$

where the $B^{(k)}$ are learnable affine transformations (i.e. linear layers in deep learning lingo).
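A minimal sketch of the backward auxiliary coupling of Eq. 20 follows. The names are illustrative, and the softmax is just one possible way to keep the weights nonnegative and summing to one (Eq. 20 also allows element-wise vector weights, replaced here by scalars for brevity):

```python
import torch

def backward_auxiliary_sample(children_aux, logits):
    # Eq. 20: eps_k = sum_j a_j * v_j + a_0 * xi_k, where the v_j are the
    # auxiliary variables of the children of x_k in the original graph
    # (its parents after arrow reversal) and xi_k is fresh Gaussian noise.
    a = torch.softmax(logits, dim=0)        # K child weights plus 1 noise weight
    xi = torch.randn_like(children_aux[0])
    eps = a[-1] * xi
    for a_j, v_j in zip(a[:-1], children_aux):
        eps = eps + a_j * v_j
    return eps

# Two children with 4-dimensional auxiliaries; logits = (a_1, a_2, a_0) pre-softmax.
eps_k = backward_auxiliary_sample([torch.randn(4), torch.randn(4)],
                                  logits=torch.zeros(3))
```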
Table 1. Predictive ("Pred") and latent log-likelihood (forward KL) of variational timeseries models for the BR, LZ, PD and RNN tasks with regression ("-r") and classification ("-c") emissions. Columns: CF, ASVI, MF, GF, MVN and CF (non-res). Errors are SEM estimated over 10 repetitions.
Figure 4. Visualization of the backward amortization algorithm, showing the auxiliary graph with forward-pass connections, auxiliary connections, amortization inputs, latent nodes, observable nodes and auxiliary nodes.
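To illustrate the amortized modification of Eq. 20, the hypothetical module below shifts the auxiliary mean by an affine map of the locally observed data, playing the role of the learnable $B^{(k)}$; the actual architecture details are in the released code:

```python
import torch

class AmortizedAuxiliary(torch.nn.Module):
    # eps_k = B^(k) y_k + sum_j a_j * v_j + a_0 * xi_k: the observations of the
    # children of x_k shift the mean of the auxiliary variable, so data
    # propagates backward through the linear Gaussian auxiliary network.
    def __init__(self, obs_dim, aux_dim, num_children):
        super().__init__()
        self.B = torch.nn.Linear(obs_dim, aux_dim)                # learnable B^(k)
        self.logits = torch.nn.Parameter(torch.zeros(num_children + 1))

    def forward(self, y_k, children_aux):
        a = torch.softmax(self.logits, dim=0)                     # convex weights
        eps = self.B(y_k) + a[-1] * torch.randn(self.B.out_features)
        for a_j, v_j in zip(a[:-1], children_aux):
            eps = eps + a_j * v_j
        return eps

module = AmortizedAuxiliary(obs_dim=2, aux_dim=4, num_children=2)
eps_k = module(torch.randn(2), [torch.randn(4), torch.randn(4)])
```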
5. Experiments
Our experiments are divided into three sections. In the first section we focus on highly structured timeseries problems exhibiting linear and non-linear dynamics. Bayesian problems of this nature are relevant for several scientific fields such as physics, chemistry, biology, neuroscience and population dynamics. In the second section we compare the capacity of several variational programs to model dependencies arising from colliders in the graphical structure. In fact, the inability to model collider dependencies was the main limitation of the original convex-update ASVI family (Ambrogioni et al., 2021). Finally, in the third section we test the performance of the automatic amortization scheme. All the code used in the experiments is included in the supplementary material and documented in Supplementary A.
Architectures: In all experiments, the CF architectures were comprised of three highway flow blocks with softplus activation functions in each block except for the last, which had linear activations. CF programs use an independent network of this form for each latent variable in the program. Each variable was supplemented with auxiliary variables; the width of each network was therefore equal to the dimensionality of the variable plus the auxiliary dimensionality $D$. Weights and biases were initialized from centered normal distributions with a fixed scale. The $\lambda$ variable was defined independently for each network as the logistic sigmoid of a learnable parameter $l$, which was initialized at a large positive value in order to keep the variational program close to the input program. For each variable, the posterior auxiliary distributions $r(\epsilon)$ (see Eq. 19) were spherical normals parameterized by mean and standard deviation (i.e. Gaussian mean field).

Figure 5. Example of qualitative results in the Lorenz system (first coordinate), comparing CF and ASVI on the regression and classification tasks. The gray shaded area is observed through noisy measurements. The dashed lines are the ground-truth timeseries while the thin colored lines are samples from the trained variational program.
We consider discretized SDEs, defined by conditional densities of the form:

$$\rho_t(x_{t+1} \mid x_t) = \mathcal{N}\!\left( x_{t+1};\ \mu(x_t, t)\, \mathrm{d}t,\ \sigma^2(x_t, t)\, \mathrm{d}t \right), \qquad (21)$$

where $\mu(x_t, t)$ is the drift function and $\sigma(x_t, t)$ is the volatility function. Specifically, we used Brownian motions (BR: $\mu(x) = x$, $\sigma(x) = 1$), Lorenz dynamical systems (LZ: $\mu(x_1, x_2, x_3) = (10(x_2 - x_1),\ x_1(28 - x_3) - x_2,\ x_1 x_2 - (8/3) x_3)$, $\sigma = 2$), Lotka-Volterra population dynamics models (PD: $\mu(x_1, x_2) = (a\, x_1 - b\, x_1 x_2,\ c\, x_1 x_2 - d\, x_2)$ with coefficient values given in Supplementary B, $\sigma = 3$) and recurrent neural dynamics with randomized weights (RNN, see Supplementary B). The initial distributions were spherical Gaussians (see Supplementary B for the parameters). For each dynamical model we generated noisy observations using Gaussian emission models (denoted with an "r" in the tables):

$$p(y_t \mid x_t) = \mathcal{N}\!\left( [x_t]_1,\ \sigma_{lk}^2 \right) \qquad (22)$$

or Bernoulli logistic models ("c" in the tables):

$$p(y_t \mid x_t) = \text{Bernoulli}\!\left( g(k\, [x_t]_1) \right), \qquad (23)$$

where $[x_t]_1$ denotes the first entry of the vector $x_t$, $\sigma_{lk}$ is the standard deviation of the Gaussian emission model, $g(\cdot)$ is the logistic sigmoid function and $k$ is a gain factor. All the numeric values of the parameters in the different experiments are given in Supplementary B.

For each task we evaluate the performance of the cascading flows programs (CF) against a suite of baseline variational posteriors, including convex-update ASVI (Ambrogioni et al., 2021), mean field (MF) and multivariate normal (MN) (Kucukelbir et al., 2017). We also compare our new cascading flows family with the more conventional global normalizing flow (GF) approach, where a single diffeomorphism transforms all the latent variables in the probabilistic program into a spherical normal distribution (Rezende and Mohamed, 2015). For the sake of comparison, we adopt an architecture identical to our highway flow model (including auxiliary variables) except for the absence of the highway gates. Note however that the global architecture has much greater width, as it is applied to all variables in the program. Finally, we compare the performance of CF with a modified version of CF that does not use highway gates.

Ground-truth multivariate timeseries $\tilde{\mathbf{x}} = (\tilde{x}_1, \dots, \tilde{x}_T)$ were sampled from the generative model together with simulated first-half observations $\mathbf{y}_{1:T/2} = (y_1, \dots, y_{T/2})$ and second-half observations $\mathbf{y}_{T/2:T} = (y_{T/2+1}, \dots, y_T)$. All the variational models were trained conditioned only on the first-half observations. Performance was assessed using two metrics. The first metric is the average marginal log-probability of the ground-truth given the variational posterior: $\frac{1}{TJ} \sum_t^T \sum_j^J \log q([\tilde{x}_t]_j \mid \mathbf{y}_{1:T/2})$, where $J$ is the dimensionality of the ground-truth vector $\tilde{x}_t$. The marginal posterior density $q([\tilde{x}_t]_j \mid \mathbf{y}_{1:T/2})$ was estimated from samples drawn from the variational program using Gaussian kernel density estimation with bandwidth $1.06\, \hat{\sigma} N^{-1/5}$, where $\hat{\sigma}$ is the empirical standard deviation of the samples. Our second metric is $\log p(\mathbf{y}_{T/2:T} \mid \mathbf{y}_{1:T/2})$: the predictive log-probability of the ground-truth observations in the second half of the timeseries given the observations in the first half, estimated using kernel density estimation from predicted observation samples drawn from the exact likelihood conditioned on the samples from the variational program. Each experiment was repeated 10 times. In each repetition, all the variational programs were re-trained (for enough iterations to ensure convergence in all methods) given newly sampled ground-truth timeseries and observations. The reported results are the mean and SEM of the two metrics over these repetitions.

Table 2. Multivariate latent log-likelihood (forward KL) of variational binary tree models (Linear-2, Linear-4, Tanh-2 and Tanh-4). Columns: CF, ASVI, MF, GF and MN. Errors are SEM estimated over 15 repetitions.
Results: The results are given in Table 1. The cascading flows family (CF) outperforms all other approaches by a large margin in all non-linear tasks. Convex-update ASVI consistently has the second highest performance, easily outperforming the highly parameterized global normalizing flow (GF). This result is consistent with (Ambrogioni et al., 2021), where ASVI was shown to outperform inverse autoregressive flow architectures (Kingma et al., 2016). Finally, the CF model without highway gates (CF non-res) has very low performance, proving the crucial importance of our new highway flow architecture. A qualitative comparison (LZ-r and LZ-c) between CF and ASVI is shown in Fig. 5.
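For concreteness, here is how ground-truth data of the kind used above can be generated: a standard Euler-Maruyama rollout matching the form of Eq. 21 with the Lorenz drift, followed by the Gaussian emission model of Eq. 22. This is our own sketch; the time step, initialization and noise values below are placeholders, with the paper's exact settings given in Supplementary B:

```python
import torch

def lorenz_drift(x):
    # Standard Lorenz system with parameters (10, 28, 8/3), as in the LZ tasks.
    x1, x2, x3 = x
    return torch.stack([10.0 * (x2 - x1),
                        x1 * (28.0 - x3) - x2,
                        x1 * x2 - (8.0 / 3.0) * x3])

def simulate(drift, sigma, x0, T, dt, obs_scale):
    # Discretized SDE rollout plus noisy observations of the first coordinate.
    xs = [x0]
    for _ in range(T - 1):
        x = xs[-1]
        xs.append(x + drift(x) * dt + sigma * (dt ** 0.5) * torch.randn_like(x))
    xs = torch.stack(xs)
    ys = xs[:, 0] + obs_scale * torch.randn(T)   # Eq. 22 emission model
    return xs, ys

xs, ys = simulate(lorenz_drift, sigma=2.0, x0=torch.full((3,), 3.0),
                  T=40, dt=0.02, obs_scale=3.0)
```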
In our second experiment we test the capability of the variational programs to model statistical dependencies arising from colliders. We tested this in a Gaussian binary tree model where the mean of a scalar variable $x_j^d$ in the $d$-th layer is a function of two variables in the $(d-1)$-th layer: $x_j^d \sim \mathcal{N}\!\left( \text{link}(\pi_{j,1}^{d-1}, \pi_{j,2}^{d-1}),\ \sigma^2 \right)$, where $\pi_{j,1}^{d-1}$ and $\pi_{j,2}^{d-1}$ are the two parents of $x_j^d$. All the $0$-th layer variables have zero mean. The inference problem consists of observing the only variable in the last layer and computing the posterior of all the other variables. We considered settings with linear coupling $\text{link}(x, y) = x - y$ and tanh coupling $\text{link}(x, y) = \tanh(x) - \tanh(y)$, with 2 and 4 layers. Performance was estimated using the posterior log-probability of the ground-truth sample estimated using a multivariate Gaussian approximation that matches the empirical mean and covariance matrix of samples drawn from the trained variational program. We compared performance with ASVI, mean-field (MF), global flow (GF) and multivariate normal (MN). Note that neither ASVI nor MF can model collider dependencies. Full details are given in Supplementary C. The reported results are the mean and SEM over 15 repetitions.

Table 3. Latent log-likelihood (forward KL) of amortized timeseries models (PD and LZ). Columns: CF, MF and GF. Errors are SEM estimated over 50 repetitions.
Results: The results are given in Table 2. The CF model always outperforms ASVI and MF, proving that it can capture collider dependencies. CF has the highest performance in shallow models but lower performance than GF and MN in deeper models.
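The Gaussian binary tree generative model can be sketched as follows (our own reconstruction from the description above, under the assumption that layer 0 has $2^{\text{depth}}$ zero-mean roots and each subsequent layer halves until a single observable variable remains):

```python
import torch

def sample_binary_tree(depth, link, sigma=1.0):
    # Layer 0: zero-mean Gaussian roots; each later layer combines
    # consecutive pairs of parents through the link function plus noise.
    layers = [sigma * torch.randn(2 ** depth)]
    for _ in range(depth):
        parents = layers[-1]
        means = link(parents[0::2], parents[1::2])   # pair up the parents
        layers.append(means + sigma * torch.randn_like(means))
    return layers                                    # layers[-1] is observed

linear_link = lambda x, y: x - y
tanh_link = lambda x, y: torch.tanh(x) - torch.tanh(y)
layers = sample_binary_tree(depth=4, link=tanh_link)
```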
Finally, we evaluate the performance of the amortization scheme in non-linear timeseries inference problems. We used the population dynamics (PD) and the Lorenz (LZ) systems. The CF program was amortized using the approach described in Sec. 4.5. The MF and normalizing flow baselines were amortized using inference networks (see Supplementary D). All the details were identical to the timeseries experiments except for the fact that all time points were observed. Performance was again quantified as the average marginal log-probability of the ground-truth, kernel density estimated from the samples from the amortized variational programs. Each amortized program was trained once and tested on 50 newly sampled timeseries/observations pairs. The reported results are the mean and SEM of these repetitions.
Results: The results are reported in Table 3. As in the non-amortized case, CF greatly outperforms both MF and GF in both timeseries inference problems.
6. Discussion
Our CF approach solves the two main limitations of ASVI, as it can model collider dependencies and can be amortized automatically. We show that cascading flows programs have extremely high performance in structured inference problems. However, their ability to capture deep collider dependencies is lower than that of normalizing flows. In our opinion, designing principled ways to model collider dependencies is a promising research direction.
References
B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. Statistical Relational Learning, page 373, 2007.

T. Sato. PRISM: A symbolic-statistical modeling language. In Proceedings of the International Joint Conference on Artificial Intelligence, 1997.

K. Kersting and L. De Raedt. Bayesian logic programming: Theory and tool. Statistical Relational Learning, page 291, 2007.

A. Pfeffer. IBAL: A probabilistic rational programming language. In Proceedings of the International Joint Conference on Artificial Intelligence, 2001.

S. Park, F. Pfenning, and S. Thrun. A probabilistic language based upon sampling functions. ACM SIGPLAN Notices, 40(1):171–182, 2005.

N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: A language for generative models. arXiv preprint arXiv:1206.3255, 2012.

D. Wingate, A. Stuhlmüller, and N. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2011.

A. Patil, D. Huard, and C. J. Fonnesbeck. PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software, 35(4):1, 2010.

J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous. TensorFlow Distributions. arXiv preprint arXiv:1711.10604, 2017.

E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978, 2019.

D. Tran, M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, and D. M. Blei. Deep probabilistic programming. arXiv preprint arXiv:1701.03757, 2017.

D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.

C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference engine for Bayesian networks. Neural Information Processing Systems, 15:793–800, 2002.

A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430–474, 2017.

C. Bishop and J. Winn. Structured variational distributions in VIBES. In International Conference on Artificial Intelligence and Statistics, 2003.

D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.

G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

L. Ambrogioni, K. Lin, E. Fetig, S. Vikram, M. Hinne, D. Moore, and M. van Gerven. Automatic structured variational inference. In International Conference on Artificial Intelligence and Statistics, 2021.

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Neural Information Processing Systems, 2020.

J. Winn and C. M. Bishop. Variational message passing. Journal of Machine Learning Research, 6(Apr):661–694, 2005.

D. Wingate and T. Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.

M. D. Hoffman and D. M. Blei. Structured stochastic variational inference. In Proceedings of Artificial Intelligence and Statistics, pages 361–369, 2015.

D. Tran, D. Blei, and E. M. Airoldi. Copula variational inference. In Neural Information Processing Systems, 2015.

R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In International Conference on Machine Learning, pages 324–333, 2016.

D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Neural Information Processing Systems, 2016.

C. Weilbach, B. Beronov, F. Wood, and W. Harvey. Structured conditional continuous normalizing flows for efficient amortized inference in graphical models. In International Conference on Artificial Intelligence and Statistics, pages 4441–4451. PMLR, 2020.

R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

R. T. Q. Chen, J. Behrmann, D. K. Duvenaud, and J. Jacobsen. Residual flows for invertible generative modeling. In Neural Information Processing Systems, pages 9916–9926, 2019.

E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural ODEs. In Neural Information Processing Systems, pages 3140–3150, 2019.

S. R. Eddy. Hidden Markov models. Current Opinion in Structural Biology, 6(3):361–365, 1996.

N. Foti, J. Xu, D. Laird, and E. Fox. Stochastic variational inference for hidden Markov models. In Neural Information Processing Systems, 2014.

M. Johnson and A. Willsky. Stochastic variational inference for Bayesian time series models. In International Conference on Machine Learning, 2014.

M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.

M. Fortunato, C. Blundell, and O. Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
A. Experiments
The experiments were implemented in Python 3.8 using the packages PyTorch, NumPy and Matplotlib. The implementation code is contained in the "modules" folder. The probabilistic program objects are defined in the file "models.py" and the deep architectures are in "networks.py". The experiments can be run from the files "timeseries experiment.py", "collider linear experiments.py", "collider tanh experiments.py" and "amortized experiments.py" in the "experiments" folder.
B. Models
In the experiments, we use the following dynamical models.
B.1. Brownian motion (BR)
Dimensionality: J = 1
Drift: μ(x) = x
Diffusion: σ(x) = 1
Initial density: p(x₀) = N(0, ·)
Time horizon: T = 40
Time step: dt = 1
Regression noise: σ_lk = 1
Classification gain: k = 2

B.2. Lorenz system (LZ)
Dimensionality: J = 3
Drift: μ(x₁, x₂, x₃) = (10(x₂ − x₁), x₁(28 − x₃) − x₂, x₁x₂ − (8/3)x₃)
Diffusion: σ(x) = 2
Initial density: p(x₀) = N(3, ·)
Time horizon: T = 40
Time step: dt = 1
Regression noise: σ_lk = 3
Classification gain: k = 2

B.3. Population dynamics (PD)
Dimensionality: J = 2
Drift: μ(x₁, x₂) = (ReLU(a x₁ − b x₁x₂), ReLU(c x₁x₂ − d x₂)), with the coefficients of the main text
Diffusion: σ(x) = 2
Initial density: p(x₀) = N(0, ·)
Time horizon: T = 100
Time step: dt = 0.·
Regression noise: σ_lk = 3
Classification gain: k = 2

Note: The ReLU activation was added to avoid instabilities arising from negative values.
B.4. Recurrent neural network (RNN)
Dimensionality: J = 3
Drift: μ(x) = tanh(W₃ tanh(W₂ tanh(W₁ x + b₁) + b₂))
Diffusion: σ(x) = 0.·
Initial density: p(x₀) = N(0, ·)
Time horizon: T = 40
Time step: dt = 0.·
Regression noise: σ_lk = 1
Classification gain: k = 2

Note: W₁, W₂ and W₃ are weight matrices and b₁ and b₂ are bias vectors (one 5-dimensional and one 2-dimensional, matching the hidden layer widths). All the entries of these quantities are sampled from normal distributions with mean 0 and a fixed standard deviation (SD). These parameters were re-sampled for every repetition of the experiment.

C. Collider dependencies