Automatic variational inference with cascading flows
Luca Ambrogioni, Gianluigi Silvestri, Marcel van Gerven

Abstract
The automation of probabilistic reasoning is one of the primary aims of machine learning. Recently, the confluence of variational inference and deep learning has led to powerful and flexible automatic inference methods that can be trained by stochastic gradient descent. In particular, normalizing flows are highly parameterized deep models that can fit arbitrarily complex posterior densities. However, normalizing flows struggle in highly structured probabilistic programs as they need to relearn the forward-pass of the program. Automatic structured variational inference (ASVI) remedies this problem by constructing variational programs that embed the forward-pass. Here, we combine the flexibility of normalizing flows and the prior-embedding property of ASVI in a new family of variational programs, which we named cascading flows. A cascading flows program interposes a newly designed highway flow architecture in between the conditional distributions of the prior program so as to steer it toward the observed data. These programs can be constructed automatically from an input probabilistic program and can also be amortized automatically. We evaluate the performance of the new variational programs in a series of structured inference problems. We find that cascading flows have much higher performance than both normalizing flows and ASVI in a large set of structured inference problems.
1. Introduction
The aim of probabilistic programming is to provide a fully automated software system for statistical inference on arbitrary user-specified probabilistic models (also referred to as probabilistic programs) (Milch et al., 2007; Sato, 1997; Kersting and De Raedt, 2007; Pfeffer, 2001; Park et al., 2005).

*Equal contribution. Donders Centre for Cognition, Radboud University, Netherlands; One Planet, Netherlands. Correspondence to: Luca Ambrogioni.
2. Related work
Automatic algorithms for VI date back to VIBES (Bishop et al., 2002; Bishop and Winn, 2003). These older works are based on self-consistency equations and variational message passing (Winn and Bishop, 2005). The use of gradient-based automatic VI for probabilistic programming was pioneered in (Wingate and Weber, 2013) using a mean field approach. Automatic differentiation VI (ADVI) perfected this approach by exploiting the automatic-differentiation tools developed in the deep learning community (Kucukelbir et al., 2017). ADVI uses fixed invertible transformations to map each variable into an unbounded domain where the density can be modeled with either univariate or multivariate normal distributions. Structured stochastic VI (Hoffman and Blei, 2015) exploits the local conjugate update rules of locally conjugate models. This has the major benefit of preserving the local structure of the probabilistic program, but it severely limits the class of possible input programs since local conjugacy is a stringent condition. Automatic structured VI (ASVI) lifts this constraint as it applies a trainable convex-update rule to non-conjugate links (Ambrogioni et al., 2021). Several other papers introduced new techniques for constructing structured variational programs. Copula VI introduces the use of vine copulas to model the statistical coupling between the latent variables in the program (Tran et al., 2015). Similarly, hierarchical VI uses auxiliary latent variables to achieve the same goal (Ranganath et al., 2016). Neither of those approaches is truly automatic, however, as they do not prescribe a unique way to construct the coupling functions.

Normalizing flows are very expressive and highly parameterized models that use deep learning components to approximate complex multivariate densities (Rezende and Mohamed, 2015). Normalizing flows can approximate arbitrarily complex densities and can therefore be turned into powerful automatic VI methods (Rezende and Mohamed, 2015; Kingma et al., 2016). However, conventional normalizing flows do not exploit the input probabilistic program. Structured conditional continuous normalizing flows are a newly introduced class of flows constrained to have the same conditional independence structure as the true posterior (Weilbach et al., 2020). However, these flow architectures only inherit the graphical structure of the input program, without incorporating the forward pass (i.e. the form of the conditional distributions and link functions).

Modern deep probabilistic programming frameworks provide automatic implementations of many variational programs. For example, PyMC (Patil et al., 2010) provides an implementation of ADVI, Pyro (Bingham et al., 2019) implements the automatic construction of mean field, multivariate normal and normalizing flow programs (i.e. variational guides), and TensorFlow Probability also offers an automatic implementation of ASVI (Dillon et al., 2017; Ambrogioni et al., 2021).

Our new highway flow architecture is a key component of our new approach and it is inspired by highway networks (Srivastava et al., 2015). To the best of our knowledge, highway flows are the first form of residual-like flows with a tractable Jacobian determinant that can be expressed in closed form. Existing residual flows express the log determinant of the Jacobian as an infinite series that needs to be estimated using a "Russian roulette" estimator (Chen et al., 2019).
We achieve tractability by applying a highway operator to each single layer of the network. Our use of auxiliary variables is similar to the approach adopted in (Dupont et al., 2019) and (Weilbach et al., 2020). The bound we use for training the augmented model was initially derived in (Ranganath et al., 2016). To the best of our knowledge, we are the first to exploit the statistics of the auxiliary variables to implement inference amortization. Structured variational programs have the clearest benefits in timeseries analysis problems (Eddy, 1996; Foti et al., 2014; Johnson and Willsky, 2014; Karl et al., 2016; Fortunato et al., 2017).
3. Preliminaries
In this section we introduce the notation used in the rest of the paper and discuss in detail the methods that form the basis for our work.
We denote the values of (potentially multivariate) random variables using lowercase notation and arrays of random variables using boldface notation. We consider probabilistic programs constructed as a chain of conditional probabilities over an array of random variables $\mathbf{x} = (x_1, \dots, x_N)$. We denote the array of values of the parents of the $j$-th variable as $\pi_j \subseteq \{x_i\}_{i \neq j}$, which is a sub-array of parent variables such that the resulting graphical model is a directed acyclic graph. Using this notation, we can express the joint density of a probabilistic program as follows:

$$p(\mathbf{x}) = \prod_{j=1}^{N} \rho_j(x_j \mid \theta_j(\pi_j)), \qquad (1)$$

where $\theta_j(\pi_j)$ is a link function that maps the values of the parents of the $j$-th variable to the parameters of its density $\rho_j(\cdot \mid \cdot)$. For example, if the density is Gaussian the link function outputs the value of its mean and scale given the values of the parents. Note that $\theta_j$ is a constant when the array of parents is empty.

An automatic structured variational inference method provides an algorithm that takes a probabilistic program as input and outputs a variational program $q(\mathbf{x})$. Convex-update variational programs (Ambrogioni et al., 2021) are defined by the following transformation:

$$p(\mathbf{x}) \overset{\text{CU}}{\mapsto} q(\mathbf{x}) = \prod_{j}^{N} U_{\lambda_j}^{\alpha_j}\, \rho_j(x_j \mid \theta_j(\pi_j)), \qquad (2)$$

with convex-update operator

$$U_{\lambda_j}^{\alpha_j}\, \rho_j(x_j \mid \theta_j(\pi_j)) = \rho_j\!\left(x_j \mid \lambda_j \odot \theta_j(\pi_j) + (1 - \lambda_j) \odot \alpha_j\right), \qquad (3)$$

where $\lambda_j$ and $\alpha_j$ are (potentially vector-valued) learnable parameters and $\odot$ is the element-wise product. This reduces to the prior probabilistic program for $\lambda_j \to 1$ and to a mean field variational program for $\lambda_j \to 0$. Convex-update ASVI has the advantage of preserving the forward-pass of the probabilistic program, which is often a major source of statistical coupling in the posterior distribution. The variational program can be trained to fit the posterior by maximizing the evidence lower bound (ELBO) by stochastic gradient descent (SGD) on the differentiable parameters $\lambda_j$ and $\alpha_j$.

Normalizing flows express the density of arbitrary multivariate distributions as the push-forward of a known base distribution through a differentiable invertible function (i.e. a diffeomorphism). The base distribution is usually a spherical normal distribution, hence the "normalizing" in the name. Consider a $d$-variate probability density $p(z)$ and a diffeomorphism $\Psi: \mathbb{R}^d \to \mathbb{R}^d$. We can derive the density of the new random variable $x = \Psi(z)$ using the change of variables formula

$$p_X(x) = \left| \det J(\Psi^{-1}(x)) \right|^{-1} p(\Psi^{-1}(x)), \qquad (4)$$

where $\Psi^{-1}$ is the inverse transformation and $J(z)$ is the Jacobi matrix of $\Psi$. We can write this more compactly using the push-forward operator $T_\Psi$ associated to $\Psi$ that maps the density $p(z)$ to the density $p_X(x)$ given in Eq. 4: $p_X(x) = T_\Psi[p(\cdot)](x)$.

Now consider a flexible family of functions $\Psi_w(x)$ parameterized by the "weights" $w$. We can use Eq. 4 to approximate arbitrarily complicated target densities by training $w$ by SGD. Given a probabilistic program $p(\mathbf{x}, \mathbf{y})$ with $\mathbf{y}$ observed, we can approximate the posterior $p(\mathbf{x} \mid \mathbf{y})$ by maximizing the evidence lower bound:

$$\text{ELBO}(w) = \mathbb{E}_w\!\left[ \log \frac{p(\mathbf{x}, \mathbf{y})}{\left| \det J_w(\Psi_w^{-1}(\mathbf{x})) \right|^{-1} p(\Psi_w^{-1}(\mathbf{x}))} \right], \qquad (5)$$

where the expectation is taken with respect to the transformed distribution $T_{\Psi_w}[p(\cdot)](\mathbf{x})$.
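As a quick numerical illustration of Eq. 4, the following self-contained sketch (ours, not from the paper's released code) pushes a standard normal through an affine diffeomorphism and checks the change of variables formula against the analytic density of the transformed variable:

```python
import torch

# Diffeomorphism Psi(z) = a*z + b, so Psi^{-1}(x) = (x - b)/a and |det J| = |a|.
a, b = 2.0, 1.0
base = torch.distributions.Normal(0.0, 1.0)

def pushforward_log_prob(x):
    # log p_X(x) = log p(Psi^{-1}(x)) - log|det J(Psi^{-1}(x))|   (Eq. 4)
    z = (x - b) / a
    return base.log_prob(z) - torch.log(torch.tensor(abs(a)))

x = torch.tensor(3.0)
analytic = torch.distributions.Normal(b, abs(a)).log_prob(x)  # X ~ N(1, 2^2)
assert torch.allclose(pushforward_log_prob(x), analytic)
```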
Since $\Psi$ is differentiable, we can obtain the reparameterization gradient for the ELBO as follows:

$$\nabla_w \text{ELBO}(w) = \mathbb{E}\!\left[ \nabla_w \log \frac{p(\Psi_w(z), \mathbf{y}) \left| \det J_w(z) \right|}{p(z)} \right], \qquad (6)$$

where the expectation is now taken with respect to the fixed distribution $p(z)$. In the following we refer to this approach as global flow (GF), since the transformation is applied synchronously to all variables in the model.
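For concreteness, the sketch below estimates the ELBO of Eqs. 5–6 by Monte Carlo, sampling from the fixed base distribution and pushing the samples through the flow. The `flow` and `log_joint` interfaces are our own illustrative assumptions, not a specific library API:

```python
import torch

def elbo_reparameterized(flow, base_dist, log_joint, num_samples=64):
    # Monte Carlo estimate of Eq. 6: the expectation is over the fixed base
    # density p(z), so gradients flow through Psi_w(z) and log|det J_w(z)|.
    z = base_dist.sample((num_samples,))
    x, log_det = flow(z)                             # x = Psi_w(z), log|det J_w(z)|
    log_q = base_dist.log_prob(z).sum(-1) - log_det  # density of x under the flow
    return (log_joint(x) - log_q).mean()

# Training loop sketch: ascend the ELBO by SGD on the flow weights, e.g.
# optimizer.zero_grad(); loss = -elbo_reparameterized(...); loss.backward(); optimizer.step()
```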
4. Variational inference with cascading flows
Now that we have covered the relevant background, we will introduce our new approach, which combines ASVI with normalizing flows.
The convex-update parameterization poses strong constraints on the form of the conditional posteriors. On the other hand, normalizing flow programs do not embed the potentially very complex structure of the input program. In this paper we combine the two approaches by replacing the convex-update operator in Eq. 2 with the push-forward operator used in normalizing flows. More precisely, consider a family of diffeomorphisms $\Psi_w = (\Psi^1_{w_1}, \dots, \Psi^N_{w_N})$ depending on the array of parameters $w = (w_1, \dots, w_N)$. Using these transformations, we can define the cascading flows variational program associated with a program defined by Eq. 1:

$$p(\mathbf{x}) \overset{\text{CF}}{\mapsto} q_w(\mathbf{x}) = \prod_{j}^{N} T_{w_j}\!\left[ \rho_j(\cdot \mid \theta_j(\pi_j)) \right](x_j), \qquad (7)$$

where we denoted the push-forward operator induced by the diffeomorphism $\Psi^j_{w_j}$ (as defined by Eq. 4) as $T_{w_j}$ to simplify the notation. A cascading flows variational program can be trained by SGD using gradient estimators of the ELBO with the reparameterization given in Eq. 6.

The name cascading flows comes from the fact that, since the transformation is performed locally for each conditional distribution, its effects cascade down the variational model through the (transformed) conditional dependencies of the prior program. However, in order to preserve the forward-pass of the probabilistic program it is crucial to use a parameterized diffeomorphism that can be initialized around the identity function and whose deviation from the identity can be easily controlled. Specifically, we opt for transformations of the following form:

$$\Psi^j_{w_j}(x) = \gamma x + (1 - \gamma) f_j(x; w_j), \qquad (8)$$

where $\gamma \in (0, 1]$. The resulting architecture of a cascading flows variational program is visualized in Fig. 1. From the diagram it is clear that a cascading flow is constructed by inserting transformation layers within the architecture of the probabilistic program. Transformations of the form given in Eq. 8 with tractable Jacobians are not currently present in the normalizing flow literature. Therefore, in the next subsection we will introduce a new flow architecture of this form, which we named highway flow.

Here we introduce highway flow blocks gated by a learnable scalar gate parameter. As also visualized in Fig. 2, a highway flow block is comprised of three separate highway layers (see the code sketch following the equations below):
Figure 1. Diagram of a cascading flows architecture. A) Detail of the architecture of an input probabilistic program. B) Detail of the associated cascading flows architecture, with CF layers interposed between the program layers.
Figure 2. Diagram of a highway flow block (upper-triangular layer, lower-triangular layer and activation).
1. Upper triangular highway layer:

$$l_U(z; U, \lambda) = \lambda z + (1 - \lambda)(U z + b_U) \qquad (9)$$
$$\log \det J_U = \sum_k \log\!\left( \lambda + (1 - \lambda) U_{kk} \right) \qquad (10)$$

2. Lower triangular layer:

$$l_L(z; L, \lambda) = \lambda z + (1 - \lambda)(L z + b_L) \qquad (11)$$
$$\log \det J_L = \sum_k \log\!\left( \lambda + (1 - \lambda) L_{kk} \right) \qquad (12)$$

3. Highway activation functions:

$$f(z; \lambda) = \lambda z + (1 - \lambda) g(z) \qquad (13)$$
$$\log \det J_f = \sum_k \log\!\left( \lambda + (1 - \lambda) \frac{\mathrm{d}g(x_k)}{\mathrm{d}x} \right) \qquad (14)$$

where $U$ is an upper-triangular matrix with positive-valued diagonal, $L$ is a lower-triangular matrix with ones in the diagonal and $g(x)$ is a differentiable non-decreasing activation function. A highway flow layer is a composition of these three types of layers:

$$l_h(\cdot) = f \circ l_L \circ l_U(\cdot). \qquad (15)$$

A highway network is a sequence of $M$ highway layers with a common gate parameter and different weights and biases. Note that a highway flow can be expressed in the form given in Eq. 8 by defining $\gamma$ as follows:

$$\gamma = \lambda^M, \qquad (16)$$

which is clearly equal to one (zero) when $\lambda$ is equal to one (zero).

The expressiveness of a normalizing flow is limited by the invertible nature of the transformation. This is particularly problematic for low-dimensional variables since the expressivity of many flow architectures depends on the dimensionality of the input (Papamakarios et al., 2019). This limitation can be a major shortcoming in cascading flows as they are particularly useful in highly structured problems characterized by a large number of coupled low-dimensional variables. Fortunately, we can remedy this limitation by introducing a set of auxiliary variables. For each variable $x_j$, we create a $D$-dimensional variable $\epsilon_j$ following a base distribution $p_j(\epsilon_j)$. We can now use an augmented diffeomorphism $\hat{\Psi}^j_{w_j}(x_j, \epsilon_j)$ and define the joint posterior over both $x_j$ and $\epsilon_j$:

$$q(x_j, \epsilon_j \mid \pi_j) = \hat{T}_{w_j}\!\left[ \rho_j(\cdot \mid \theta_j(\pi_j)) \, p_j(\cdot) \right](x_j, \epsilon_j), \qquad (17)$$

where the push-forward operator now transforms the (independent) joint density of $x_j$ and $\epsilon_j$. The conditional variational posterior is then obtained by marginalizing out $\epsilon_j$:

$$q(x_j \mid \pi_j) = \int q(x_j, \epsilon_j \mid \pi_j) \, \mathrm{d}\epsilon_j. \qquad (18)$$

This infinite mixture of flows is much more capable of modeling complex and potentially multi-modal distributions. The use of latent mixture models for VI was originally introduced in (Ranganath et al., 2016). Given a set of observations $\mathbf{y}$, the ELBO of any mixture variational probabilistic program can be lower bounded using Jensen's inequality:

$$\mathbb{E}_{\mathbf{x}}\!\left[ \log \frac{p(\mathbf{x}, \mathbf{y})}{\int q(\mathbf{x}, \boldsymbol{\epsilon}) \, \mathrm{d}\boldsymbol{\epsilon}} \right] \geq \underbrace{\mathbb{E}_{\mathbf{x}, \boldsymbol{\epsilon}}\!\left[ \log \frac{p(\mathbf{x}, \mathbf{y}) \, r(\boldsymbol{\epsilon})}{q(\mathbf{x}, \boldsymbol{\epsilon})} \right]}_{\text{Augmented ELBO}}, \qquad (19)$$

where $q(\boldsymbol{\epsilon}) = \int q(\mathbf{x}, \boldsymbol{\epsilon}) \, \mathrm{d}\mathbf{x}$ is the marginal variational posterior and $r(\boldsymbol{\epsilon})$ is an arbitrarily chosen distribution over the auxiliary variables. This result justifies the use of the augmented ELBO for training the variational posterior. The gap between the mixture ELBO (Eq. 19) and the augmented ELBO can be reduced by parameterizing $r(\boldsymbol{\epsilon})$ and training its parameters by maximizing the augmented ELBO so as to match $q(\boldsymbol{\epsilon})$.

The highway flow architecture can be easily adapted to the augmented variable space. Since there is no need to gate the flow of the auxiliary variables, the scalar $\lambda$ in
Eq. 9, Eq. 11 and Eq. 13 should be replaced by a vector whose entries corresponding to the original variables are equal to $\lambda$ while the entries corresponding to the auxiliary variables are equal to zero.
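The following sketch implements one highway flow block (Eqs. 9–15) together with its closed-form log-Jacobian-determinant. Class, parameter names and initialization are our own illustrative assumptions, under a scalar gate shared by the three layers:

```python
import torch
import torch.nn.functional as F

class HighwayFlowBlock(torch.nn.Module):
    """One highway flow block (Eqs. 9-15) with a shared scalar gate."""

    def __init__(self, dim):
        super().__init__()
        self.dim = dim
        self.W_u = torch.nn.Parameter(0.01 * torch.randn(dim, dim))  # upper-tri weights
        self.b_u = torch.nn.Parameter(torch.zeros(dim))
        self.W_l = torch.nn.Parameter(0.01 * torch.randn(dim, dim))  # lower-tri weights
        self.b_l = torch.nn.Parameter(torch.zeros(dim))
        self.gate_logit = torch.nn.Parameter(torch.tensor(3.0))      # lambda = sigmoid(.)

    def forward(self, z):
        lam = torch.sigmoid(self.gate_logit)
        # 1. Upper-triangular layer with positive diagonal (Eqs. 9-10).
        diag_u = F.softplus(torch.diagonal(self.W_u))
        U = torch.triu(self.W_u, diagonal=1) + torch.diag(diag_u)
        z = lam * z + (1 - lam) * (z @ U.T + self.b_u)
        log_det = torch.log(lam + (1 - lam) * diag_u).sum()
        # 2. Lower-triangular layer with unit diagonal (Eqs. 11-12):
        #    each term is log(lam + (1 - lam) * 1) = 0, so no log-det contribution.
        L = torch.tril(self.W_l, diagonal=-1) + torch.eye(self.dim)
        z = lam * z + (1 - lam) * (z @ L.T + self.b_l)
        # 3. Gated monotone activation (Eqs. 13-14), here g = softplus, g' = sigmoid.
        out = lam * z + (1 - lam) * F.softplus(z)
        log_det = log_det + torch.log(lam + (1 - lam) * torch.sigmoid(z)).sum(-1)
        return out, log_det

block = HighwayFlowBlock(dim=3)
x, log_det = block(torch.randn(8, 3))  # batch of 8 samples with log|det J|
```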
Figure 3. Backward auxiliary coupling. A) Example graphical model. B) Cascading flows model coupled with a reversed local linear Gaussian model.
So far, we outlined a model that has the same conditional independence structure as the prior and is therefore incapable of modeling dependencies arising from colliders. Fortunately, the local auxiliary variables that we introduced in order to increase the flexibility of the local flows can assume a second role as "input ports" that can induce non-local couplings. In fact, by coupling their auxiliary variables we can induce statistical dependencies between any subset of variables in the model. Importantly, this coupling can take a very simple form since additional complexity can be modeled by the highway flows. We will now introduce a local auxiliary model with a non-trivial conditional independence structure inspired by the forward-backward algorithm. The graphical structure of the posterior probabilistic program is a sub-graph of the moralization of the prior graph (Bishop, 2006). Therefore, in order to be able to capture all the dependencies in the posterior, it suffices to couple the auxiliary variables by reversing the arrows of the original prior model. In fact, all the parents of every node get coupled by the reversed arrows. As an example, consider the following probabilistic model (Fig. 3A): $\rho(x_4 \mid x_2, x_3)\, \rho(x_3)\, \rho(x_2 \mid x_1)\, \rho(x_1)$. We couple the auxiliary variables by reversing the arrows, as formalized in the graphical model $p(\epsilon_4)\, p(\epsilon_3 \mid \epsilon_4)\, p(\epsilon_2 \mid \epsilon_4)\, p(\epsilon_1 \mid \epsilon_2)$. The graphical structure of the resulting augmented model is shown in Fig. 3B. The auxiliary variables can be coupled in a very simple manner since the diffeomorphisms can add arbitrary complexity to the final joint density. Here we use a simple linear normal model:

$$\epsilon_k \mid \upsilon_k = \sum_{j=1}^{K} a_j \odot \upsilon_j + a_0 \odot \xi_k, \qquad (20)$$

where $a_j \geq 0$ and $\sum_n a_n = 1$. In this expression, $\xi_k$ is a standard Gaussian vector and $\upsilon_k = (\upsilon_1, \dots, \upsilon_K)$ is an array containing all the parents' auxiliary variables (the auxiliaries of the children of $x_k$ in the original graph).

We will now discuss a strategy to implement automatic inference amortization in cascading flows models. The problem of inference amortization is closely connected to the problem of modeling collider dependencies, as the latter arise from the backward messages originating from observed nodes (Bishop, 2006). It is therefore not a coincidence that we can implement both collider dependencies and amortization by shifting the statistics of the local auxiliary variables. This results in a very parsimonious amortization scheme where data streams are fed locally to the various nodes and propagate backward through the linear network of auxiliary Gaussian variables. We denote the set of auxiliary variables associated to the children of $x_k$ in the input probabilistic program as $\upsilon_k$. Furthermore, we denote the observed value of $x_k$ as $y_k$ and the set of observed values of the children of $x_k$ as $\mathbf{y}_k = (y_{k1}, \dots, y_{kK})$. The amortized auxiliary model can then be obtained as a modification of Eq. 20: $p(\epsilon_k \mid \upsilon_k) = \mathcal{N}\!\left(m_k, \sigma^2 I\right)$ with

$$\epsilon_k \mid \upsilon_k = B^{(k)}[\mathbf{y}_k] + \sum_{j=1}^{K} a_j \odot \upsilon_j + a_0 \odot \xi_k,$$

where the $B^{(k)}$ are learnable affine transformations (i.e. linear layers in deep learning lingo).
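A minimal sketch of the backward auxiliary coupling of Eq. 20 follows. The names are illustrative, and the softmax is just one possible way to keep the weights nonnegative and summing to one (Eq. 20 also allows element-wise vector weights, replaced here by scalars for brevity):

```python
import torch

def backward_auxiliary_sample(children_aux, logits):
    # Eq. 20: eps_k = sum_j a_j * v_j + a_0 * xi_k, where the v_j are the
    # auxiliary variables of the children of x_k in the original graph
    # (its parents after arrow reversal) and xi_k is fresh Gaussian noise.
    a = torch.softmax(logits, dim=0)        # K child weights plus 1 noise weight
    xi = torch.randn_like(children_aux[0])
    eps = a[-1] * xi
    for a_j, v_j in zip(a[:-1], children_aux):
        eps = eps + a_j * v_j
    return eps

# Two children with 4-dimensional auxiliaries; logits = (a_1, a_2, a_0) pre-softmax.
eps_k = backward_auxiliary_sample([torch.randn(4), torch.randn(4)],
                                  logits=torch.zeros(3))
```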
Table 1. Predictive ("Pred") and latent log-likelihood (forward KL) of variational timeseries models for the BR, LZ, PD and RNN tasks with regression ("-r") and classification ("-c") emissions. Columns: CF, ASVI, MF, GF, MVN and CF (non-res). Errors are SEM estimated over 10 repetitions.
Figure 4. Visualization of the backward amortization algorithm, showing the auxiliary graph with forward-pass connections, auxiliary connections, amortization inputs, latent nodes, observable nodes and auxiliary nodes.
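To illustrate the amortized modification of Eq. 20, the hypothetical module below shifts the auxiliary mean by an affine map of the locally observed data, playing the role of the learnable $B^{(k)}$; the actual architecture details are in the released code:

```python
import torch

class AmortizedAuxiliary(torch.nn.Module):
    # eps_k = B^(k) y_k + sum_j a_j * v_j + a_0 * xi_k: the observations of the
    # children of x_k shift the mean of the auxiliary variable, so data
    # propagates backward through the linear Gaussian auxiliary network.
    def __init__(self, obs_dim, aux_dim, num_children):
        super().__init__()
        self.B = torch.nn.Linear(obs_dim, aux_dim)                # learnable B^(k)
        self.logits = torch.nn.Parameter(torch.zeros(num_children + 1))

    def forward(self, y_k, children_aux):
        a = torch.softmax(self.logits, dim=0)                     # convex weights
        eps = self.B(y_k) + a[-1] * torch.randn(self.B.out_features)
        for a_j, v_j in zip(a[:-1], children_aux):
            eps = eps + a_j * v_j
        return eps

module = AmortizedAuxiliary(obs_dim=2, aux_dim=4, num_children=2)
eps_k = module(torch.randn(2), [torch.randn(4), torch.randn(4)])
```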
5. Experiments
Our experiments are divided into three sections. In the first section we focus on highly structured timeseries problems exhibiting linear and non-linear dynamics. Bayesian problems of this nature are relevant for several scientific fields such as physics, chemistry, biology, neuroscience and population dynamics. In the second section we compare the capacity of several variational programs to model dependencies arising from colliders in the graphical structure. In fact, the inability to model collider dependencies was the main limitation of the original convex-update ASVI family (Ambrogioni et al., 2021). Finally, in the third section we test the performance of the automatic amortization scheme. All the code used in the experiments is included in the supplementary material and documented in Supplementary A.
Architectures: In all experiments, the CF architectures were comprised of three highway flow blocks with softplus activation functions in each block except for the last, which had linear activations. CF programs use an independent network of this form for each latent variable in the program. Each variable was supplemented with auxiliary variables; the width of each network was therefore equal to the dimensionality of the variable plus the auxiliary dimensionality $D$. Weights and biases were initialized from centered normal distributions with a fixed scale. The $\lambda$ variable was defined independently for each network as the logistic sigmoid of a learnable parameter $l$, which was initialized at a large positive value in order to keep the variational program close to the input program. For each variable, the posterior auxiliary distributions $r(\epsilon)$ (see Eq. 19) were spherical normals parameterized by mean and standard deviation (i.e. Gaussian mean field).

Figure 5. Example of qualitative results in the Lorenz system (first coordinate), comparing CF and ASVI on the regression and classification tasks. The gray shaded area is observed through noisy measurements. The dashed lines are the ground-truth timeseries while the thin colored lines are samples from the trained variational program.
We consider discretized SDEs, defined by conditional densities of the form:

$$\rho_t(x_{t+1} \mid x_t) = \mathcal{N}\!\left( x_{t+1};\ \mu(x_t, t)\, \mathrm{d}t,\ \sigma^2(x_t, t)\, \mathrm{d}t \right), \qquad (21)$$

where $\mu(x_t, t)$ is the drift function and $\sigma(x_t, t)$ is the volatility function. Specifically, we used Brownian motions (BR: $\mu(x) = x$, $\sigma(x) = 1$), Lorenz dynamical systems (LZ: $\mu(x_1, x_2, x_3) = (10(x_2 - x_1),\ x_1(28 - x_3) - x_2,\ x_1 x_2 - (8/3) x_3)$, $\sigma = 2$), Lotka-Volterra population dynamics models (PD: $\mu(x_1, x_2) = (a\, x_1 - b\, x_1 x_2,\ c\, x_1 x_2 - d\, x_2)$ with coefficient values given in Supplementary B, $\sigma = 3$) and recurrent neural dynamics with randomized weights (RNN, see Supplementary B). The initial distributions were spherical Gaussians (see Supplementary B for the parameters). For each dynamical model we generated noisy observations using Gaussian emission models (denoted with an "r" in the tables):

$$p(y_t \mid x_t) = \mathcal{N}\!\left( [x_t]_1,\ \sigma_{lk}^2 \right) \qquad (22)$$

or Bernoulli logistic models ("c" in the tables):

$$p(y_t \mid x_t) = \text{Bernoulli}\!\left( g(k\, [x_t]_1) \right), \qquad (23)$$

where $[x_t]_1$ denotes the first entry of the vector $x_t$, $\sigma_{lk}$ is the standard deviation of the Gaussian emission model, $g(\cdot)$ is the logistic sigmoid function and $k$ is a gain factor. All the numeric values of the parameters in the different experiments are given in Supplementary B.

For each task we evaluate the performance of the cascading flows programs (CF) against a suite of baseline variational posteriors, including convex-update ASVI (Ambrogioni et al., 2021), mean field (MF) and multivariate normal (MN) (Kucukelbir et al., 2017). We also compare our new cascading flows family with the more conventional global normalizing flow (GF) approach, where a single diffeomorphism transforms all the latent variables in the probabilistic program into a spherical normal distribution (Rezende and Mohamed, 2015). For the sake of comparison, we adopt an architecture identical to our highway flow model (including auxiliary variables) except for the absence of the highway gates. Note however that the global architecture has much greater width, as it is applied to all variables in the program. Finally, we compare the performance of CF with a modified version of CF that does not use highway gates.

Ground-truth multivariate timeseries $\tilde{\mathbf{x}} = (\tilde{x}_1, \dots, \tilde{x}_T)$ were sampled from the generative model together with simulated first-half observations $\mathbf{y}_{1:T/2} = (y_1, \dots, y_{T/2})$ and second-half observations $\mathbf{y}_{T/2:T} = (y_{T/2+1}, \dots, y_T)$. All the variational models were trained conditioned only on the first-half observations. Performance was assessed using two metrics. The first metric is the average marginal log-probability of the ground-truth given the variational posterior: $\frac{1}{TJ} \sum_t^T \sum_j^J \log q([\tilde{x}_t]_j \mid \mathbf{y}_{1:T/2})$, where $J$ is the dimensionality of the ground-truth vector $\tilde{x}_t$. The marginal posterior density $q([\tilde{x}_t]_j \mid \mathbf{y}_{1:T/2})$ was estimated from samples drawn from the variational program using Gaussian kernel density estimation with bandwidth $1.06\, \hat{\sigma} N^{-1/5}$, where $\hat{\sigma}$ is the empirical standard deviation of the samples. Our second metric is $\log p(\mathbf{y}_{T/2:T} \mid \mathbf{y}_{1:T/2})$: the predictive log-probability of the ground-truth observations in the second half of the timeseries given the observations in the first half, estimated using kernel density estimation from predicted observation samples drawn from the exact likelihood conditioned on the samples from the variational program. Each experiment was repeated 10 times. In each repetition, all the variational programs were re-trained (for enough iterations to ensure convergence in all methods) given newly sampled ground-truth timeseries and observations. The reported results are the mean and SEM of the two metrics over these repetitions.

Table 2. Multivariate latent log-likelihood (forward KL) of variational binary tree models (Linear-2, Linear-4, Tanh-2 and Tanh-4). Columns: CF, ASVI, MF, GF and MN. Errors are SEM estimated over 15 repetitions.
Results: The results are given in Table 1. The cascading flows family (CF) outperforms all other approaches by a large margin in all non-linear tasks. Convex-update ASVI consistently has the second highest performance, easily outperforming the highly parameterized global normalizing flow (GF). This result is consistent with (Ambrogioni et al., 2021), where ASVI was shown to outperform inverse autoregressive flow architectures (Kingma et al., 2016). Finally, the CF model without highway gates (CF non-res) has very low performance, proving the crucial importance of our new highway flow architecture. A qualitative comparison (LZ-r and LZ-c) between CF and ASVI is shown in Fig. 5.
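For concreteness, here is how ground-truth data of the kind used above can be generated: a standard Euler-Maruyama rollout matching the form of Eq. 21 with the Lorenz drift, followed by the Gaussian emission model of Eq. 22. This is our own sketch; the time step, initialization and noise values below are placeholders, with the paper's exact settings given in Supplementary B:

```python
import torch

def lorenz_drift(x):
    # Standard Lorenz system with parameters (10, 28, 8/3), as in the LZ tasks.
    x1, x2, x3 = x
    return torch.stack([10.0 * (x2 - x1),
                        x1 * (28.0 - x3) - x2,
                        x1 * x2 - (8.0 / 3.0) * x3])

def simulate(drift, sigma, x0, T, dt, obs_scale):
    # Discretized SDE rollout plus noisy observations of the first coordinate.
    xs = [x0]
    for _ in range(T - 1):
        x = xs[-1]
        xs.append(x + drift(x) * dt + sigma * (dt ** 0.5) * torch.randn_like(x))
    xs = torch.stack(xs)
    ys = xs[:, 0] + obs_scale * torch.randn(T)   # Eq. 22 emission model
    return xs, ys

xs, ys = simulate(lorenz_drift, sigma=2.0, x0=torch.full((3,), 3.0),
                  T=40, dt=0.02, obs_scale=3.0)
```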
In our second experiment we test the capability of the variational programs to model statistical dependencies arising from colliders. We tested this in a Gaussian binary tree model where the mean of a scalar variable $x_j^d$ in the $d$-th layer is a function of two variables in the $(d-1)$-th layer: $x_j^d \sim \mathcal{N}\!\left( \text{link}(\pi_{j,1}^{d-1}, \pi_{j,2}^{d-1}),\ \sigma^2 \right)$, where $\pi_{j,1}^{d-1}$ and $\pi_{j,2}^{d-1}$ are the two parents of $x_j^d$. All the $0$-th layer variables have zero mean. The inference problem consists of observing the only variable in the last layer and computing the posterior of all the other variables. We considered settings with linear coupling $\text{link}(x, y) = x - y$ and tanh coupling $\text{link}(x, y) = \tanh(x) - \tanh(y)$, with 2 and 4 layers. Performance was estimated using the posterior log-probability of the ground-truth sample estimated using a multivariate Gaussian approximation that matches the empirical mean and covariance matrix of samples drawn from the trained variational program. We compared performance with ASVI, mean-field (MF), global flow (GF) and multivariate normal (MN). Note that neither ASVI nor MF can model collider dependencies. Full details are given in Supplementary C. The reported results are the mean and SEM over 15 repetitions.

Table 3. Latent log-likelihood (forward KL) of amortized timeseries models (PD and LZ). Columns: CF, MF and GF. Errors are SEM estimated over 50 repetitions.
Results: The results are given in Table 2. The CF model always outperforms ASVI and MF, proving that it can capture collider dependencies. CF has the highest performance in shallow models but lower performance than GF and MN in deeper models.
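The Gaussian binary tree generative model can be sketched as follows (our own reconstruction from the description above, under the assumption that layer 0 has $2^{\text{depth}}$ zero-mean roots and each subsequent layer halves until a single observable variable remains):

```python
import torch

def sample_binary_tree(depth, link, sigma=1.0):
    # Layer 0: zero-mean Gaussian roots; each later layer combines
    # consecutive pairs of parents through the link function plus noise.
    layers = [sigma * torch.randn(2 ** depth)]
    for _ in range(depth):
        parents = layers[-1]
        means = link(parents[0::2], parents[1::2])   # pair up the parents
        layers.append(means + sigma * torch.randn_like(means))
    return layers                                    # layers[-1] is observed

linear_link = lambda x, y: x - y
tanh_link = lambda x, y: torch.tanh(x) - torch.tanh(y)
layers = sample_binary_tree(depth=4, link=tanh_link)
```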
Finally, we evaluate the performance of the amortization scheme in non-linear timeseries inference problems. We used the population dynamics (PD) and the Lorenz (LZ) systems. The CF program was amortized using the approach described in Sec. 4.5. The MF and normalizing flow baselines were amortized using inference networks (see Supplementary D). All the details were identical to the timeseries experiments except for the fact that all time points were observed. Performance was again quantified as the average marginal log-probability of the ground-truth, kernel density estimated from the samples from the amortized variational programs. Each amortized program was trained once and tested on 50 newly sampled timeseries/observations pairs. The reported results are the mean and SEM of these repetitions.
Results: The results are reported in Table 3. As in the non-amortized case, CF greatly outperforms both MF and GF in both timeseries inference problems.
6. Discussion
Our CF approach solves the two main limitations of ASVI, as it can model collider dependencies and can be amortized automatically. We show that cascading flows programs have extremely high performance in structured inference problems. However, their ability to capture deep collider dependencies is lower than that of normalizing flows. In our opinion, designing principled ways to model collider dependencies is a promising research direction.
References
B. Milch, B. Marthi, S. Russell, D. Sontag, D. L. Ong, and A. Kolobov. BLOG: Probabilistic models with unknown objects. Statistical Relational Learning, page 373, 2007.

T. Sato. PRISM: A symbolic-statistical modeling language. In Proceedings of the International Joint Conference on Artificial Intelligence, 1997.

K. Kersting and L. De Raedt. Bayesian logic programming: Theory and tool. Statistical Relational Learning, page 291, 2007.

A. Pfeffer. IBAL: A probabilistic rational programming language. In Proceedings of the International Joint Conference on Artificial Intelligence, 2001.

S. Park, F. Pfenning, and S. Thrun. A probabilistic language based upon sampling functions. ACM SIGPLAN Notices, 40(1):171–182, 2005.

N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: A language for generative models. arXiv preprint arXiv:1206.3255, 2012.

D. Wingate, A. Stuhlmüller, and N. Goodman. Lightweight implementations of probabilistic programming languages via transformational compilation. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2011.

A. Patil, D. Huard, and C. J. Fonnesbeck. PyMC: Bayesian stochastic modelling in Python. Journal of Statistical Software, 35(4):1, 2010.

J. V. Dillon, I. Langmore, D. Tran, E. Brevdo, S. Vasudevan, D. Moore, B. Patton, A. Alemi, M. Hoffman, and R. A. Saurous. TensorFlow Distributions. arXiv preprint arXiv:1711.10604, 2017.

E. Bingham, J. P. Chen, M. Jankowiak, F. Obermeyer, N. Pradhan, T. Karaletsos, R. Singh, P. Szerlip, P. Horsfall, and N. D. Goodman. Pyro: Deep universal probabilistic programming. The Journal of Machine Learning Research, 20(1):973–978, 2019.

D. Tran, M. D. Hoffman, R. A. Saurous, E. Brevdo, K. Murphy, and D. M. Blei. Deep probabilistic programming. arXiv preprint arXiv:1701.03757, 2017.

D. Tran, A. Kucukelbir, A. B. Dieng, M. Rudolph, D. Liang, and D. M. Blei. Edward: A library for probabilistic modeling, inference, and criticism. arXiv preprint arXiv:1610.09787, 2016.

C. Bishop, D. Spiegelhalter, and J. Winn. VIBES: A variational inference engine for Bayesian networks. Neural Information Processing Systems, 15:793–800, 2002.

A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic differentiation variational inference. The Journal of Machine Learning Research, 18(1):430–474, 2017.

C. Bishop and J. Winn. Structured variational distributions in VIBES. In International Conference on Artificial Intelligence and Statistics, 2003.

D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, 2015.

G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

L. Ambrogioni, K. Lin, E. Fetig, S. Vikram, M. Hinne, D. Moore, and M. van Gerven. Automatic structured variational inference. In International Conference on Artificial Intelligence and Statistics, 2021.

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Neural Information Processing Systems, 2020.

J. Winn and C. M. Bishop. Variational message passing. Journal of Machine Learning Research, 6(Apr):661–694, 2005.

D. Wingate and T. Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.

M. D. Hoffman and D. M. Blei. Structured stochastic variational inference. In Proceedings of Artificial Intelligence and Statistics, pages 361–369, 2015.

D. Tran, D. Blei, and E. M. Airoldi. Copula variational inference. In Neural Information Processing Systems, 2015.

R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In International Conference on Machine Learning, pages 324–333, 2016.

D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling. Improved variational inference with inverse autoregressive flow. In Neural Information Processing Systems, 2016.

C. Weilbach, B. Beronov, F. Wood, and W. Harvey. Structured conditional continuous normalizing flows for efficient amortized inference in graphical models. In International Conference on Artificial Intelligence and Statistics, pages 4441–4451. PMLR, 2020.

R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

R. T. Q. Chen, J. Behrmann, D. K. Duvenaud, and J. Jacobsen. Residual flows for invertible generative modeling. In Neural Information Processing Systems, pages 9916–9926, 2019.

E. Dupont, A. Doucet, and Y. W. Teh. Augmented neural ODEs. In Neural Information Processing Systems, pages 3140–3150, 2019.

S. R. Eddy. Hidden Markov models. Current Opinion in Structural Biology, 6(3):361–365, 1996.

N. Foti, J. Xu, D. Laird, and E. Fox. Stochastic variational inference for hidden Markov models. In Neural Information Processing Systems, 2014.

M. Johnson and A. Willsky. Stochastic variational inference for Bayesian time series models. In International Conference on Machine Learning, 2014.

M. Karl, M. Soelch, J. Bayer, and P. van der Smagt. Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.

M. Fortunato, C. Blundell, and O. Vinyals. Bayesian recurrent neural networks. arXiv preprint arXiv:1704.02798, 2017.

C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
A. Experiments
The experiments were implemented in Python 3.8 using the packages PyTorch, NumPy and Matplotlib. The implementation code is contained in the "modules" folder. The probabilistic program objects are defined in the file "models.py" and the deep architectures are in "networks.py". The experiments can be run from the files "timeseries experiment.py", "collider linear experiments.py", "collider tanh experiments.py" and "amortized experiments.py" in the "experiments" folder.
B. Models
In the experiments, we use the following dynamical models.
B.1. Brownian motion (BR)
Dimensionality: J = 1
Drift: μ(x) = x
Diffusion: σ(x) = 1
Initial density: p(x₀) = N(0, ·)
Time horizon: T = 40
Time step: dt = 1
Regression noise: σ_lk = 1
Classification gain: k = 2

B.2. Lorenz system (LZ)
Dimensionality: J = 3
Drift: μ(x₁, x₂, x₃) = (10(x₂ − x₁), x₁(28 − x₃) − x₂, x₁x₂ − (8/3)x₃)
Diffusion: σ(x) = 2
Initial density: p(x₀) = N(3, ·)
Time horizon: T = 40
Time step: dt = 1
Regression noise: σ_lk = 3
Classification gain: k = 2

B.3. Population dynamics (PD)
Dimensionality: J = 2
Drift: μ(x₁, x₂) = (ReLU(a x₁ − b x₁x₂), ReLU(c x₁x₂ − d x₂)), with the coefficients of the main text
Diffusion: σ(x) = 2
Initial density: p(x₀) = N(0, ·)
Time horizon: T = 100
Time step: dt = 0.·
Regression noise: σ_lk = 3
Classification gain: k = 2

Note: The ReLU activation was added to avoid instabilities arising from negative values.
B.4. Recurrent neural network (RNN)
Dimensionality: J = 3
Drift: μ(x) = tanh(W₃ tanh(W₂ tanh(W₁ x + b₁) + b₂))
Diffusion: σ(x) = 0.·
Initial density: p(x₀) = N(0, ·)
Time horizon: T = 40
Time step: dt = 0.·
Regression noise: σ_lk = 1
Classification gain: k = 2

Note: W₁, W₂ and W₃ are weight matrices and b₁ and b₂ are bias vectors (one 5-dimensional and one 2-dimensional, matching the hidden layer widths). All the entries of these quantities are sampled from normal distributions with mean 0 and a fixed standard deviation (SD). These parameters were re-sampled for every repetition of the experiment.

C. Collider dependencies