Discrete flow posteriors for variational inference in discrete dynamical systems
Laurence Aitchison
University of Cambridge, Cambridge, CB2 1PZ, UK. [email protected]
Vincent Adam
PROWLER.io, 66-68 Hills Road, Cambridge, CB2 1LA, UK. [email protected]
Srinivas C. Turaga
HHMI Janelia Research Campus, Ashburn, VA 20147. [email protected]
Abstract
Each training step for a variational autoencoder (VAE) requires us to sample from the approximate posterior, so we usually choose simple (e.g. factorised) approximate posteriors in which sampling is an efficient computation that fully exploits GPU parallelism. However, such simple approximate posteriors are often insufficient, as they eliminate statistical dependencies in the posterior. While it is possible to use normalizing flow approximate posteriors for continuous latents, some problems have discrete latents and strong statistical dependencies, including the neuroscience problem of inferring discrete spiking activity from noisy calcium-imaging data: not only does the posterior inherit dependencies from synaptic connectivity encoded in the prior, but there are also strong explaining-away effects. The most natural approach to model these dependencies is an autoregressive distribution, but sampling from such distributions is inherently sequential and thus slow. We develop a fast, parallel sampling procedure for autoregressive distributions based on fixed-point iterations which enables efficient and accurate variational inference in discrete state-space latent variable dynamical systems. To optimize the variational bound, we considered two ways to evaluate probabilities: inserting the relaxed samples directly into the pmf for the discrete distribution, or converting to continuous logistic latent variables and interpreting the K-step fixed-point iterations as a normalizing flow. We found that converting to continuous latent variables gave considerable additional scope for mismatch between the true and approximate posteriors, which resulted in biased inferences; we thus used the former approach. Using our fast sampling procedure, we were able to realize the benefits of correlated posteriors, including accurate uncertainty estimates for one cell, and accurate connectivity estimates for multiple cells, in an order of magnitude less time.
The development of variational auto-encoders (VAEs) [1, 2] has enabled Bayesian methods to be applied to very large-scale data, by leveraging neural networks, trained by stochastic gradient ascent, to approximate the posterior. To train these neural network approximate posteriors, we need to sample from them in each stochastic gradient ascent step, restricting us to approximate posteriors in which sampling can be performed rapidly, by leveraging parallel GPU computations. As such,
most work on VAEs has used factorised approximate posteriors (e.g. [1–5]), but in many domains of interest we expect the posterior over latents to be highly correlated, not only because the posterior inherits correlations from the prior (e.g. as we might find in a dynamical system), but also because the likelihood itself can induce correlations due to effects such as explaining away [6, 7]. One approach to introducing correlations into the approximate posterior is normalizing flows [8, 9], which transform variables generated from a simple, often factorised distribution into a complex correlated distribution, in such a way that the determinant of the Jacobian (and thus the probability of the transformed variables) can easily be computed.

However, the normalizing flow approach can only be applied to continuous latents, and there are important problems which require discrete latent variables and correlated posteriors, making efficient and accurate stochastic variational inference challenging. In particular, we consider the neuroscience problem of inferring the correlated spiking activity of neural populations recorded by calcium imaging. Due to the indirect nature of calcium imaging, spike inference algorithms must be used to infer the underlying neural spiking activity leading to measured fluorescence dynamics [10, 11]. Not only does the data naturally induce strong explaining-away anticorrelations (as a spike in nearby timebins produces very similar data), but there are prior correlations induced by synaptic connectivity, which induce similar correlations in the approximate posterior.

To address these challenging tasks, with discrete latent variables and correlated approximate posteriors, we considered two approaches. First, we considered applying normalizing flows by transforming our discrete latents into continuous latents, which are thresholded to recover the original discrete variables [12]. However, we found that working with continuous variables gave rise to far more scope for mismatch between the approximate and true posteriors than working with discrete variables, and that this mismatch resulted in biased inferences. Instead, we developed a fast sampling procedure for discrete autoregressive posteriors: we take the more natural approach of an autoregressive approximate posterior that mirrors the structure of the prior, but to circumvent the slow sequential sampling procedure, we developed flow-like fixed-point iterations that are guaranteed to sample the true posterior after T iterations, but in practice converge much more rapidly (in our simulations, ∼5 iterations), and efficiently exploit GPU parallelism.

Applying the flow-like fixed-point iterations to simulated neuroscience problems, we were able to sample from autoregressive approximate posteriors in almost the same time required for factorised posteriors, and at least an order of magnitude faster than sequentially sampling from the underlying autoregressive process, allowing us to realize the benefits of correlated posteriors in large-scale settings.

The evidence lower bound objective (ELBO) takes the form of an expectation computed over the approximate posterior, Q_φ(z),
$$\mathcal{L} = \mathbb{E}_{Q_\phi(z)}\left[\log P_\theta(x|z) + \log P_\theta(z) - \log Q_\phi(z)\right]. \quad (1)$$
Optimizing this objective with respect to the generative model parameters, θ, is straightforward, as we can push the derivative inside the expectation, and perform stochastic gradient descent.
However, we cannot do the same with the recognition parameters, φ, as they control the distribution over which the expectation is taken. To solve this issue, the usual approach is the reparameterisation trick [1, 2], which performs the expectation over IID random noise, and transforms this noise into samples from the approximate posterior,
$$\mathcal{L} = \mathbb{E}_{\epsilon}\left[\log P_\theta(x, z(\epsilon; \phi)) - \log Q_\phi(z(\epsilon; \phi))\right]. \quad (2)$$
As the recognition parameters φ no longer appear in the distribution over which the expectation is taken, we can again optimize this expression using stochastic gradient descent.

While the reparameterisation trick is extremely effective for continuous latent variables, it cannot be used for discrete latents, as we cannot back-propagate gradients through discrete latents. To rectify this issue, one approach is to relax the discrete variables, z̄, to form an approximately equivalent model with continuous random variables, ẑ, through which gradients can be propagated [12, 13]. To consider the simplest possible case, we take a single binary variable, z̄, drawn from a Bernoulli distribution with log-odds ratio u, and probability p = σ(u) = 1/(1 + e^{-u}),
$$\bar{P}(\bar{z}) = \mathrm{Bernoulli}(\sigma(u)). \quad (3)$$
Instead of sampling z̄ directly from a Bernoulli, we can obtain samples of z̄ from the same distribution by first sampling from a Logistic, and thresholding that sample using,
$$P(l) = \mathrm{Logistic}(u, 1), \qquad \bar{z} = \Theta(l) = \lim_{\beta \to \infty} \sigma(\beta l), \qquad \hat{z} = \sigma(\beta l), \quad (4)$$
where Θ(l) is the Heaviside step function, which is 0 for negative inputs and 1 for positive inputs, and β is an inverse-temperature parameter. For a non-infinite temperature, we obtain an analogous relaxed variable, ẑ, that lies between 0 and 1.

Notably, for any (directed) model with discrete latents, including priors and approximate posteriors in the VAE setting, we can always use this approach to find an equivalent model with thresholded continuous latent variables.

While we can usually use the same functional form for the likelihood for relaxed and discrete models, it is less clear how we should compute the probabilities of the relaxed discrete latent variables under the prior and approximate posterior. In particular, we have two options for how we evaluate probabilities for the relaxed variables.

The most straightforward approach is to simply insert the relaxed variables, ẑ, into the original probability mass function for the discrete model, P̄(ẑ) and Q̄(ẑ) [13]. Taking the univariate example, this gives,
$$\log \bar{P}(\hat{z}) = \hat{z} \log p + (1 - \hat{z}) \log(1 - p). \quad (5)$$
However, this immediately highlights a key issue: to obtain a valid variational bound, we need to evaluate the probability density of samples from the approximate posterior. In our case, samples from the approximate posterior are the relaxed variables, ẑ, so we need to evaluate Q̂(ẑ). However, we are actually using a different expression, Q̄(ẑ), and while this may in practice be an effective approximation, it cannot give us a valid variational bound, and therefore we must be (and are) careful to evaluate our model using Q̄(z̄), which we can compute but not differentiate.

To obtain a valid variational bound, we need to evaluate the actual probability of the relaxed variable, i.e. we need to compute Q̂(ẑ) and P̂(ẑ). While we could work with ẑ directly, this is known to be numerically unstable, so instead we work in terms of the logistic variables, l [12].
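To make the relaxation concrete, the following is a minimal NumPy sketch of Eqs. (3)–(5) for a single binary latent. The function names, and the particular values of u and β, are our own illustrative choices; in practice these operations would be written in an autodiff framework so that gradients can flow through ẑ.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_logistic(u):
    """Draw l ~ Logistic(u, 1) by transforming uniform noise (Eq. 4)."""
    eps = rng.uniform(1e-6, 1 - 1e-6)
    return u + np.log(eps) - np.log1p(-eps)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

u = 0.3        # log-odds of the Bernoulli (illustrative value)
beta = 10.0    # inverse temperature of the relaxation

l = sample_logistic(u)        # continuous logistic latent
z_bar = float(l > 0)          # hard threshold: an exact Bernoulli(sigmoid(u)) sample
z_hat = sigmoid(beta * l)     # relaxed sample, lying strictly between 0 and 1

# Option 1 (Eq. 5): insert the relaxed sample into the *discrete* log-pmf.
p = sigmoid(u)
log_pmf_relaxed = z_hat * np.log(p) + (1.0 - z_hat) * np.log(1.0 - p)
```

As β → ∞ the relaxed sample ẑ approaches the hard sample z̄, recovering the discrete model.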
These two approaches are equivalent, in the sense that the ratio of prior and approximate posterior probabilities is the same, because the gradient terms introduced by the change of variables cancel. To understand how we set the distributions, P̂(l) and Q̂(l), we remember that the ultimate goal is to have a relaxed, continuous model which closely approximates the discrete model of interest. As such, we remember that we could obtain an exactly equivalent model by sending the inverse-temperature to ∞, and using,
$$P(l) = \mathrm{Logistic}(u; 1), \qquad Q(l) = \mathrm{Logistic}(v; 1). \quad (6)$$
Further, note that we have not specified whether these distributions correspond to the discrete or relaxed model, as we can obtain either ẑ or z̄ from exactly the same l, by using different inverse-temperatures for the transformation (Eq. 4).

We are interested in discrete dynamical systems, which can be written directly in terms of the discrete variables, or equivalently in terms of continuous variables drawn from a Logistic distribution,
$$\bar{P}(\bar{\mathbf{z}}_t \mid \bar{\mathbf{z}}_{1:t-1}) = \mathrm{Bernoulli}(\sigma(\mathbf{u}_t)), \qquad P(\mathbf{l}_t \mid \mathbf{l}_{1:t-1}) = \mathrm{Logistic}(\mathbf{u}_t, 1), \quad (7a)$$
$$\mathbf{u}_t = \mathbf{u}_t(\bar{\mathbf{z}}_{1:t-1}) = \lim_{\beta \to \infty} \mathbf{u}(\sigma(\beta \mathbf{l}_{1:t-1})). \quad (7b)$$
To form the approximate posterior, the most straightforward approach is to use another discrete autoregressive process, as this allows us to effectively model any correlations induced by the prior,
$$\bar{Q}(\bar{\mathbf{z}}_t \mid \mathbf{x}, \bar{\mathbf{z}}_{1:t-1}) = \mathrm{Bernoulli}(\sigma(\mathbf{v}_t)), \qquad Q(\mathbf{l}_t \mid \mathbf{x}, \mathbf{l}_{1:t-1}) = \mathrm{Logistic}(\mathbf{v}_t, 1), \quad (8a)$$
$$\mathbf{v}_t = \mathbf{v}_t(\mathbf{x}, \bar{\mathbf{z}}_{1:t-1}) = \lim_{\beta \to \infty} \mathbf{v}(\mathbf{x}, \sigma(\beta \mathbf{l}_{1:t-1})). \quad (8b)$$
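For reference, sampling the autoregressive approximate posterior of Eq. (8) sequentially looks as follows. This is a sketch under our own assumptions about the interface, written for a single cell for clarity: `v_fn` stands for any recognition function mapping the data and the samples drawn so far to the next log-odds (the particular form used in our experiments appears later, in Eq. 14).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_sequential(v_fn, x, T):
    """Sequentially draw l_1..l_T from Eq. (8): l_t ~ Logistic(v_t(x, z_{1:t-1}), 1)."""
    l = np.zeros(T)
    z = np.zeros(T)
    for t in range(T):
        v_t = v_fn(x, z[:t])                  # log-odds from data and past spikes only
        eps = rng.uniform(1e-6, 1 - 1e-6)
        eta = np.log(eps) - np.log1p(-eps)    # Logistic(0, 1) noise
        l[t] = eta + v_t                      # a Logistic(v_t, 1) sample
        z[t] = float(l[t] > 0)                # the corresponding binary spike
    return l, z
```

Each step depends on the previous samples, so the loop cannot be parallelised across time; this is exactly the bottleneck addressed by the fixed-point iterations below.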
Figure 1: Differences between autoregressive and flow posteriors, based on discrete or continuous latents. A The sequential autoregressive (top) and parallel flow-like sampling procedure (bottom). B The true and approximate posteriors for continuous Logistic latent variables. C The difference between the model evidence and the ELBO induced by the posterior mismatch in B. D Biased inferences induced by the mismatch between the prior and approximate posterior. E The standard deviation of ẑ as we modulate the standard deviation of a Gaussian likelihood encouraging ẑ ≈ 0.5.

While we can sample the autoregressive approximate posterior given above, in practice this can be extremely slow for either the discrete or relaxed model, as the sequential structure (Fig. 1A top) fails to fully exploit the parallelism available on today's GPU hardware. As such, we considered fixed-point iterations analogous to normalizing flows which do fully utilize GPU parallelism, and rapidly converge to a true sample (Fig. 1A bottom). In particular, these iterations begin by sampling IID noise, η, from Logistic(0, 1), and iterating,
$$\mathbf{l}^{k+1}(\boldsymbol{\eta}, \mathbf{l}^{k}) = \boldsymbol{\eta} + \mathbf{v}\left(\mathbf{x}, \sigma\left(\beta \mathbf{l}^{k}\right)\right), \quad (9)$$
where all the variables are N × T matrices, and where v(x, σ(β l^k)) is simply Eq. (8b) computed in parallel across time for an externally specified input, rather than previous samples,
$$\mathbf{l}^{k+1}_{t}\left(\boldsymbol{\eta}_t, \mathbf{l}^{k}_{1:t-1}\right) = \boldsymbol{\eta}_t + \mathbf{v}_t\left(\mathbf{x}, \sigma\left(\beta \mathbf{l}^{k}_{1:t-1}\right)\right). \quad (10)$$
We could use any initial condition, but the most straightforward is given by z_{it} = 0, or equivalently, l_{it} = −∞. Importantly, these iterations are guaranteed to sample from the original autoregressive model if we run as many iterations as there are time-steps (K = T). In particular, we can see that the result of the first iteration, l^1, is correct at timestep 1, i.e. l^1_1 = l^{k≥1}_1, as v_1 takes input only from data, x. Further, the result of the second iteration is correct up to timestep 2, as v_2 only takes input from l_1, and this is correct (i.e. it equals l^{k≥1}_1). In general, we have l^t_t = l^{k≥t}_t. While this worst case is extremely slow, converting an O(T) computation into an O(T²) computation, in practice the iterations reach steady-state rapidly, so we are able to use K ≪ T, giving an order of magnitude improvement in efficiency by making improved use of GPU parallelism.

While this section gives a fast procedure for sampling from the approximate posterior, it is still unclear how we should evaluate probabilities under this approximate posterior. We have two options, which mirror the two options introduced above (Sec. 2.1 and Sec. 2.2 respectively).

First, we can compute relaxed Bernoulli variables using ẑ = σ(β l^K), and insert them into the probability mass function for the autoregressive discrete approximate posterior (Eq. 8a). While this can be computed efficiently in parallel, it introduces another level of mismatch between the distribution we sample from and the distribution under which we evaluate the probability, in the sense that we sample using the fixed-point iterations, but evaluate probabilities under the sequential autoregressive process (Eq. 8a). Remarkably, this is often not an issue for the discrete model, as we can simply iterate until convergence, and convergence is very well-defined as the latents are either 0 or 1.
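Before turning to how convergence behaves for the relaxed model, here is a sketch of the parallel sampler of Eqs. (9)–(10). It reuses the same logistic noise η at every iteration and updates all timesteps at once. Again this is a sketch for a single cell under our own interface assumptions: `v_fn_parallel` returns log-odds for all timesteps in one pass, with v_t depending only on entries before t (e.g. a causal temporal convolution, as in Eq. 14).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_fixed_point(v_fn_parallel, x, T, K=5, beta=10.0):
    """Flow-like sampler (Eqs. 9-10): K fixed-point iterations, each parallel over time."""
    eps = rng.uniform(1e-6, 1 - 1e-6, size=T)
    eta = np.log(eps) - np.log1p(-eps)           # eta ~ Logistic(0, 1), drawn once
    z_relaxed = np.zeros(T)                      # initial condition: z = 0 (l = -inf)
    for _ in range(K):
        l = eta + v_fn_parallel(x, z_relaxed)    # Eq. (9); the same eta at every iteration
        z_relaxed = 1.0 / (1.0 + np.exp(-beta * l))
    return l, (l > 0).astype(float)
```

With K = T this reproduces the sequential sampler exactly (timestep t is fixed after iteration t); in the experiments below, K = 5 sufficed.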
However, for the relaxed model it is more difficult to define convergence, as the latents can lie anywhere between 0 and 1, so in practice we used a fixed number of iterations (in particular, K = 5).

To evaluate the probability of the continuous variables under the fixed-point iterations, we interpret the iterations as constituting a normalizing flow. Normalizing flows exploit the fact that we can compute the probability density of a random variable, l^k(η), generated by transforming, via a one-to-one function, a sample η,
$$P\left(\mathbf{l}^{k}(\boldsymbol{\eta})\right) = P(\boldsymbol{\eta}) \left| \frac{\partial \mathbf{l}^{k}(\boldsymbol{\eta})}{\partial \boldsymbol{\eta}} \right|^{-1}, \quad (11)$$
where |∂l^k/∂η| is the absolute value of the determinant of the Jacobian of l^k(η). While the determinant of the Jacobian is often very difficult to compute, we can ensure the Jacobian determinant is 1 by using a restricted family of transformations, under which the value of l^k_t depends only on the current value, η_t, via simple addition, and on past values of η via an arbitrary function,
$$l^{k+1}_{t} = \eta_t + f^{k+1}\left(\mathbf{x}, \boldsymbol{\eta}_{1:t-1}\right). \quad (12)$$
And we know that our fixed-point iterations indeed lie in this family of functions, as Eq. (10) mirrors the above definition, with,
$$f^{k+1}\left(\mathbf{x}, \boldsymbol{\eta}_{1:t-1}\right) = \mathbf{v}_t\left(\mathbf{x}, \sigma\left(\beta \mathbf{l}^{k}_{1:t-1}\right)\right), \quad (13)$$
where l^k_{1:t−1} indeed depends only on η_{1:t−1} (see Eq. 10); a short code sketch of this density computation is given below.

Using continuous latent variables might seem appealing, as we can not only compute the exact approximate posterior probability, even when the fixed-point iterations have not converged, but the normalizing flow framework gives us considerable flexibility in choosing f^{k+1} (i.e. we are not restricted to normalizing flows which have an interpretation as a discrete dynamical system). However, moving from a discrete latent variable (Eq. 3 with u = 0, and thus σ(u) = 0.5) to a continuous latent variable (Eq. 4) introduces scope for considerable mismatch between the true and approximate posterior (Fig. 1B), which simply is not possible for the binary model, where the posterior remains Bernoulli. This mismatch between the approximate and true posterior implies a discrepancy between the ELBO and the true model evidence (Fig. 1C), and this discrepancy grows as the evidence in favour of z̄ = 0 or z̄ = 1 increases. This mismatch has a potentially large magnitude (compare the scale on Fig. 1B to that in Fig. 2F), and thus can dramatically modify the variational inference objective function, introducing the possibility for biased inferences. In fact, this is exactly what we see (Fig. 1BD), with the approximate posterior underestimating the evidence in favour of z̄ = 0 or z̄ = 1.

Further, the Logistic approximate posterior can introduce additional biases when optimized in combination with relaxed binary variables (Figs. 1B–D use a hard-thresholding or, equivalently, an infinite inverse-temperature). In particular, we consider a Gaussian likelihood, P(x = 0.5 | ẑ) = N(x = 0.5; ẑ, σ), and thus,
as σ decreases, there is increasing evidence that ẑ is close to 0.5. Under the true model, with a hard threshold, this can never be achieved: z̄ must be either 0 or 1. However, if we combine a relaxation (Eq. 4) with a Logistic approximate posterior, it is possible to reduce the variance of the logistic sufficiently such that the relaxed Bernoulli variable, ẑ, is indeed close to 0.5. To quantify this effect, we consider how the standard deviation of ẑ (which should be 0.5 under the true posterior) varies as we change the standard deviation of the likelihood, σ (Fig. 1E).
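Returning to the density computation of Eqs. (11)–(13): because each l_t is η_t plus a function of earlier η's only, the Jacobian of the map η → l is triangular with a unit diagonal, so its log-determinant vanishes and the flow density is just the density of the logistic noise. A minimal sketch, where η is the noise drawn in the fixed-point iterations above:

```python
import numpy as np

def logistic_logpdf(eta):
    """Elementwise log-density of Logistic(0, 1) noise."""
    return -eta - 2.0 * np.logaddexp(0.0, -eta)

def flow_log_q(eta):
    """log Q(l^K(eta)) under Eq. (11): log P(eta) minus a zero log-Jacobian term."""
    return np.sum(logistic_logpdf(eta))
```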
Figure 2: Correlated approximate posteriors in a single-channel (neuron) model. A Factorised generative model. B Correlated approximate posterior, with an autoregressive temporal structure, but without correlations between cells. C Example observed fluorescence trace and average reconstructions (top), inferred rates (middle) and inferred spiking (bottom), for the true posterior (dark gray), then models with factorised posteriors trained using VAE (red) and supervised (purple) procedures, and finally our new discrete flow trained using VAE (blue). D The true marginal probability of there being a spike in simulated data, against inferred probability. The optimal is unity (dark gray). E The number of inferred spikes, given only one spike in the underlying data. F The time course of the VAE objective under the different models (factorised supervised; discrete and continuous factorised; discrete and continuous flow; discrete and continuous autoregressive), and the highest possible value for the objective, estimated using importance sampling. G The time required for a single iteration of the algorithm, for the different variants in F.

One case where binary latent variables are essential is that of calcium spike deconvolution: inferring latent binary variables representing the presence or absence of a neural spike in a small time bin, based on noisy optical observations (Fig. 2C).

We take the binary variables, z_{it}, as representing whether neuron i spiked in time-bin t, and as such, we can interpret u_{it} as the corresponding synaptic input (in contrast, as v_t is not part of the generative model, it does not have a specific biological interpretation). For the generative model, we take a weighted sum of inputs from past spikes, filtered by a temporal kernel κ_u, which is mirrored by the recognition model with kernel κ_v, to ensure that the recognition model can capture any prior-induced statistical structure,
$$\mathbf{u}_t = \mathbf{b}_u + W_u \sum_{t'=1}^{\tau} \boldsymbol{\kappa}^{t'}_u \odot \mathbf{z}_{t-t'}, \qquad \mathbf{v}_t = \mathbf{b}_v(\mathbf{x}) + W_v \sum_{t'=1}^{\tau} \boldsymbol{\kappa}^{t'}_v \odot \mathbf{z}_{t-t'}, \quad (14)$$
where ⊙ is the Hadamard product, so r = g ⊙ h implies r_i = g_i h_i (a code sketch of this parameterisation is given after Fig. 3).

Figure 3: Correlated approximate posteriors to estimate the connectivity between multiple neurons. A Autoregressive generative model, incorporating the effects of synaptic connectivity (note the lack of self-connections). B Correlated autoregressive approximate posterior. C The inferred and ground-truth weights using factorised, flow and autoregressive posteriors.
D–F The correlation (D), slope (E), and bias (F) for the points in C, plotted across training time. G The time course of the VAE objective under the different models. H The time required for a single training iteration of the algorithm, for the different models.
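As a concrete sketch of the parameterisation in Eq. (14), the kernel sum can be written as a causal temporal convolution over the spike history. The array shapes (T time bins, N cells, kernel length τ) and the function name are our own assumptions.

```python
import numpy as np

def synaptic_input(z, W, kappa, b):
    """Eq. (14): u_t = b + W (sum_{t'=1..tau} kappa^{t'} ⊙ z_{t-t'}), for every t.

    z     : (T, N) binary or relaxed spikes
    W     : (N, N) connectivity weights (W_u or W_v)
    kappa : (tau, N) per-cell temporal kernel
    b     : (N,) bias; for the recognition model this would be b_v(x), a network output
    """
    T, N = z.shape
    tau = kappa.shape[0]
    u = np.tile(b, (T, 1)).astype(float)
    for dt in range(1, tau + 1):
        past = np.zeros((T, N))
        past[dt:] = z[:-dt]                  # z_{t - dt}, zero-padded before the first bin
        u += (kappa[dt - 1] * past) @ W.T    # W applied to (kappa^{dt} ⊙ z_{t-dt})
    return u
```

For the generative model of Fig. 3A, the diagonal of W_u would additionally be zeroed to remove self-connections.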
We considered a single neuron, whose spiking was IID (Poisson) with firing rate . Hz (Fig. 2A). Fluorescence data was simulated by convolving spikes with a double-exponential temporal kernel with rise time 0.3 s and decay time  s, and adding noise with standard deviation e−. We learned the recognition model, which consisted of two components. The first is a neural network mapping data to spike inferences, b_v(x), which consisted of two hidden layers, with 20 units per cell per time bin, where the first layer takes input from time points from a single cell's fluorescence trace, and the second layer takes time points from the previous hidden layer, and we use ELU nonlinearities [14] (for further details see [11, 5]). The second is the recognition temporal kernel κ_v, which captures the anticorrelations induced by explaining away (Fig. 2B,C).

All strategies, including factorised and autoregressive (flow) VAEs, and supervised training, give roughly similar reconstructions (Fig. 2C top). Thus, to understand how the autoregressive posterior is superior, we need to look in more depth at the posteriors themselves. In particular, the factorised VAE had very narrow posteriors, spuriously indicating a very high degree of certainty in the spike timing (Fig. 2C middle), whereas both the supervised and autoregressive VAE, and also the true posterior (estimated by importance sampling), indicate a higher level of temporal uncertainty. These differences are even more evident if we consider spike trains sampled from the approximate posterior (Fig. 2C bottom), or if we consider calibration: the probability of there actually being a spike in the underlying data, when the inference method under consideration indicates a particular probability of spiking (Fig. 2D). However, the sampled spike trains indicate another issue: while the true and VAE posteriors generally have one spike corresponding to each ground-truth spike, supervised training produces considerable uncertainty about the spike-count (Fig. 2E). Thus, the VAE with an autoregressive (flow) posterior combines the best of both worlds: achieving reasonable timing uncertainty (unlike the factorised VAE), whilst achieving reasonable spike counts (unlike supervised training).

As such, the autoregressive VAE performs more effectively than the factorised methods under the IWAE objective (with 10 samples) [15], under which the models are trained (Fig. 2F). As expected given Sec. 3.4, to get good performance, it is important to compute probabilities by putting the relaxed variables into the discrete pmf (binary, as opposed to logistic), and to use the parallel fixed-point-iteration-based flow, as opposed to the autoregressive distribution (note, however, that we did not thoroughly explore the space of normalizing flows). The considerable differences in speed between the approximate parallel fixed-point iterations and the exact autoregressive computation arise from an order-of-magnitude difference in the time required for an individual iteration (Fig. 2G).

A second problem is inferring synaptic connectivity between cells, based on noisy observations. Here, we considered a network of cells with no self-connectivity, and with weights, W_u, that are sparse (probability of . of being non-zero), with the non-zero weights drawn from a Gaussian with standard deviation (Fig. 3A), and with a ms temporal decay. We used an autoregressive, fully connected recognition model (Fig. 3B).
We use the same parameters as in the above simulation, except that we use a somewhat more realistic fluorescence rise-time of 100 ms (previously, we used 300 ms so as to highlight uncertainty in timing).

We inferred weights under three methods: a factorised VAE, and an autoregressive posterior where we use the fast flow-like sampling procedure (flow), and where we use the slow sequential sampling procedure (autoregressive) (Fig. 3C). The flow posterior gave a considerable advantage over the other methods in terms of the correlation between ground truth and inferred weights (Fig. 3D), and in terms of the slope (Fig. 3E), indicating a reduction in the bias towards underestimating the magnitude of weights, while the bias (i.e. the additive offset in Fig. 3C) remained small for all the methods. As such, the autoregressive posterior with flow-based sampling increased the ELBO considerably over the factorised model, or the autoregressive model with the slow sequential sampling (Fig. 3G), and these differences again arise because of large differences in the time required for a single training iteration (Fig. 3H).

We have described an approach to sampling from a discrete autoregressive distribution using a parallel, flow-like procedure, derived by considering a fixed-point iteration that converges to a sample from the underlying autoregressive process. We applied this sampling procedure to autoregressive approximate posteriors in variational inference, in order to sample rapidly from the approximate posterior, and hence to speed up training. This allowed us to rapidly learn autoregressive posteriors in the context of neural data analysis, allowing us to realise the benefits of autoregressive approximate posteriors for single- and multi-cell data in reasonable timescales.

It is important to remember that while we can sample using K fixed-point iterations, we can only evaluate the probability of a sample once it has converged. This mismatch introduces a level of approximation in addition to those that are typical when relaxing discrete distributions [13, 12], but we can deal with the additional approximation error in the same way: by evaluating the model using samples drawn from the underlying discrete, autoregressive approximate posterior.

Finally, our work suggests two directions that may be interesting to pursue. First, while one can straightforwardly use normalizing flows to define approximate posteriors for latent dynamical systems with continuous latent variables, it may be difficult to define approximate posteriors that are well suited to the underlying dynamical process. In this context, our procedure of using fixed-point iterations may be a useful starting point. Second, we showed that while it may be possible to convert a discrete latent variable model to an equivalent model with continuous latents, this typically introduces considerable scope for mismatch between the prior and approximate posterior. However, the actual approximate posterior is relatively simple, a mixture of truncated Logistics, and as such, it may be possible to design approximate posteriors that more closely match the true posterior, though it is less clear how such distributions might be embedded within a normalizing flow.

References

[1] D. P. Kingma and M. Welling, "Auto-encoding variational bayes," ICLR, 2014.
[2] D. J. Rezende, S. Mohamed, and D. Wierstra, "Stochastic backpropagation and approximate inference in deep generative models," ICML, 2014.
[3] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, "Weight uncertainty in neural networks," arXiv preprint arXiv:1505.05424, 2015.
[4] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, "beta-VAE: Learning basic visual concepts with a constrained variational framework," ICLR, 2017.
[5] L. Aitchison, L. Russell, A. M. Packer, J. Yan, P. Castonguay, M. Häusser, and S. C. Turaga, "Model-based Bayesian inference of neural activity and connectivity from all-optical interrogation of a neural circuit," in Advances in Neural Information Processing Systems, pp. 3486–3495, 2017.
[6] J. Pearl, "Embracing causality in default reasoning," Artificial Intelligence, vol. 35, pp. 259–271, 1988.
[7] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[8] D. Rezende and S. Mohamed, "Variational inference with normalizing flows," in International Conference on Machine Learning, pp. 1530–1538, 2015.
[9] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, "Improved variational inference with inverse autoregressive flow," in Advances in Neural Information Processing Systems, pp. 4743–4751, 2016.
[10] J. Friedrich, P. Zhou, and L. Paninski, "Fast online deconvolution of calcium imaging data," PLoS Computational Biology, vol. 13, p. e1005423, 2017.
[11] A. Speiser, J. Yan, E. W. Archer, L. Buesing, S. C. Turaga, and J. H. Macke, "Fast amortized inference of neural activity from calcium imaging data with variational autoencoders," in Advances in Neural Information Processing Systems, pp. 4024–4034, 2017.
[12] C. J. Maddison, A. Mnih, and Y. W. Teh, "The concrete distribution: A continuous relaxation of discrete random variables," arXiv preprint arXiv:1611.00712, 2016.
[13] E. Jang, S. Gu, and B. Poole, "Categorical reparameterization with Gumbel-Softmax," arXiv preprint arXiv:1611.01144, 2016.
[14] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," arXiv preprint arXiv:1511.07289, 2015.
[15] Y. Burda, R. Grosse, and R. Salakhutdinov, "Importance weighted autoencoders," arXiv preprint arXiv:1509.00519, 2015.