Dynamic Variational Autoencoders for Visual Process Modeling
Alexander Sagel*, Hao Shen†
fortiss - The Research Institute of the Free State of Bavaria, Munich, Germany
*[email protected]. Alexander Sagel carried out a part of the research during his stay at Inria in Rennes, France.
[email protected]

ABSTRACT
This work studies the problem of modeling visual processes by leveraging deep generative architectures for learning linear, Gaussian representations from observed sequences. We propose a joint learning framework combining a vector autoregressive model and a Variational Autoencoder. This results in an architecture that allows Variational Autoencoders to simultaneously learn a non-linear observation model as well as a linear state model from sequences of frames. We validate our approach on the synthesis of artificial sequences and dynamic textures. To this end, we use our architecture to learn a statistical model of each visual process and generate a new sequence from each learned visual process model.
Index Terms — Neural Networks, Statistical Learning, Video Signal Processing, Unsupervised Learning
1. INTRODUCTION
Many visual real-world phenomena, such as motion captures of bird migrations, crowd movements, or ocean waves, can be thought of as random processes. From a statistical point of view, these visual processes can thus be described in terms of their probabilistic properties, provided that an appropriate model can be inferred from observed sequences of the process. Such a model can be helpful for studying the dynamic properties of the process, estimating possible trajectories [1], detecting anomalous behavior [2], or generating new sequences [3].

One of the most classic models for visual processes is the linear dynamic system (LDS) based approach proposed by Doretto et al. [4]. It is a holistic and generative sequential latent variable description. Since it is essentially a combination of linear transformations and additive Gaussian noise, it is mathematically simple and easy to interpret. However, it bears the disadvantage of constraining each video frame to lie in a low-dimensional affine space. Since real-world observation manifolds are anything but linear, this restricts the applicability to very simple visual processes.

The contribution of our work is to generalize the LDS model by Doretto et al. to non-linear visual processes, while still keeping its advantages of being an easy-to-analyze latent variable model. To this end, we use Variational Autoencoders to learn a non-linear state space model in which the latent states still follow a linear, autoregressive dynamic. This is done by supplementing the Variational Autoencoder architecture with a linear layer that models the temporal transition.
2. BACKGROUND
The dynamic texture model in [4] is an LDS of the form

$$ h_{t+1} = A h_t + v_t, \qquad y_t = \bar{y} + C h_t, \qquad (1) $$

where $h_t \in \mathbb{R}^n$, $n < d$, is the low-dimensional state space variable at time $t$, $A \in \mathbb{R}^{n \times n}$ the state transition matrix, $y_t \in \mathbb{R}^d$ the observation at time $t$, and $C \in \mathbb{R}^{d \times n}$ the full-rank observation matrix. The vector $\bar{y} \in \mathbb{R}^d$ represents a constant offset of the observation space, and the input term $v_t$ is modeled as i.i.d. zero-mean Gaussian noise that is independent of $h_t$. Learning a model as described in Eq. (1) is done by inferring $A$ and $C$ from one or several videos. To do so, it is sensible to assume that the process is (approximately) second-order stationary and first-order Markov. We make the same two assumptions, although our method is easily generalizable to Markov processes of higher orders.

Real-world visual processes are generally non-linear and non-Gaussian. Nevertheless, the model in Eq. (1) has considerable appeal in terms of simplicity. This is particularly true for the latent model of $h_t$ in the system equation, which consists of a vector autoregressive (VAR) process. For instance, while it is theoretically possible to replace the latent VAR by a recurrent neural network (RNN) with noise input to obtain a much richer class of temporal progressions, such an adjustment would lead to a loss of tractability, due to the unpredictable long-term behavior of RNNs as opposed to linear transition models. Modeling the expected temporal progression as a matrix multiplication also greatly simplifies the inversion and interpolation of a temporal step by means of matrix algebra. Luckily, we can still generalize the system in Eq. (1) while keeping the latent VAR model by employing

$$ h_{t+1} = A h_t + v_t, \qquad y_t = C(h_t), \qquad (2) $$

where $n < d$ and $C : \mathbb{R}^n \to \mathbb{R}^d$ is a function which maps to the observation manifold $\mathcal{M}$, i.e. the manifold of the frames produced by the visual process, and is bijective on $C^{-1}(\mathcal{M})$.

We propose to keep the latent VAR model and combine it with a Variational Autoencoder (VAE) [5] to generate video frames on a non-linear observation manifold from the VAR samples. Since it would go beyond the scope of this paper to review VAEs, we refer the reader to [6] and make do with the simplification that the VAE is an autoencoder-like deep architecture that can be trained such that the decoder maps standard Gaussian noise to samples of the probability distribution of the high-dimensional training data. This is done by optimizing the so-called variational lower bound.

While it is straightforward to learn a deep generative model such as a VAE and combine it with a separately learned VAR process, we propose to learn both aspects jointly in order to fully exploit the model capacity. One can thus describe our method as system identification [7] with deep learning.

Combinations of latent VAR models with learned, non-linear observation functions have been explored in previous works. For instance, in [8] the authors apply the kernel trick to do so. Approaches combining VAR models with deep learning have been studied in [9, 10], but neither of these employs VAEs. By contrast, the authors of [11] combine Linear Dynamic Systems (LDSs) with VAEs.
However, their model is locally linear, and the transition distribution is modeled in the variational bound, whereas we model it as a separate layer. This is also the main difference to the work [12], in which VAEs are used as Kalman filters, and the work [13], which uses VAEs for state space identification of dynamic systems. Similarly, the work [1] combines VAEs with linear dynamic models for forecasting images, but proposes a training objective that is considerably more complex than ours.

From an application point of view, generative modeling of visual processes is closely related to video synthesis, which has been discussed, among others, in [3, 14, 15]. Two of the most recent works on the generative modeling and synthesis of visual processes are [16, 17]. Both models achieve very good synthesis results for dynamic textures. One advantage of our model over these techniques is the ability to synthesize videos frame-by-frame in an online manner once the model is trained, without significantly increasing memory and time consumption as the sequence to be synthesized becomes longer. This is because our model does not require an optimization procedure for each synthesized sequence.
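Before turning to the joint learning procedure, the baseline of Eq. (1) can be made concrete with a minimal sketch (ours, for illustration only; the parameters A, C, and ȳ are assumed to have been identified beforehand, e.g. via a subspace method as in [7]):

    import numpy as np

    def sample_lds(A, C, y_bar, num_frames, rng=None):
        # Sample from Eq. (1): h_{t+1} = A h_t + v_t, y_t = y_bar + C h_t,
        # with v_t i.i.d. zero-mean Gaussian and initial state h_1 ~ N(0, I_n).
        rng = np.random.default_rng() if rng is None else rng
        n, d = A.shape[0], C.shape[0]
        h = rng.standard_normal(n)
        frames = np.empty((num_frames, d))
        for t in range(num_frames):
            frames[t] = y_bar + C @ h               # observation equation
            h = A @ h + rng.standard_normal(n)      # linear state transition
        return frames

Each generated frame lies in the affine space spanned by the columns of C around ȳ, which is precisely the restriction that the non-linear observation function of Eq. (2) lifts.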
3. JOINT LEARNING VIA DYNAMIC VAES
In the following, we describe our method to learn a model described by Eq. (2) from video sequences. We use upright characters, e.g. $\mathbf{x}, \mathbf{y}$, to denote random variables, as opposed to italic ones, e.g. $x, y$, which we use for all other quantities.

Without loss of generality [18, 19], we assume that the marginal distribution of each state space variable $h_t$ is standard Gaussian at any time $t$. Thus, the model in Eq. (2) is entirely described by the matrix $A$ and the function $C$. A sequence of length $N$ of a visual process $Y$ is viewed as a realization of the random variable $\mathbf{y}^N = [\mathbf{y}_1, \dots, \mathbf{y}_N]$. The according sequence in the latent space is a realization of the random variable $\mathbf{h}^N = [\mathbf{h}_1, \dots, \mathbf{h}_N]$. Let us define the random variable

$$ \tilde{\mathbf{y}}^N = [\tilde{\mathbf{y}}_1, \dots, \tilde{\mathbf{y}}_N] := [C(\mathbf{h}_1), \dots, C(\mathbf{h}_N)]. \qquad (3) $$

In order to model $Y$ by Eq. (2), we need to make sure that the probability distributions of $\mathbf{y}^N$ and $\tilde{\mathbf{y}}^N$ coincide for any $N \in \mathbb{N}$. Taking into account the stationarity and Markov assumptions from Section 2, this is equivalent to demanding that the joint probabilities of two succeeding frames coincide.

The joint probability distribution of two succeeding states $\mathbf{h}_t$ and $\mathbf{h}_{t+1}$ is zero-mean Gaussian and, since $\mathrm{Cov}(\mathbf{h}_{t+1}, \mathbf{h}_t) = A \, \mathrm{Cov}(\mathbf{h}_t) = A$, the following relation holds:

$$ \begin{bmatrix} \mathbf{h}_t \\ \mathbf{h}_{t+1} \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} h_t \\ h_{t+1} \end{bmatrix};\, 0,\, \begin{bmatrix} I & A^\top \\ A & I \end{bmatrix} \right) \quad \forall t. \qquad (4) $$

With this in mind, the problem boils down to finding a function $C$ and a matrix $A$ such that the random variable $\tilde{\mathbf{y}}^2 = [C(\mathbf{h}_1), C(\mathbf{h}_2)]$ has the same distribution as $\mathbf{y}^2$.

Since a VAE decoder can be trained to map from a standard Gaussian to a data distribution, it makes sense to employ it as the observation function $C$. Unfortunately, a classical VAE does not provide a framework to simultaneously learn the matrix $A$ from sequential data. However, we can still use a VAE to accomplish this task. Let us denote by $f_\theta$ a successfully trained VAE decoder, i.e. a function that transforms samples $x$ of standard Gaussian noise into samples from a high-dimensional data distribution. The variable $\theta$ denotes the entirety of trainable parameters of $f_\theta$. Let us assume that $\theta = (A, B, \eta)$ contains the matrices $A, B \in \mathbb{R}^{n \times n}$ such that

$$ A A^\top + B B^\top = I_n \qquad (5) $$

is satisfied, as well as the weights $\eta$ of the subnetwork $C_\eta$ of $f_\theta$ that implements the observation function $C$. Consider the following definition of $f_\theta$:

$$ f_\theta : \mathbb{R}^{2n} \to \mathbb{R}^{2d}, \quad \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \mapsto \begin{bmatrix} C_\eta(x_1) \\ C_\eta(A x_1 + B x_2) \end{bmatrix}. \qquad (6) $$

If $f_\theta$ is trained to map $\mathbf{x} \sim \mathcal{N}(x; 0, I_{2n})$ to $\tilde{\mathbf{y}}^2 = f_\theta(\mathbf{x})$ with $\tilde{\mathbf{y}}^2 \sim p(\mathbf{y}^2)$, then $A$ and $C = C_\eta$ indeed fulfill the property that the joint probabilities of the random variables $\tilde{\mathbf{y}}^2$ and $\mathbf{y}^2$, as defined above, coincide, provided that the random variables $\mathbf{h}_1, \mathbf{h}_2$ follow the joint probability distribution in Eq. (4).

To model $f_\theta$ as a neural network, we formalize the initial transformation of $x$ in Eq. (6) as a multiplication from the left with

$$ F = \begin{bmatrix} I_n & 0 \\ A & B \end{bmatrix}. \qquad (7) $$

The function $f_\theta$ can be realized by the neural architecture depicted in Fig. 1. The first layer is linear and implements the multiplication with $F$. We refer to this layer as the dynamic layer.

Fig. 1. Decoder of a Dynamic VAE with a dynamic layer: the input $x$ is multiplied with $F$, the result is split into $h_1$ and $h_2$, both halves are passed through $C_\eta$, and the outputs are joined into $[\tilde{y}_1^\top, \tilde{y}_2^\top]^\top$.

The output of the dynamic layer is divided into an upper half $h_1$ and a lower half $h_2$, and both halves are fed to the subnetwork that implements $C_\eta$. The weights of the dynamic layer contain the matrices $A, B$. Thus, they can be trained along with $\eta$ by back-propagation of the stochastic gradient computed from pairs of succeeding video frames. When the architecture is employed as the decoder of a VAE, the parameters $A$ and $\eta$ are implicitly trained to satisfy the desired requirements. However, we need to make sure that the stationarity constraint in Eq. (5) is not violated. We have observed that this can be effectively done by adding a simple regularizer to the loss function of the VAE. Let $L(\theta, \vartheta)$ denote the variational lower bound of the VAE, with $\theta, \vartheta$ denoting the trainable parameters of the decoder and encoder, respectively. The loss function of our model is given as

$$ \tilde{L}(\theta, \vartheta) = L(\theta, \vartheta) + \lambda \, \| A A^\top + B B^\top - I_n \|_F, \qquad (8) $$

where $\lambda > 0$ is a regularizer constant. We refer to the resulting model, consisting of a VAE with a dynamic layer and a stationarity regularizer, as the Dynamic VAE (DVAE).
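As an illustration, the dynamic layer of Eqs. (6) and (7) and the regularizer of Eq. (8) can be realized in PyTorch along the following lines. This is our own minimal sketch, not the reference implementation [23]; C_eta stands for any decoder network mapping R^n to R^d.

    import torch
    import torch.nn as nn

    class DynamicDecoder(nn.Module):
        """Sketch of the decoder f_theta in Eq. (6): the input x = [x1; x2]
        is mapped to [C_eta(x1); C_eta(A x1 + B x2)] via the dynamic layer."""

        def __init__(self, C_eta, n):
            super().__init__()
            self.C_eta = C_eta  # observation subnetwork C_eta: R^n -> R^d
            # Initialize A, B such that A A^T + B B^T = I_n holds exactly.
            self.A = nn.Parameter(0.9 * torch.eye(n))
            self.B = nn.Parameter((1.0 - 0.9 ** 2) ** 0.5 * torch.eye(n))

        def forward(self, x):
            x1, x2 = x.chunk(2, dim=-1)          # split the latent input
            h1 = x1
            h2 = x1 @ self.A.T + x2 @ self.B.T   # h2 = A h1 + B x2
            # Join the two decoded halves into [y1~; y2~].
            return torch.cat([self.C_eta(h1), self.C_eta(h2)], dim=-1)

        def stationarity_penalty(self):
            # Frobenius norm || A A^T + B B^T - I_n ||_F from Eq. (8).
            gram = self.A @ self.A.T + self.B @ self.B.T
            eye = torch.eye(gram.shape[0], device=gram.device, dtype=gram.dtype)
            return torch.linalg.matrix_norm(gram - eye)

During training, the penalty is scaled by λ and added to the negative variational lower bound computed on pairs of succeeding frames, mirroring Eq. (8).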
4. EXPERIMENTS
One way to evaluate a generative video model is to generate synthetic sequences of visual processes from the trained model and to assess how well their statistics correspond to those of the training data. Unfortunately, most quantitative quality measures that are common in video prediction, such as Peak Signal-to-Noise Ratio or Structural Similarity, are unsuitable for evaluating video synthesis, since they favor entirely deterministic models. As a more appropriate quality measure for generative models, the Fréchet Inception Distance (FID) [20] has been widely accepted.

However, even the FID is designed to compare the probability distributions of isolated still-image frames, neglecting the temporal behavior of the process. In addition to the FID score, it is thus necessary to visually inspect the synthesis results. In this section, we aim to do both. Subsection 4.1 provides qualitative results on artificial sequences in order to demonstrate the inference capabilities of our model. Subsection 4.2 additionally provides quantitative comparisons with state-of-the-art methods for dynamic texture synthesis, in particular the recent Spatial-Temporal Generative ConvNet (STGCONV) [16] and the Dynamic Generator Model (DGM) [21], whenever the results were made available. To this end, FID scores are computed via [22].
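For reference, [22] is distributed as the pytorch-fid package (installed with pip install pytorch-fid) and compares two directories of image files; the directory names in the following sketch are placeholders.

    import subprocess

    # Compute the FID between extracted training frames and synthesized
    # frames; both arguments are directories of image files.
    subprocess.run(
        ["python", "-m", "pytorch_fid", "frames/training", "frames/synthesized"],
        check=True,
    )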
Table 1. Experimental configuration: resolution, number of convolutional layers, kernel size, and $\sigma_y$ for each experiment.
Fig. 2. Synthesis of the MNIST sequence 01234... (rows: training, LDS, VAE, DVAE).

Synthesis is performed by sampling from

$$ h_{t+1} = A h_t + B v_t, \quad v_t \sim \text{i.i.d.}\ \mathcal{N}(v_t; 0, I_n), \qquad (9) $$

and mapping the latent states $h_t$ to the observation space by means of $C_\eta$. The initial latent state $h_1$ is estimated by applying the encoder to a frame pair from the training sequence. We train $C_\eta$ and $A$ simultaneously by means of our Dynamic VAE framework introduced in Section 3. As the encoder of the VAE, we use the discriminator of the DCGAN, but adapt the number of output channels to be $n$, where $n$ is the latent dimension of the model. As the decoder, we employ the DCGAN generator. The latent dimension is set to $n = 10$ for all experiments. The number of convolutional layers and the size of the convolutional kernel of the decoder output layer (encoder input layer) varies to match the resolution of the input data. We set the regularizer constant to $\lambda = 100$. The exact configuration is listed in Table 1, where $\sigma_y$ denotes the conditional variance of the VAE output distribution. PyTorch code is available at [23].

As an ablation study, we test our model on sequences of MNIST digits. The aim is to see how well the DVAE captures the deterministic (predictable digits) and stochastic (random writing style) aspects of the sequence, compared to an entirely linear system and an autoregressive VAE model in which the VAE and the VAR were learnt separately. Fig. 2 depicts the synthesis result for the digit sequence 0123401234... Our model captures the two essential features of this visual process. On the one hand, the particular digit in a frame is entirely deterministic and can be inferred from the previous frame. On the other hand, the way the digit is drawn is random and unpredictable. By contrast, the separate VAE+VAR model is only able to capture the appearance of the digits and cannot reproduce their ordering, while the linear model generates hardly recognizable frames.
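A minimal sketch of this frame-by-frame synthesis, reusing the hypothetical DynamicDecoder from the sketch in Section 3:

    import torch

    @torch.no_grad()
    def synthesize(decoder, h, num_frames):
        # Iterate Eq. (9): h_{t+1} = A h_t + B v_t with v_t ~ N(0, I_n),
        # decoding every latent state with the observation network C_eta.
        frames = []
        for _ in range(num_frames):
            frames.append(decoder.C_eta(h))
            v = torch.randn_like(h)
            h = h @ decoder.A.T + v @ decoder.B.T
        return torch.stack(frames)

Because each step requires only one linear state update and one decoder pass, arbitrarily long sequences can be generated online at constant cost per frame.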
Fig. 3. Synthesis results for Running Cows (rows: training sequences 1, 2, and 4, LDS, STGCONV, DVAE). Due to space constraints, two of the five training sequences are omitted.

The Running Cows sequences were used by the authors of [16] to demonstrate the synthesis performance of STGCONV. Fig. 3 depicts synthesis results for this sequence. We observe that, although occasional discontinuities occur in the sequence synthesized by the DVAE, the overall running movement is accurately reproduced, and unlike LDS or STGCONV, the DVAE does not introduce artifacts such as additional legs. Our model is also capable of learning from incomplete data when the obstruction mask is given, although the results tend to become less steady over time. Fig. 4 depicts the synthesis of a video obstructed by a 50% salt-and-pepper mask. Fig. 5 depicts the synthesis of a video obstructed by a 50% rectangular mask.
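The text does not spell out how the obstruction mask enters training; a plausible, purely hypothetical variant excludes the obstructed pixels from the Gaussian reconstruction term of the variational lower bound:

    import torch

    def masked_reconstruction_loss(y, y_hat, mask, sigma_y):
        # Hypothetical masked Gaussian reconstruction term: pixels with
        # mask == 0 (obstructed) contribute neither loss nor gradient.
        sq_err = (mask * (y - y_hat)) ** 2
        return sq_err.sum() / (2.0 * sigma_y ** 2)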
We synthesize sequences of eleven dynamic textures that were provided in the supplementary material of [16]. Table 2 summarizes the resulting FID scores. As a demonstration, Fig. 6 depicts the synthesis results for the Fire Pot video. Compared to the STGCONV method, the DVAE yields slightly blurrier results. However, this advantage of STGCONV can be explained by its tendency to reproduce the training sequence. This behavior can be observed in Fig. 6: comparing the training sequence to the STGCONV result indicates that, at each time step, the STGCONV frames are a slightly perturbed version of the training frames, whereas the DVAE produces frames that evolve differently over time while maintaining a natural dynamic. The DGM produces less predictable sequences, but the frame transitions appear less natural, resembling a fading of one frame into the other.

Fig. 4. Synthesizing a sequence learned on data obstructed by a salt-and-pepper mask (rows: training, synthesis).

Fig. 5. Synthesizing a sequence learned on data obstructed by a rectangular mask (rows: training, synthesis).

Fig. 6. Synthesis of the dynamic texture Fire Pot (rows: training, STGCONV, DGM, DVAE).
5. CONCLUSION
We proposed a deep generative model for visual processes based on an autoregressive state space model and a Variational Autoencoder. Despite being based on a dynamic behavior of a mathematically very simple form, our model is capable of reproducing various kinds of highly non-linear and non-Gaussian visual processes and competes with state-of-the-art approaches for dynamic texture synthesis.
Sequence           LDS     STGCONV   DGM     DVAE
Cows               267.0   311.2     -
Flowing Water      221.0   -         -
Boiling Water      -       -         175.8
Sea                108.9   -         -
River              238.4   -         110.1
Mountain Stream    224.9   -         -
Spring Water       333.4   -         235.9
Fountain           241.9   271.9     -
Waterfall          381.4   -         336.1
Washing Machine    -       -         128.9
Flashing Lights    191.0   166.7     257.8
Fire Pot           189.6   172.2     146.3

Table 2. FID scores.

6. REFERENCES

[1] Matthew Johnson, David K. Duvenaud, Alex Wiltschko, Ryan P. Adams, and Sandeep R. Datta, "Composing graphical models with neural networks for structured representations and fast inference," in Advances in Neural Information Processing Systems, 2016, pp. 2946–2954.
[2] Vijay Mahadevan, Weixin Li, Viral Bhalodia, and Nuno Vasconcelos, "Anomaly detection in crowded scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 1975–1981.

[3] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba, "Generating videos with scene dynamics," in Advances in Neural Information Processing Systems, 2016, pp. 613–621.

[4] Gianfranco Doretto, Alessandro Chiuso, Ying Nian Wu, and Stefano Soatto, "Dynamic textures," International Journal of Computer Vision, vol. 51, no. 2, pp. 91–109, 2003.

[5] Diederik P. Kingma and Max Welling, "Auto-encoding variational bayes," arXiv preprint arXiv:1312.6114, 2013.

[6] Carl Doersch, "Tutorial on variational autoencoders," arXiv preprint arXiv:1606.05908, 2016.

[7] Peter Van Overschee and Bart De Moor, "N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems," Automatica, vol. 30, no. 1, pp. 75–93, 1994, special issue on statistical signal processing and control.

[8] Antoni B. Chan and Nuno Vasconcelos, "Classifying video with kernel dynamic textures," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2007, pp. 1–6.

[9] Rakesh Chalasani and Jose C. Principe, "Deep predictive coding networks," arXiv preprint arXiv:1301.3541, 2013.

[10] Wenqian Liu, Abhishek Sharma, Octavia Camps, and Mario Sznaier, "DYAN: A dynamical atoms-based network for video prediction," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 170–185.

[11] Manuel Watter, Jost Springenberg, Joschka Boedecker, and Martin Riedmiller, "Embed to control: A locally linear latent dynamics model for control from raw images," in Advances in Neural Information Processing Systems, 2015, pp. 2746–2754.

[12] Rahul G. Krishnan, Uri Shalit, and David Sontag, "Deep Kalman filters," arXiv preprint arXiv:1511.05121, 2015.

[13] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt, "Deep variational bayes filters: Unsupervised learning of state space models from raw data," arXiv preprint arXiv:1605.06432, 2016.

[14] Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman, "Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks," in Advances in Neural Information Processing Systems, 2016, pp. 91–99.

[15] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov, "Unsupervised learning of video representations using LSTMs," in International Conference on Machine Learning, 2015, pp. 843–852.

[16] Jianwen Xie, Song-Chun Zhu, and Ying Nian Wu, "Synthesizing dynamic patterns by spatial-temporal generative convnet," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7093–7101.

[17] Matthew Tesfaldet, Marcus A. Brubaker, and Konstantinos G. Derpanis, "Two-stream convolutional networks for dynamic texture synthesis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6703–6712.

[18] Bijan Afsari and René Vidal, "The alignment distance on spaces of linear dynamical systems," in 52nd IEEE Annual Conference on Decision and Control (CDC). IEEE, 2013, pp. 1162–1167.

[19] A. Sagel and M. Kleinsteuber, "Alignment distances on systems of bags," IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2018.

[20] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, 2017.

[21] Jianwen Xie, Ruiqi Gao, Zilong Zheng, Song-Chun Zhu, and Ying Nian Wu, "Learning dynamic generator model by alternating back-propagation through time," arXiv preprint arXiv:1812.10587, 2018.

[22] Maximilian Seitzer, "Fréchet inception distance (FID score) in PyTorch," https://github.com/mseitzer/pytorch-fid, 2019.

[23] Alexander Sagel, "Dynamic VAE," https://github.com/alexandersagel/DynamicVAE