Variational Temporal Abstraction
Taesup Kim†, Sungjin Ahn∗, Yoshua Bengio∗
Mila, Université de Montréal; Rutgers University; Kakao Brain
Abstract
We introduce a variational approach to learning and inference of temporally hierarchical structure and representation for sequential data. We propose the Variational Temporal Abstraction (VTA), a hierarchical recurrent state space model that can infer the latent temporal structure and thus perform the stochastic state transition hierarchically. We also propose to apply this model to implement the jumpy-imagination ability in imagination-augmented agent-learning, in order to improve the efficiency of the imagination. In experiments, we demonstrate that our proposed method can model 2D and 3D visual sequence datasets with interpretable temporal structure discovery, and that its application to jumpy imagination enables more efficient agent-learning in a 3D navigation task.
Discovering temporally hierarchical structure and representation in sequential data is the key to many problems in machine learning. In particular, for an intelligent agent exploring an environment, it is critical to learn such spatio-temporal structure hierarchically because it can, for instance, enable efficient option-learning and jumpy future imagination, abilities critical to resolving the sample-efficiency problem (Hamrick, 2019). Without such temporal abstraction, imagination would easily become inefficient; imagine a person planning a one-hour drive from her office to home with future imagination at the scale of every second. It is also biologically evidenced that future imagination is a fundamental function of the human brain (Mullally & Maguire, 2014; Buckner, 2010), which is believed to be implemented via hierarchical coding in the grid cells (Wei et al., 2015).

There have been approaches to learning such hierarchical structure in sequences, such as the HMRNN (Chung et al., 2016). However, as a deterministic model, it has the main limitation that it cannot capture the stochastic nature prevailing in the data. In particular, this is a critical limitation for imagination-augmented agents, because exploring various possible futures according to the uncertainty is what makes the imagination meaningful in many cases. There have also been many probabilistic sequence models that can deal with such stochastic nature in sequential data (Chung et al., 2015; Krishnan et al., 2017; Fraccaro et al., 2017). However, unlike HMRNN, these models cannot automatically discover the temporal structure in the data.

In this paper, we propose the Hierarchical Recurrent State Space Model (HRSSM) that combines the advantages of both worlds: it can discover the latent temporal structure (e.g., subsequences) while also modeling its stochastic state transitions hierarchically.
For its learning and inference, we introduce a variational approximate inference approach to deal with the intractability of the true posterior. We also propose to apply the HRSSM to implement efficient jumpy imagination for imagination-augmented agents. We note that the proposed HRSSM is a generic generative sequence model that is not tied to the specific application to the imagination-augmented agent, but can be applied to any sequential data. In experiments, on 2D bouncing balls and 3D maze exploration, we show that the proposed model can model sequential data with interpretable temporal abstraction discovery. Then, we show that the model can be applied to improve the efficiency of imagination-augmented agent-learning.

∗ Equal advising, † work also done while visiting Rutgers University

The main contributions of the paper are:
1. We propose the Hierarchical Recurrent State Space Model (HRSSM), the first stochastic sequence model that discovers the temporal abstraction structure.
2. We propose the application of HRSSM to imagination-augmented agents so that they can perform efficient jumpy future imagination.
3. In experiments, we showcase the temporal structure discovery and the benefit of HRSSM for agent learning.

In our model, we assume that a sequence X = x_{1:T} = (x_1, ..., x_T) has a latent structure of temporal abstraction that can partition the sequence into N non-overlapping subsequences X = (X_1, ..., X_N). A subsequence X_i has length l_i such that T = ∑_{i=1}^N l_i, and we write L = {l_i}. Unlike previous works (Serban et al., 2017), we treat the number of subsequences N and the lengths of the subsequences L as discrete latent variables rather than given parameters. This makes our model discover the underlying temporal structure adaptively and stochastically.

We also assume that a subsequence X_i is generated from a temporal abstraction z_i and that each observation x_t has an observation abstraction s_t.
The temporal abstraction and the observation abstraction have a hierarchical structure, in the sense that all observations in X_i are governed by the temporal abstraction z_i in addition to the local observation abstraction s_t. As a temporal model, the two abstractions take temporal transitions. The transition of the temporal abstraction occurs only at the subsequence scale, while the observation transition is performed at every time step. This generative process can then be written as follows:

    p(X, S, L, Z, N) = p(N) ∏_{i=1}^N p(X_i, S_i | z_i, l_i) p(l_i | z_i) p(z_i | z_{<i}).
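To make the generative process concrete, here is a minimal NumPy sketch of ancestral sampling from this hierarchy, with simple Gaussian recurrences standing in for the learned transition networks (all functional forms, dimensions, and noise scales below are illustrative assumptions, not the paper's architecture):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4  # latent dimensionality (illustrative)

def sample_sequence(n_subseq=3, max_len=5):
    """Ancestrally sample (X, M) from the hierarchical process: for each
    subsequence i, draw z_i given the previous z, draw a length l_i, then
    roll out s_t given (s_{t-1}, z_i) and decode x_t from s_t."""
    z = np.zeros(D)
    s = np.zeros(D)
    xs, boundaries = [], []
    for i in range(n_subseq):
        z = 0.9 * z + rng.normal(0, 1, D)           # stand-in for p(z_i | z_{<i})
        l_i = int(rng.integers(1, max_len + 1))      # stand-in for p(l_i | z_i)
        for t in range(l_i):
            s = 0.5 * s + 0.5 * z + rng.normal(0, 0.1, D)  # p(s_t | s_{t-1}, z_i)
            xs.append(s + rng.normal(0, 0.05, D))           # p(x_t | s_t)
            boundaries.append(1 if t == 0 else 0)    # m_t marks subsequence starts
    return np.stack(xs), boundaries

X, M = sample_sequence()
print(X.shape, sum(M))  # (T, 4) with sum(M) == 3 subsequence starts
```

Note how the abstract state z changes only n_subseq times while s and x change at every step, which is exactly the two-timescale structure the equation expresses.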
Sequence Decomposition. Inferring the subsequence structure is important because the remaining state inference can then be decomposed into independent subsequences. This sequence decomposition is implemented by the following factorization:

    q(M | X) = ∏_{t=1}^T q(m_t | X) = ∏_{t=1}^T Bern(m_t | σ(φ(X))),

where φ is a convolutional neural network (CNN) applying convolutions over the temporal axis to extract dependencies between neighboring observations. This enables sampling all indicators M independently and simultaneously. Empirically, we found this CNN-based architecture to work better than an RNN-based architecture.
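As a concrete illustration of q(M | X), the sketch below slides a kernel over the temporal axis to score each step's boundary probability σ(φ(X))_t and then samples all indicators m_t independently; the hand-set "difference detector" kernel is an illustrative assumption standing in for the learned CNN φ:

```python
import numpy as np

def temporal_conv_boundaries(X, W, b):
    """Score each time step with a kernel slid over the time axis, map the
    logits through a sigmoid, and draw all m_t as independent Bernoullis.
    X: (T, d) features; W: (k, d) kernel (illustrative, not learned)."""
    k = W.shape[0]
    pad = k // 2
    Xp = np.pad(X, ((pad, pad), (0, 0)))  # same-length output
    logits = np.array([np.sum(Xp[t:t + k] * W) for t in range(X.shape[0])]) + b
    probs = 1.0 / (1.0 + np.exp(-logits))           # sigma(phi(X))
    rng = np.random.default_rng(0)
    m = (rng.random(X.shape[0]) < probs).astype(int)  # independent draws
    return probs, m

# toy sequence whose features jump at t = 4; the kernel responds to change
X = np.concatenate([np.zeros((4, 2)), np.ones((4, 2))])
W = np.array([[-1.0, -1.0], [1.0, 1.0], [0.0, 0.0]])  # difference detector
probs, m = temporal_conv_boundaries(X, W, b=-2.0)
print(np.argmax(probs))  # 4: the change point scores highest
```

Because each m_t depends only on the (shared) convolution output at step t, the whole indicator sequence can be sampled in one parallel pass, which is the practical advantage over an RNN-based decomposer.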
State Inference. State inference is also performed hierarchically. The temporal abstraction posterior q(Z | M, X) = ∏_{t=1}^T q(z_t | M, X) performs inference by encoding the subsequences determined by M and X. To use the same temporal abstraction across the time steps of a subsequence, the distribution q(z_t | M, X) is also conditioned on the boundary indicator m_{t−1}:

    q(z_t | M, X) = δ(z_t = z_{t−1})                  if m_{t−1} = 0 (COPY),
    q(z_t | M, X) = q̃(z_t | ψ^fwd_{t−1}, ψ^bwd_t)     otherwise (UPDATE).

We use the distribution q̃(z_t | ψ^fwd_{t−1}, ψ^bwd_t) to update the state z_t. It is conditioned on all previous observations through ψ^fwd_{t−1} and on the observations of the current subsequence through ψ^bwd_t.

We demonstrate our model on visual sequence datasets to show (1) how sequence data is decomposed into perceptually plausible subsequences without any supervision, (2) how jumpy future prediction is done with temporal abstraction, and (3) how this jumpy future prediction can improve planning as an imagination module in a navigation problem. Moreover, we test conditional generation p(X | X_ctx), where X_ctx = x_{−(T_ctx−1):0} is the context observation of length T_ctx. With the context, we preset the state transition of the temporal abstraction by deterministically initializing c = f_ctx(X_ctx), with f_ctx implemented by a forward RNN.

Figure 3: Left: previously observed (context) data X_ctx. Right: each first row is the input observation sequence X and the second row is the corresponding reconstruction. The sequence decomposer q(M | X) predicts the starting frames of subsequences, indicated by arrows (red squared frames). Subsequences are newly defined when a ball hits the wall and changes color.

We generated a synthetic 2D visual sequence dataset called bouncing balls. The dataset is composed of two colored balls that bounce off the walls of a square box.
Each ball is independently characterized by certain rules: (1) the color of a ball is randomly changed when it hits a wall, and (2) the velocity (a 2D vector) is slightly perturbed at every time step with a small amount of noise. We trained the model to learn 1D state representations, and all observation data x_t are encoded and decoded by convolutional neural networks. During training, the length of the observation sequence X is set to T = 20 and the context length to T_ctx = 5. Hyper-parameters related to sequence decomposition are set to N_max = 5 and l_max = 10.

Our results in Figure 3 show that the sequence decomposer q(M | X) predicts reasonable subsequences by starting a new subsequence when the color of a ball changes or a ball bounces. At the beginning of training, the sequence decomposer is unstable, has large entropy, and tends to define subsequences with a small number of frames. It then learns to increase the length of subsequences; this is controlled by annealing the temperature τ of the Gumbel-softmax from 1.0 towards 0.1. However, without our proposed prior on the temporal structure, the sequence decomposer fails to properly decompose sequences, and our proposed model consequently converges to RSSM.
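The temperature annealing used to sharpen the boundary indicators can be sketched with a relaxed Bernoulli (binary Concrete / Gumbel-softmax) sample. The linear schedule shape below is an assumption; the text only fixes the endpoints 1.0 and 0.1:

```python
import numpy as np

def binary_concrete_sample(logit, tau, rng):
    """Relaxed Bernoulli (binary Concrete) sample: differentiable in the
    logit, and hardening toward {0, 1} as tau -> 0."""
    u = rng.uniform(1e-8, 1 - 1e-8)
    logistic_noise = np.log(u) - np.log(1 - u)
    return 1.0 / (1.0 + np.exp(-(logit + logistic_noise) / tau))

def anneal_tau(step, total_steps, tau_start=1.0, tau_end=0.1):
    """Linear temperature schedule from 1.0 down to 0.1 over training
    (the schedule shape is an assumption; only the endpoints are stated)."""
    frac = min(step / total_steps, 1.0)
    return tau_start + frac * (tau_end - tau_start)

rng = np.random.default_rng(0)
logit = 2.0  # boundary strongly favored at this step
soft = [binary_concrete_sample(logit, anneal_tau(s, 100), rng) for s in (0, 99)]
print(soft)
```

Early in training (τ near 1.0) the relaxed samples are soft, which lets gradients flow through the boundary decisions; late in training (τ near 0.1) they concentrate near 0 or 1, approximating the discrete indicators used at test time.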
However, it can turn around when it arrives near an intersection between hallways. Due to these constraints, an agent without a policy can randomly navigate the maze environment and collect meaningful data. To train an environment model, we collected 1M steps (frames) from the randomly navigating agent and used them to train both RSSM and our proposed HRSSM. For HRSSM, we used the same training setting as for bouncing balls, but with N_max = 5 and l_max = 8 for the sequence decomposition. The corresponding learning curves are shown in Figure 4; both models reached a similar ELBO. This suggests that our model does not lose reconstruction performance while discovering the hierarchical structure.

Figure 4: The learning curves of RSSM and HRSSM: ELBO, KL divergence, and reconstruction loss.

Figure 5: Left: bird's-eye view (map) of the 3D maze with generated navigation paths. White dotted lines indicate the given context path X_ctx, and the corresponding frames are depicted below the map. Solid lines are the generated paths (blue: top, red: bottom) conditioned on the same context. Circles are the starting points of subsequences, where the temporal abstract transitions take place. Right: the generated sequence data shown with its temporal structure. Both generations are conditioned on the same context but different input actions, as indicated. Frame samples on each bottom row are generated with the temporal abstract transition p̃(z_{t'} | c_{t'}), with c_{t'} = f_{z-rnn}(z_{t'−1}, c_{t'−1}); this shows how the jumpy future prediction is done. The other samples on the top rows, which are not necessarily required for future prediction with our proposed HRSSM, are generated from the observation abstraction transition p̃(s_t | h_t), with h_t = f_{s-rnn}(s_{t−1} ‖ z_t, h_{t−1}). The boundaries between subsequences are determined by p(m_t | s_t).
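The jumpy generation loop, c_{t'} = f_{z-rnn}(z_{t'−1}, c_{t'−1}) followed by sampling the next abstract state, can be sketched as below; `f_zrnn` and `sample_z` are illustrative stand-ins for the learned recurrent prior, not the trained networks:

```python
import numpy as np

def jumpy_rollout(c0, z0, n_jumps, f_zrnn, sample_z):
    """Imagination purely at the temporal-abstraction level: one transition
    per subsequence, skipping the per-step observation dynamics entirely."""
    c, z = c0, z0
    trajectory = []
    for _ in range(n_jumps):
        c = f_zrnn(z, c)      # deterministic recurrence c_t' = f(z_{t'-1}, c_{t'-1})
        z = sample_z(c)       # stochastic draw z_t' ~ p(z | c_t')
        trajectory.append(z)
    return trajectory

rng = np.random.default_rng(2)
D = 4
f_zrnn = lambda z, c: np.tanh(0.5 * z + 0.5 * c)  # illustrative recurrence
sample_z = lambda c: c + rng.normal(0, 0.1, D)    # illustrative prior sample
traj = jumpy_rollout(np.zeros(D), np.zeros(D), n_jumps=5,
                     f_zrnn=f_zrnn, sample_z=sample_z)
print(len(traj))  # 5 abstract states, one per imagined subsequence
```

A five-jump rollout here imagines five whole subsequences (e.g., hallways) while an ordinary per-step rollout would need one transition per frame, which is the source of the computational savings discussed below.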
We trained the state transitions to be action-conditioned, which allows us to perform action-controlled imagination. For HRSSM, only the temporal abstraction state transition is action-conditioned, as we aim to execute the imagination only with the jumpy future prediction. The overall sequence generation procedure is described in Figure 5. The temporal structure of the generated sequence shows how the jumpy future prediction works and where the transitions of the temporal abstraction occur. We see that our model learns to treat each hallway as a subsequence and, consequently, to perform jumpy transitions between hallways without repeating or skipping a hallway. In Figure 6, a set of jumpy predicted sequences from the same context X_ctx and different input actions is shown; these can be interpreted as imaginations the agent can use for planning.

Goal-Oriented Navigation. We further use the trained model as an imagination module by augmenting an agent with it to perform goal-oriented navigation. In this experiment, the task is to navigate to a randomly selected goal position within the given life steps. The goal position in the grid map is not provided to the agent; instead, a cropped image around the goal position is given. To reach the goal quickly, the agent is augmented with the imagination model and allowed to execute a rollout over a number of imagination trajectories (i.e., sequences of temporal abstractions) by varying the input actions. Afterward, it decides on the trajectory that best helps it reach the goal. To find the best trajectory, we use a simple strategy: cosine-similarity-based matching between all imagined state representations and the feature of the goal image. The feature extractor for the goal image is jointly trained with the model.
This way, at every time step, we let the agent choose the first action of the best trajectory. (During training, a window image around the agent position is always given as additional observation data, and we train the feature extractor by maximizing the cosine similarity between the extracted feature and the state representation of the corresponding time step.) This approach can be considered a simple variant of Monte Carlo Tree Search (MCTS), and the detailed overall procedure can be found in the Appendix. Each episode is defined by randomly initializing the agent position and the goal position. The agent is allowed a maximum of 100 steps to reach the goal, and the final reward is defined as the number of remaining steps when the agent reaches the goal or consumes all of its life steps. The performance highly depends on the accuracy and the computational efficiency of the model, and we therefore compare RSSM and HRSSM while varying the length of the imagined trajectories. We measure performance over 5000 randomly generated episodes and show how each setting performs across the episodes by plotting the reward distribution in Figure 8. HRSSM significantly improves the performance compared to RSSM under the same computational budget.

Figure 6: Jumpy predicted sequences conditioned on the same context X_ctx and different input actions.

Figure 7: Fully generated sequences conditioned on the same context X_ctx and the same input actions: the generated paths are equal, but the viewpoint and the lengths of subsequences vary (red squared frames are jumpy future predictions).

Figure 8: Goal-oriented navigation with different lengths of imagined trajectories.

HRSSM showed consistent performance over different lengths of imagined trajectories, and most episodes were solved within 50 steps. We believe this is because HRSSM is able to abstract multiple time steps within a single state transition, which reduces the computational cost of imagination.
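The cosine-similarity trajectory selection can be sketched as below; the `imagine` interface and the toy dynamics are illustrative assumptions standing in for rollouts through the trained model:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def plan_first_action(action_seqs, imagine, goal_feature):
    """Score every imagined trajectory by the best cosine similarity of any
    of its states to the goal-image feature, then return the first action
    of the best-scoring trajectory."""
    best_score, best_first = -np.inf, None
    for A in action_seqs:
        states = imagine(A)  # list of imagined state vectors for actions A
        score = max(cosine(s, goal_feature) for s in states)
        if score > best_score:
            best_score, best_first = score, A[0]
    return best_first

# toy model: action 0 drifts toward the goal direction, action 1 away from it
goal = np.array([1.0, 0.0])
def imagine(A):
    s = np.array([0.1, 0.1])
    out = []
    for a in A:
        s = s + (goal if a == 0 else -goal) * 0.5
        out.append(s.copy())
    return out

seqs = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(plan_first_action(seqs, imagine, goal))  # 0
```

With jumpy imagination, each state in `states` stands for a whole subsequence, so a short list of abstract states can cover as much of the future as a much longer per-frame rollout.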
The results also show that finding the best trajectory becomes more difficult as the imagination length grows, i.e., as the number of possible imagination trajectories increases. This suggests that imagination with temporal abstraction can benefit both the accuracy and the computational efficiency in effective ways.

In this paper, we introduced the Variational Temporal Abstraction (VTA), a generic generative temporal model that can discover the hierarchical temporal structure and its stochastic hierarchical state transitions. We also proposed to use this temporal abstraction for temporally-extended future imagination in imagination-augmented agent-learning. Experimental results show that, in general sequential data modeling, the proposed model discovers plausible latent temporal structures and performs hierarchical stochastic state transitions. Also, in connection with a model-based imagination-augmented agent on a 3D navigation task, we demonstrated the potential of the proposed model for improving the efficiency of agent-learning.

Acknowledgments

We would like to acknowledge the Kakao Brain cloud team for providing the computing resources used in this work. TK would also like to thank colleagues at Mila, Kakao Brain, and the Rutgers Machine Learning Group. SA is grateful to Kakao Brain, the Center for Super Intelligence (CSI), and Element AI for their support. Mila (TK and YB) would also like to thank NSERC, CIFAR, Google, Samsung, Nuance, IBM, Canada Research Chairs, the Canada Graduate Scholarship Program, and Compute Canada.

References

Auger-Méthé, M., Field, C., Albertsen, C. M., Derocher, A. E., Lewis, M. A., Jonsen, I. D., and Flemming, J. M. State-space models' dirty little secrets: even simple linear Gaussian models can have estimation problems. Scientific Reports, 6:26677, 2016.

Bengio, Y., Léonard, N., and Courville, A. C. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013. URL http://arxiv.org/abs/1308.3432.

Buckner, R.
L. The role of the hippocampus in prediction and imagination. Annual Review of Psychology, 61:27–48, 2010.

Buesing, L., Weber, T., Racaniere, S., Eslami, S., Rezende, D., Reichert, D. P., Viola, F., Besse, F., Gregor, K., Hassabis, D., et al. Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006, 2018.

Cho, K., van Merriënboer, B., Gülçehre, Ç., Bahdanau, D., Bougares, F., Schwenk, H., and Bengio, Y. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734, Doha, Qatar, October 2014. Association for Computational Linguistics.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pp. 2980–2988, 2015.

Chung, J., Ahn, S., and Bengio, Y. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016.

Clevert, D., Unterthiner, T., and Hochreiter, S. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1511.07289.

Dai, H., Dai, B., Zhang, Y.-M., Li, S., and Song, L. Recurrent hidden semi-Markov model. 2016.

Fraccaro, M., Kamronn, S., Paquet, U., and Winther, O. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, pp. 3601–3610, 2017.

Ghahramani, Z. and Hinton, G. E. Variational learning for switching state-space models. Neural Computation, 12(4):831–864, 2000.

Gregor, K., Papamakarios, G., Besse, F., Buesing, L., and Weber, T. Temporal difference variational auto-encoder. In International Conference on Learning Representations, 2019.

Hafner, D., Lillicrap, T., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels.
arXiv preprint arXiv:1811.04551, 2018a.

Hafner, D., Lillicrap, T. P., Fischer, I., Villegas, R., Ha, D., Lee, H., and Davidson, J. Learning latent dynamics for planning from pixels. CoRR, abs/1811.04551, 2018b.

Hamrick, J. B. Analogues of mental simulation and imagination in deep learning. Current Opinion in Behavioral Sciences, 29:8–16, 2019.

Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.

Jayaraman, D., Ebert, F., Efros, A., and Levine, S. Time-agnostic prediction: Predicting predictable video frames. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SyzVb3CcFX.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31, pp. 10215–10224. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.pdf.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.

Kipf, T., Li, Y., Dai, H., Zambaldi, V. F., Grefenstette, E., Kohli, P., and Battaglia, P. Compositional imitation learning: Explaining and executing one task at a time. CoRR, abs/1812.01483, 2018. URL http://arxiv.org/abs/1812.01483.

Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state space models. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Linderman, S. W., Miller, A. C., Adams, R. P., Blei, D. M., Paninski, L., and Johnson, M. J. Recurrent switching linear dynamical systems. arXiv preprint arXiv:1610.08466, 2016.

Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables.
In International Conference on Learning Representations, 2017.

Mullally, S. L. and Maguire, E. A. Memory, imagination, and predicting the future: a common brain mechanism? The Neuroscientist, 20(3):220–234, 2014.

Neitz, A., Parascandolo, G., Bauer, S., and Schölkopf, B. Adaptive skip intervals: Temporal abstraction for recurrent dynamical models. In Advances in Neural Information Processing Systems 31, pp. 9816–9826, 2018.

Reddi, S. J., Kale, S., and Kumar, S. On the convergence of Adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.

Serban, I. V., Sordoni, A., Lowe, R., Charlin, L., Pineau, J., Courville, A., and Bengio, Y. A hierarchical latent variable encoder-decoder model for generating dialogues. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

Wei, X.-X., Prentice, J., and Balasubramanian, V. A principle of economy predicts the functional architecture of grid cells. eLife, 4:e08362, 2015.

Yu, S.-Z. Hidden semi-Markov models. Artificial Intelligence, 174(2):215–243, February 2010.

Zheng, X., Zaheer, M., Ahmed, A., Wang, Y., Xing, E. P., and Smola, A. J. State space LSTM models with particle MCMC inference.
arXiv preprint arXiv:1711.11179, 2017.

Appendix A Action-Conditioned Temporal Abstraction State Transition

We implement the action-conditioned state transition p(z_t | a_t, z_{t−1}, m_{t−1}).

Algorithm 1 Goal-oriented navigation with imagination (single episode)
Input: environment env, environment model E, maximum length of imagined trajectories l_img
Output: reward r
  Initialize reward r ← 100
  Initialize environment x ← env.reset()
  Initialize context X_ctx ← [x]
  Sample goal position and extract goal map feature g
  while r > 0 do
    Sample action a from the imagination-based planner given E, X_ctx, g, l_img (Algorithm 2)
    Take action: x ← env.step(a)
    Update context X_ctx ← X_ctx + [x]
    Update reward r ← r − 1
    if current position is at the goal position then break
  return reward r

Algorithm 2 Imagination-based planner
Input: environment model E, previously observed sequence (context) X_ctx, maximum length of imagined trajectories l_img, goal map feature g
Output: action a_max
  Initialize model E with c = f_ctx(X_ctx)
  Initialize d_max ← −∞ and a_max ← None
  Set a list A of all possible action sequences based on l_img
  for each action sequence A in A do
    Get a sequence of states S ∈ R^{l_img × d} by doing imagination with model E conditioned on A
    Compute cosine similarities D ∈ R^{l_img} between all states S and the goal map feature g
    if max(D) > d_max then
      Update d_max ← max(D) and a_max ← A[0]
  return action a_max

Appendix C Implementation Details

For bouncing balls, we define the reconstruction loss (data likelihood) using binary cross-entropy. The images from the 3D maze are pre-processed by reducing the bit depth to 5 bits (Kingma & Dhariwal, 2018), and the reconstruction loss is therefore computed using a Gaussian distribution. We used AMSGrad (Reddi et al., 2018), a variant of Adam, as the optimizer, and all mini-batches contain 64 sequences of length T = 20.
Both the CNN-based encoder and decoder are composed of 4 convolution layers with ELU activations (Clevert et al., 2016). A GRU (Cho et al., 2014) with 128 hidden units is used for all RNNs. The state representations of the temporal abstraction and the observation abstraction are sampled from 8-dimensional diagonal Gaussian distributions.

Appendix D Evidence Lower Bound (ELBO)

We derive the ELBO without considering the recurrent deterministic paths. The log-likelihood log p(X) is bounded as:

    log p(X) = log ∑_M ∫_{Z,S} p(X, Z, S, M)
             ≥ ∑_M ∫_{Z,S} q(Z, S, M | X) log [ p(X, Z, S, M) / q(Z, S, M | X) ]
             = E_{q(Z,S,M|X)}[ log p(X | Z, S) ]   (reconstruction)
               − KL[ q(Z, S, M | X) || p(Z, S, M) ]   (KL divergence).

Decomposing p(X | Z, S) and p(Z, S, M):

    p(X | Z, S) = ∏_t p(x_t | s_t)   (decoder)
    p(Z, S, M) = ∏_t p(z_t | z_{t−1}, m_{t−1}) p(s_t | s_{t−1}, z_t, m_{t−1}) p(m_t | s_t),

where the three factors are the temporal abstract transition, the observation abstract transition, and the boundary prior, respectively.

Decomposing q(Z, S, M | X):

    q(Z, S, M | X) = q(M | X) q(Z, S | M, X)
                   = q(M | X) q(Z | M, X) q(S | Z, M, X)
                   = q(M | X) ∏_t q(z_t | M, X) q(s_t | z_t, M, X),
    q(M | X) = ∏_t q(m_t | X) = ∏_t Bern(m_t | σ(φ(X))).

Reconstruction term in the ELBO. Sampling M ~ q(M | X) and then sampling z_t and s_t from q(z_t | M, X) q(s_t | z_t, M, X) gives the single-sample estimate:

    ∑_M q(M | X) ∫_{Z,S} q(Z, S | M, X) log p(X | Z, S) ≈ ∑_t log p(x_t | s_t).

KL term in the ELBO. Using the decompositions above:

    log q(Z, S, M | X) − log p(Z, S, M)
    = log q(M | X) + ∑_t log [ q(z_t | M, X) q(s_t | z_t, M, X)
                               / ( p(z_t | z_{t−1}, m_{t−1}) p(s_t | s_{t−1}, z_t, m_{t−1}) p(m_t | s_t) ) ].

Taking the expectation under q, and sampling M, z_t, and s_t where an analytic integral is not available, the KL term decomposes into per-timestep terms:

    KL[ q(Z, S, M | X) || p(Z, S, M) ]
    ≈ ∑_t KL( q'(m_t) || p'(m_t) )   (sequence decomposer)
         + KL( q'(z_t) || p'(z_t) )   (temporal abstraction)
         + KL( q'(s_t) || p'(s_t) )   (observation abstraction),

where

    p'(m_t) = p(m_t | s_t),  p'(z_t) = p(z_t | z_{t−1}, m_{t−1}),  p'(s_t) = p(s_t | s_{t−1}, z_t, m_{t−1}),
    q'(m_t) = q(m_t | X),    q'(z_t) = q(z_t | M, X),              q'(s_t) = q(s_t | z_t, M, X).

Appendix E Generated Samples

(a) Subsequences with different z (and different s). (b) Subsequences with the same z (and different s).