Deep Generative Networks For Sequence Prediction
Markus B. Beissinger
A THESIS
in
Computer and Information Science

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Master of Science in Engineering

2017

Lyle H. Ungar
Supervisor of Thesis

Lyle H. Ungar
Graduate Group Chairperson

Acknowledgements

I would like to thank Lyle Ungar for his invaluable insights and methods of thinking throughout advising this thesis. I would also like to thank Mark Liberman (CIS, Linguistics, University of Pennsylvania) and Mitch Marcus (CIS, University of Pennsylvania) for their work on the thesis committee, and Li Yao and Sherjil Ozair (Computer Science and Operations Research, Université de Montréal) for help with the GSN starter code.

Abstract
This thesis investigates unsupervised time series representation learning for sequence prediction problems, i.e. generating nice-looking input samples given a previous history, for high dimensional input sequences by decoupling the static input representation from the recurrent sequence representation. We introduce three models based on Generative Stochastic Networks (GSN) for unsupervised sequence learning and prediction. GSNs are a probabilistic generalization of denoising auto-encoders that learn unsupervised hierarchical representations of complex input data, while being trainable by backpropagation.

The first model, the Temporal GSN (TGSN), uses the latent state variables H learned by the GSN to reduce input complexity such that learning the representations H over time becomes linear. This means a simple linear regression step H_t → H_{t+1} can encode the next set of latent state variables describing the input data in the sequence, learning P(H_{t+1} | H_{t-m}, ..., H_t) for an arbitrary history, or context, window of size m.

The second model, the Recurrent GSN (RNN-GSN), uses a Recurrent Neural Network (RNN) to learn the sequences of GSN parameters H over time. By having the progression of H learned by an RNN instead of through regression like the TGSN, this model can learn sequences with arbitrary time dependencies.

The third model, the Sequence Encoding Network (SEN), is a novel framework for learning deep sequence representations. It uses a hybrid approach of stacking alternating reconstruction generative network layers with recurrent layers, allowing the model to learn a deep representation of complex time dependencies.

Experimental results for these three models are presented on pixels of sequential handwritten digit (MNIST) data, videos of low-resolution bouncing balls, and motion capture data. The main contribution of this thesis is to provide evidence that GSNs are a viable framework to learn useful representations of complex sequential input data, and to suggest a new framework for deep generative models to learn complex sequences by decoupling static input representations from dynamic time dependency representations.

Code can be found at: https://github.com/mbeissinger/recurrent_gsn

Chapter 1: Introduction

Deep learning research has grown in popularity due to its ability to form useful feature representations of highly complex input data. Useful representations are those that disentangle the factors of variation of input data, preserving the information that is ultimately useful for the given machine learning task. In practice, these representations are used as input features to other algorithms, where in the past features would have been constructed by hand. Deep learning frameworks (especially deep convolutional neural networks [13]) have had recent successes for supervised learning of representations for many tasks, creating breakthroughs for both speech and object recognition [18, 12].

Unsupervised learning of representations, however, has had slower progress. These models, mostly Restricted Boltzmann Machines (RBM) [11], auto-encoders [1], and sparse-coding variants [16], suffer from the difficulty of marginalizing across an intractable number of configurations of random variables (observed, latent, or both). Each plausible configuration of latent and observed variables would be a mode in the distribution of interest P(X, H) or P(X) directly, and current methods of inference or sampling are forced to make strong assumptions about these distributions.
Recent advances on the generative view of denoising auto-encoders and generative stochastic networks [8] have alleviated this difficulty by simply learning a local Markov chain transition operator through backpropagation, which is often unimodal (instead of parameterizing the data distribution directly, which is multi-modal). This approach has opened up unsupervised learning of deep representations for many useful tasks, including sequence prediction. Unsupervised sequence prediction and labeling remains an important problem for artificial intelligence (AI), as many types of input data, such as language, video, etc., naturally form sequences, and the vast majority of this data is unlabeled.

This thesis will cover four main topics:

• Chapter 3 provides an overview of deep architectures: a background on representation learning from probabilistic and direct encoding viewpoints. Recent work on generative viewpoints will be discussed as well, showing how denoising auto-encoders can solve the multi-modal problem via learning a Markov chain transition operator.

• Chapter 4 introduces Generative Stochastic Networks: recent work generalizing the denoising auto-encoder framework into GSNs will be explained, as well as how this can be extended to sequence prediction tasks.

• Chapters 5, 6, and 7 describe models using GSNs to learn complex sequential input data.

• Chapter 8 discusses the results of these three models and baselines and why they are able to use deep representations to learn sequence data.

Chapter 2: Related Work
Due to the success of deep architectures on highly complex input data, their application to sequence prediction tasks has been studied extensively in the literature. RBM variants have been the most popular way to apply deep learning models to sequential data.
The Temporal RBM (TRBM) [20] is one of the first frameworks for non-linear sequence models that are more powerful than traditional hidden Markov models or linear systems. It learns multilevel representations of sequential data by adding connections from previous states of the hidden and visible units to the current states. When the RBM is known, the TRBM learns the dynamic biases of the parameters from one set of states to the next. However, inference over variables is still exponentially difficult.
Recurrent Temporal RBMs (RTRBMs) [21] are an extension of the TRBM. They add a secondary learned latent variable H′ that serves to reduce the number of posterior probabilities that need to be considered during inference through a learned generative process. Exact inference can be done easily and gradient learning becomes almost tractable.

Temporal Convolution Machines (TCMs) [14] also build from TRBMs. They make better use of prior states by allowing the time-varying bias of the underlying RBM to be a convolution of prior states with any function. Therefore, the states of the TCM can directly depend on arbitrarily distant past states. This means the complexity of the hidden states is reduced, as a complex Markov sequence in the hidden layer is not necessary. However, inference is still difficult.
The RNN-RBM [10] is similar to the RTRBM. The RNN-RBM adds a recurrent neural network layer that acts as a dynamic state variable u which is dependent on the current input data and the past state variable. This state variable is what then determines the bias parameters of the next RBM in the sequence, rather than just a regression from the latents H.

Sequential Deep Belief Networks (SDBNs) [2, 3] are a series of stacked RBMs that have a Markov interaction over time between each corresponding hidden layer. Rather than adjusting the bias parameters dynamically like TRBMs, this approach learns a Markov transition between the hidden latent variables over time. This allows the hidden layer to model any dependencies between time frames of the observations.
Recursive Neural Networks (RNNs) [19] are a slightly different framework used for sequence labeling in parsing natural language sentences or parsing natural scene images that have recursive structures. RNNs define a neural network that takes two possible input vectors (such as adjoining words in a sentence) and produces a hidden representation vector as well as a prediction score of the representation being the correct merging of the two inputs. These hidden representation vectors can be fed recursively into the RNN to calculate the highest probability recursive structure of the input sequence. RNNs are therefore a supervised algorithm.

Past work has also compared a deep architecture, Sentence-level Likelihood Neural Nets (SLNN), with traditional Conditional Random Fields (CRF) for sequence labeling tasks of Named Entity Recognition and syntactic chunking [23]. Wang et al. found that non-linear deep architectures, compared to linear CRFs, are more effective in low dimensional continuous input spaces, but not in high-dimensional discrete input spaces. They also confirm that distributional representations can be used to achieve better generalization.

While many of these related works perform well on sequential data such as video and language, all of them (except for the RTRBM) still struggle with inference due to the nature of RBMs. Using these sequential techniques on GSNs, which are easy to sample from and perform inference with, has not yet been studied.

Chapter 3: Background: Deep Architectures
Traditional machine learning algorithms' performance depends heavily on the particular features of the data chosen as inputs. For example, document classification (such as marking emails as spam or not spam) can be performed by breaking down the input document into bag-of-words or n-grams as features. Choosing the correct feature representation of input data, or feature engineering, is a way to bring prior knowledge of a domain to increase an algorithm's computational performance and accuracy. To move towards general artificial intelligence, algorithms need to be less dependent on this feature engineering and better learn to identify the explanatory factors of input data on their own [7].
Deep learning frameworks (also known as deep architectures) move in this direction by capturing a good representation of input data through compositions of non-linear transformations. A good representation can be defined as one that disentangles underlying factors of variation for input data [5]. Deep learning frameworks can find useful abstract representations of data across many domains: deep learning has had great commercial success powering most of Google and Microsoft's current speech recognition, image classification, natural language processing, and object recognition systems. Facebook is also planning on using deep learning to understand its users. Deep learning has been so impactful in industry that MIT Technology Review named it a top-10 breakthrough technology of 2013.

The central idea in building a deep architecture is to learn a hierarchy of features one level at a time, where the input to one computational level is the output of the previous level, for an arbitrary number of levels. In contrast, shallow representations (such as regression or support vector machines) go directly from input data to output classification.

One loose analogue for deep architectures is neurons in the brain (a motivation for artificial neural networks): the output of a group of neurons is agglomerated as the input to more neurons to form a hierarchical layer structure. Each layer N is composed of h computational nodes that connect to each computational node in layer N+1.

Figure 3.1: An example deep architecture.

There are two main ways to interpret the computation performed by these layered deep architectures:

• Probabilistic graphical models have nodes in each layer that are considered as latent random variables. In this case, the model is the joint distribution p(x, h) over the input data x and the hidden latent random variables h that describe the input data. These latent random variables describe a distribution over the observed data.

• Direct encoding models have nodes in each layer that are considered as computational units. This means each node h performs some computation (normally nonlinear, such as a sigmoidal function, hyperbolic tangent, or rectifier linear unit) given its inputs from the previous layer.

To illustrate, principal component analysis (PCA) is a simple feature extraction algorithm that can span both of these interpretations. PCA learns a linear transform h = f(x) = W^T x + b, where W is a weight matrix for the inputs x and b is a bias term. The columns of the d_x × d_h matrix W form an orthogonal basis for the d_h orthogonal directions of greatest variance in the input training data x. The result is d_h decorrelated features that make up the representation layer h.

Figure 3.2: Principal component analysis.

From a probabilistic viewpoint, PCA is simply finding the principal eigenvectors of the covariance matrix of the data. PCA finds which features of the input data can explain away the most variance in the data [4]. From an encoding viewpoint, PCA is performing a linear computation over the input data to form a hidden representation h that has a lower dimensionality than the data.

Note that because PCA is a linear transformation of the input x, it cannot really be stacked in layers because the composition of linear operations is just another linear operation.
There would be no abstraction benefit to multiple layers.

To show these two methods of analysis, this section will examine stacking Restricted Boltzmann Machines (RBM) from a probability viewpoint and nonlinear auto-encoders from a direct encoding viewpoint.

A Boltzmann machine is a network of symmetrically-coupled binary random variables or units. This means that it is a fully-connected, undirected graph. This graph can be divided into two parts:

1. The visible binary units x that make up the input data, and

2. The hidden or latent binary units h that explain away the dependencies between the visible units x through their mutual interactions.

Boltzmann machines describe this pattern of interaction through the distribution over the joint space [x, h] with the energy function:

E_Θ^BM(x, h) = −x^T U x − h^T V h − x^T W h − b^T x − d^T h

where the model parameters Θ are {U, V, W, b, d}.

Evaluating conditional probabilities over this fully connected graph ends up being an intractable problem. For example, computing the conditional probability of a hidden variable given the visibles, P(h_i | x), requires marginalizing over all the other hidden variables: a sum over 2^(d_h − 1) configurations. The restriction made is to remove the connections between units within the same layer, i.e., the visible-visible interactions U and the hidden-hidden interactions V.

Figure 3.3: A Restricted Boltzmann Machine.

This gives us an RBM, which is a bipartite graph with the visible and hidden units forming distinct layers. Calculating the conditional distribution P(h_i | x) is readily tractable and factorizes to:

P(h | x) = ∏_i P(h_i | x)

P(h_i = 1 | x) = sigmoid( Σ_j W_ji x_j + d_i )

Very successful deep learning algorithms stack multiple RBMs together, where the hidden units h from the visible input data x become the new input data for another RBM, for an arbitrary number of layers.

Figure 3.4: Stacked RBM.

There are a few drawbacks to the probabilistic approach to deep architectures:

1. The posterior distribution P(h_i | x) becomes incredibly complicated if the model has more than a few interconnected layers. We are forced to resort to sampling or approximate inference techniques to solve the distribution, which has computational and approximation error costs.

2. Calculating this distribution over latent variables still does not give a usable feature vector to train a final classifier to make this algorithm useful for AI tasks. For example, the calculations of these hidden distributions explain the variations over the handwriting digit recognition problem, but they do not give a final classification of a number. Actual feature values are normally derived from the distribution, taking the latent variable's expected value, which are then used as the input to a normal machine learning classifier, such as logistic regression.

To get around the problem of deriving useful feature values, an auto-encoder is a non-probabilistic alternative approach to deep learning where the hidden units produce usable numeric feature values. An auto-encoder directly maps an input x to a hidden layer h through a parameterized closed-form equation called an encoder. Typically, this encoder function is a nonlinear transformation of the input to h in the form:

f_Θ(x) = s_f(W x + b)

This resulting transformation is the feature-vector or representation computed from input x. Conversely, a decoder function is then used to map from this feature space h back to the input space, which results in a reconstruction x′.
This decoder is also a parameterized closed-form equation that is a nonlinear function undoing the encoding function:

g_Θ(h) = s_g(W′ h + d)

In both cases, the nonlinear function s is normally an element-wise sigmoid, hyperbolic tangent nonlinearity, or rectifier linear unit.

Thus, the goal of an auto-encoder is to minimize a loss function over the reconstruction error given the training data. Model parameters Θ are {W, b, W′, d}, with the weight matrix W most often having tied weights such that W′ = W^T.

Stacking auto-encoders in layers is the same process as with RBMs.

Figure 3.5: Stacked auto-encoder.

One disadvantage of auto-encoders is that they can easily memorize the training data (i.e., find the model parameters that map every input seen to a perfect reconstruction with zero error) given enough hidden units h. To combat this problem, regularization is necessary, which gives rise to variants such as sparse auto-encoders, contractive auto-encoders, or denoising auto-encoders. A practical advantage of auto-encoder variants is that they define a simple, tractable optimization objective that can be used to monitor progress.

Denoising auto-encoders [9, 22, 1] are a class of direct encoding models that use synthetic noise over the inputs through a corruption process during training to prevent overfitting and simply learning the identity function. Given a known corruption process C(X̃ | X) to corrupt an observed variable X, the denoising auto-encoder learns the reverse conditional P(X | X̃). Combining this estimator with the known corruption process C, it can recover a consistent estimator of P(X) through a Markov chain that alternates sampling from C(X̃ | X) and P(X | X̃). The basic algorithm is as follows:
Algorithm 1: Generalized Denoising Auto-encoder Training Algorithm
Input: training set D of examples X, a corruption process C(X̃ | X), and a conditional distribution P_Θ(X | X̃) to train.
while training not converged do
    sample training example X ~ D;
    sample corrupted input X̃ ~ C(X̃ | X);
    use (X, X̃) as an additional training example towards minimizing the expected value of −log P_Θ(X | X̃), e.g., by a gradient step with respect to Θ in the encoding/decoding function;
end

The reconstruction distribution P(X | X̃) is easier to learn than the true data distribution P(X) because P(X | X̃) is often dominated by a single or few major modes, where the data distribution P(X) would be highly multimodal and complex. Recent works [1, 9] provide proofs that denoising auto-encoders with arbitrary variables (discrete, continuous, or both), an arbitrary corruption (Gaussian or other; not necessarily asymptotically small), and an arbitrary loss function (as long as it is viewed as a log-likelihood) estimate the score (derivative of the log-density with respect to the input) of the observed random variables.

Another key idea presented in Bengio et al. [9] is walkback training. The walkback process generates additional training examples through a pseudo-Gibbs sampling process from the current denoising auto-encoder Markov chain for a certain number of steps. These additional generated (X, X̃) pairs from the model decrease training time by actively correcting spurious modes (regions of the input data that have been insufficiently visited during training, which may therefore be incorrect in the learned reconstruction distribution). Both increasing the number of training iterations and increasing corruption noise alleviate spurious modes, but walkbacks are the most effective.
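To make Algorithm 1 concrete, the following is a minimal NumPy sketch of the basic denoising training loop (without walkback) for a single-hidden-layer auto-encoder with tied weights, salt-and-pepper corruption, and a cross-entropy loss trained by stochastic gradient descent. The layer size, noise level, and learning rate are illustrative assumptions, not settings taken from the experiments in later chapters.

```python
# Minimal sketch of Algorithm 1: corrupt each example with C(X_tilde | X) and take a
# gradient step on -log P_Theta(X | X_tilde) for a tied-weight, one-hidden-layer model.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, p=0.4):
    """Salt-and-pepper corruption: set a random fraction p of the inputs to 0 or 1."""
    mask = rng.random(x.shape) < p
    noise = rng.integers(0, 2, size=x.shape).astype(float)
    return np.where(mask, noise, x)

def train_dae(data, n_hidden=500, lr=0.1, epochs=10):
    n_visible = data.shape[1]
    W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))  # tied weights: decoder uses W.T
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_visible)
    for _ in range(epochs):                      # "while training not converged"
        for x in rng.permutation(data):          # sample training example X ~ D
            x_tilde = corrupt(x)                 # sample corrupted input X_tilde ~ C(X_tilde | X)
            h = sigmoid(x_tilde @ W + b_h)       # encoder f_Theta(X_tilde)
            x_rec = sigmoid(h @ W.T + b_v)       # decoder g_Theta(h): parameters of P(X | X_tilde)
            # Gradient step on the cross-entropy -log P_Theta(X | X_tilde) with respect to Theta.
            d_out = x_rec - x                    # error at the visible reconstruction
            d_hid = (d_out @ W) * h * (1.0 - h)  # backpropagated error at the hidden layer
            W -= lr * (np.outer(d_out, h) + np.outer(x_tilde, d_hid))
            b_v -= lr * d_out
            b_h -= lr * d_hid
    return W, b_h, b_v
```

In practice the gradient step would be done on mini-batches rather than single examples, but the per-example form above matches the statement of Algorithm 1 most directly.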
Algorithm 2: Walkback Training Algorithm for Denoising Auto-encoders
Input: A given training example X, a corruption process C(X̃ | X), and the current model's reconstruction conditional distribution P_Θ(X | X̃). It also has a hyper-parameter p that controls the number of generated samples.
Output: A list L of additional training examples X̃*.
X* ← X, L ← [];
Sample X̃* ~ C(X̃ | X*);
Sample u ~ Uniform(0, 1);
while u < p do
    Append X̃* to L, so (X, X̃*) will be an additional training example for the denoising auto-encoder;
    Sample X* ~ P_Θ(X | X̃*);
    Sample X̃* ~ C(X̃ | X*);
    Sample u ~ Uniform(0, 1);
end
Append X̃* to L;
Return L;

Chapter 4: General Methodology: Deep Generative Stochastic Networks (GSN)

Generative stochastic networks are a generalization of the denoising auto-encoder and help solve the problem of mixing between many modes as outlined in the Introduction. Each model presented in this thesis uses the GSN framework for learning a more useful abstraction of the input distribution P(X).

Denoising auto-encoders use a Markov chain to learn a reconstruction distribution P(X | X̃) given a corruption process C(X̃ | X) for some data X. Denoising auto-encoders have been shown to be generative models [9], where the Markov chain can be iteratively sampled from:

X_t ~ P_Θ(X | X̃_{t−1})
X̃_t ~ C(X̃ | X_t)

As long as the learned distribution P_Θn(X | X̃) is a consistent estimator of the true conditional distribution P(X | X̃) and the Markov chain is ergodic, then as n → ∞, the asymptotic distribution π_n(X) of the generated samples from the denoising auto-encoder converges to the data-generating distribution P(X) (proof provided in Bengio et al. [9]).
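As a sketch of this generative use of a trained denoising auto-encoder, the loop below alternates the two sampling steps above. It reuses the illustrative corrupt(), sigmoid(), and tied parameters (W, b_h, b_v) from the earlier training sketch; drawing X_t as a Bernoulli sample from the decoder output is an assumption appropriate for binary data such as MNIST pixels.

```python
# Sketch of the denoising auto-encoder Markov chain:
#   X_t       ~ P_Theta(X | X_tilde_{t-1})   (reconstruct)
#   X_tilde_t ~ C(X_tilde | X_t)             (corrupt)
# For an ergodic chain, the visible samples approach the data distribution P(X).
import numpy as np

rng = np.random.default_rng(1)

def sample_dae_chain(x0, W, b_h, b_v, n_steps=100):
    samples = []
    x = x0
    for _ in range(n_steps):
        x_tilde = corrupt(x)                              # X_tilde_t ~ C(X_tilde | X_t)
        h = sigmoid(x_tilde @ W + b_h)
        p_x = sigmoid(h @ W.T + b_v)                      # parameters of P_Theta(X | X_tilde_t)
        x = (rng.random(p_x.shape) < p_x).astype(float)   # X_{t+1} ~ P_Theta(X | X_tilde_t)
        samples.append(x)
    return samples
```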
A few restrictive conditions are necessary to guarantee ergodicity of the Markov chain, requiring C(X̃ | X) > 0 and P(X) > 0. Particularly, a large region V containing any possible X is defined such that the probability of moving between any two points in a single jump C(X̃ | X) must be greater than 0. This restriction requires that P_Θn(X | X̃) has the ability to model every mode of P(X), which is a problem this model was meant to avoid.

To ease this restriction, Bengio et al. [8] prove that using a C(X̃ | X) that only makes small jumps allows P_Θ(X | X̃) to model a small part of the space V around each X̃. This weaker condition means that modeling the reconstruction distribution P(X | X̃) would be easier since it would probably have fewer modes.

However, the jump size σ between points must still be large enough to guarantee that one can jump often enough between the major modes of P(X) to overcome the deserts of low probability: σ must be larger than half the largest distance of low probability between two nearby modes, such that V has at least a single connected component between modes. This presents a tradeoff between the difficulty of learning P_Θ(X | X̃) and the ease of mixing between modes separated by this low probability desert.

While denoising auto-encoders can rely on X_t alone for the state of the Markov chain, GSNs introduce a latent variable H_t that acts as an additional state variable in the Markov chain along with the visible X_t [8]:

H_{t+1} ~ P_Θ(H | H_t, X_t)
X_{t+1} ~ P_Θ(X | H_{t+1})

The resulting computational graph takes the form:

Figure 4.1: GSN computational graph.

The latent state variable H can be equivalently defined as H_{t+1} = f_Θ(X_t, Z_t, H_t), a learned function f with an independent noise source Z_t such that X_t cannot be reconstructed exactly from H_{t+1}. If X_t could be recovered from H_{t+1}, the reconstruction distribution would simply converge to the Dirac at X. Denoising auto-encoders are therefore a special case of GSNs, where f is fixed instead of learned.

GSNs also use the notion of walkback to aid training. The resulting Markov chain of a GSN is inspired by Gibbs sampling, but with stochastic units at each layer that can be backpropagated [17].

Figure 4.2: Unrolled GSN Markov chain.

Similar to RBMs, sequences of GSNs can be learned by a recurrent step in the parameters or latent states over the sequence of input variables. This approach works because deep architectures help solve the multi-modal problem of complex input data explained in the Introduction and can easily mix between many modes. The main mixing problem comes from the complicated data manifold surfaces of the input space; transitioning from one MNIST digit to the next in the input space generally looks like a messy blend of the two numbers in the intermediate steps. As more layers are learned, more abstract features lead to better disentangling of the input data, which ends up unfolding the manifolds to fill a larger part of the representation space. Because these manifolds become closer together, Markov Chain Monte Carlo (MCMC) sampling between them moves more easily between the modes of the input data and creates a much better mixing between the modes.

Figure 4.3: Better mixing via deep architectures [6].

Because the data manifold space becomes less complicated at higher levels of abstraction, transitioning between them over time becomes much easier. This principle enables the models in the following three chapters to learn sequences of complex input data over time.
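The following is a minimal sketch of the GSN transition operator described above for a single hidden layer: the latent update H_{t+1} = f_Θ(X_t, Z_t, H_t) injects Gaussian noise pre- and post-activation, and the next visible state is sampled from P_Θ(X | H_{t+1}). The weight shapes, noise levels, and Bernoulli visible sampling are illustrative assumptions rather than the exact multi-layer architecture used in the later experiments.

```python
# Sketch of one GSN Markov-chain step with an explicit latent state H:
#   H_{t+1} = f_Theta(X_t, Z_t, H_t)   (noisy, so X_t cannot be recovered exactly)
#   X_{t+1} ~ P_Theta(X | H_{t+1})
import numpy as np

rng = np.random.default_rng(2)

def gsn_step(x_t, h_t, W_in, W_rec, W_out, b_h, b_v, sigma=1.0):
    # f_Theta: combine the visible input, the previous latent state, and noise Z_t.
    pre_h = x_t @ W_in + h_t @ W_rec + b_h
    pre_h += rng.normal(0.0, sigma, size=pre_h.shape)    # pre-activation noise
    h_next = np.tanh(pre_h)
    h_next += rng.normal(0.0, sigma, size=h_next.shape)  # post-activation noise
    # P_Theta(X | H_{t+1}): sample the next visible state from the decoder.
    p_x = 1.0 / (1.0 + np.exp(-(h_next @ W_out + b_v)))
    x_next = (rng.random(p_x.shape) < p_x).astype(float)
    return x_next, h_next

def sample_gsn(x0, h0, params, n_steps=50):
    """Unroll the chain and collect the visible samples X_1, ..., X_n."""
    x, h, samples = x0, h0, []
    for _ in range(n_steps):
        x, h = gsn_step(x, h, *params)
        samples.append(x)
    return samples
```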
Chapter 5: Model 1: Temporal GSN (TGSN)

This first approach is most similar to Sequential Deep Belief Networks in that it learns a transition operator between hidden latent states H. This model uses the power of GSNs to learn hidden representations that reduce the complexity of the input data space, making transitions between data manifolds at higher layers of representation much easier to model. Therefore, the transition step of learning H_t → H_{t+1} over time should be less complicated (i.e. only needing a single linear regression step between hidden states). This model trains by alternating over two versions of the dataset:

1. A generative Gibbs sampling pass for k samples on each input in arbitrary order (for the GSN to learn the data manifolds)

2. A real time-sequenced order of the input to learn the regression H_t → H_{t+1}

Alternating between training the GSN parameters on the generative input sequence through Gibbs sampling and learning the hidden state transition operator on the real sequence of inputs allows the model to tune parameters quickly in an expectation-maximization style of training.
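As a sketch of the second (transition) step, the snippet below fits the linear map H_t → H_{t+1} over GSN latent states collected in real sequence order. Solving the regression in closed form with least squares is an illustrative stand-in for the gradient-trained regression layer; H_seq is assumed to already hold one GSN hidden state per timestep.

```python
# Sketch of the TGSN transition operator: a single linear regression step mapping
# the GSN latent state at time t to the predicted latent state at time t+1.
import numpy as np

def fit_transition(H_seq):
    """Fit H_{t+1} ~= H_t @ A + c on consecutive rows of H_seq (timesteps x hidden dims)."""
    H_t, H_next = H_seq[:-1], H_seq[1:]
    X = np.hstack([H_t, np.ones((H_t.shape[0], 1))])   # append a bias column
    coef, *_ = np.linalg.lstsq(X, H_next, rcond=None)
    A, c = coef[:-1], coef[-1]
    return A, c

def predict_next_hidden(h_t, A, c):
    """Predicted H_{t+1}, which is fed back through the GSN decoder to predict X_{t+1}."""
    return h_t @ A + c
```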
While GSNs are inherently recurrent and depend on the previous latent and visible states to determine the current hidden state, H_t ~ P_Θ(H | H_{t−1}, X_{t−1}), this ordering t is generated through the GSN Gibbs sampling process and does not reflect the real sequence of inputs over time. Using this sampling process, GSNs actively mix between modes that are close together in the input space, not the sequential space. For example, a GSN trained on MNIST data will learn to mix well between the modes of digits that look similar in the pixel space - sampling from the digit "4" transitions to a "9", etc.

Figure 5.1: Samples from GSN after 290 training epochs. Good mixing between major modes in the input space.

To learn transitions between sequential modes in the input space, both the sample step t from the GSN sampling process and the sequential input t_real from the input data sequence need to be utilized.

Figure 5.2: Temporal GSN architecture.

This model's training algorithm is similar to expectation maximization (EM): first optimizing GSN parameters over the input data, then learning the transition H_t → H_{t+1} parameters, and repeating. After the initial GSN pass over the data, training the GSN parameters becomes more powerful, as the reconstruction costs of both the current input and the next predicted input are used for computing the gradients.

One difficulty with this training setup is that for the first few epochs the reconstruction cost after predicting the transition operator will be incorrect until the GSN parameters are warmed up. The GSNs converge slowly and can get stuck in bad configurations due to the regression step (which is simple linear regression) being trained on a not-yet-useful hidden representation of the complex input data. If the regression step is trained poorly, it affects the remaining GSN parameter training steps by providing bad sequential predictions for the GSN to attempt to reconstruct. In practice we recommend waiting until the GSN reconstruction cost starts to converge before applying the reconstruction cost of the predicted next step to the GSN training operation.

This algorithm was tested on artificially sequenced MNIST handwritten digit data. The dataset was sequenced by ordering the inputs 0-9 repeating.
Algorithm 3: Model 1 EM Algorithm
Input: training set D of examples X in sequential order, N layers, k walkbacks
Initialize GSN parameters Θ_GSN = {List(weights from one layer to the next), List(bias for layer)};
Initialize transition parameters Θ_transition = {List(weights from previous layer to current), List(bias for layer)};
while training not converged do
    for each input X do
        Sample GSN for k walkbacks, creating k (X, X_recon) training pairs;
        Transition from ending hidden states H to next predicted hidden states H′ with transition parameters Θ_transition;
        Sample GSN again for k walkbacks, creating k (X′, X′_recon) training pairs;
        Train GSN parameters Θ_GSN using these pairs, keeping Θ_transition fixed;
    end
    for each input X do
        Sample GSN for k walkbacks, creating ending hidden states H;
        Transition from ending hidden states H to next predicted hidden states H′ with transition parameters Θ_transition;
        Sample GSN again for k walkbacks, creating the ending (X′, X′_recon) pair;
        Train transition parameters Θ_transition with this pair, keeping Θ_GSN fixed;
    end
end

The GSN uses hyperbolic tangent (tanh) activation with 3 hidden layers of 1500 nodes and sigmoidal activation for the visible layer. For the GSN, Gaussian noise is added pre- and post-activation with a mean of 0 and a sigma of 2, and input corruption noise is salt-and-pepper with p=0.4. Training was performed for 300 iterations over the input data using a batch size of 100, with a learning rate of 0.25, annealing rate of 0.995, and momentum of 0.5.

An interesting result is that the predicted reconstruction of the next digits appears to be close to the average of that digit, which can be explained because the training set of sequences was shuffled and re-ordered after every epoch from the pool of available digits. Differences between the predicted next number and the average number seem to occur when the GSN incorrectly reconstructs the original corrupted input.

Figure 5.3: Model 1 reconstruction of digits and predicted next digits after 300 iterations.

Figure 5.4: Average MNIST training data by digit.

These results provide evidence that the original assumption is correct: the GSN learns representations that disentangle complex input data, which allows a simple regression step to predict the next input in a linear manner. A comparison of results is included in the Discussion.

Sampling in the input space is similar to a GSN:

Figure 5.5: Model 1 sampling after 90 training iterations; smooth mixing between major modes.
Chapter 6: Model 2: Recurrent GSN (RNN-GSN)

While the Temporal GSN (TGSN) works well predicting the next input digit given the current one, it is limited by its regression transition operator learning a linear mapping H_t → H_{t+1} over a given context window. This inherently limits the length and complexity of sequences learnable in the latent space. The Recurrent GSN in this chapter introduces an additional recurrent latent parameter V to learn the sequence of GSN hidden states H over time to tackle this problem.

As Section 5.1 shows, GSNs inherently have a recurrent structure when the Gibbs chain is unrolled. Instead of using Gibbs sampling to generate inputs of the same class, a GSN can use the real time sequence of the input data to train its parameters with respect to the predicted reconstruction of the next input in the sequence. The GSN becomes a generative sequence prediction model rather than a generative data distribution model. This approach is not without drawbacks: the GSN loses its ability to utilize the walkback training principle for creating robust representations by actively seeking out spurious modes in the model. However, this drawback is mitigated with more input data. Further, the GSN loses the ability to mix between modes of the input space. Instead, it mixes between modes of the sequence space, learning to transition to the next most likely input given the current and previous data.

Currently, GSNs use tied weights between layers to make backpropagation easier. However, this approach prohibits the hidden representation from being able to encode sequences. We must untie these weights to consider a GSN as an RNN variant, which makes training more difficult.
Algorithm 4: Untied GSN as an RNN
Input: training data X in sequential order, N layers, k ≥ 2N predictions
Initialize GSN parameters Θ_GSN = {List(weights from one layer to the next higher), List(weights from one layer to the next lower), List(bias for layer)};
for input data x received do
    Sample from the GSN's predicted x′ to create a list of the next k predicted inputs;
    Store these predictions in a memory buffer array of lists;
    Use the current input x to train GSN parameters with respect to the list of predicted x′ through backpropagation;
end

Using the same general training parameters with regards to noise, learning rate, annealing, momentum, epochs, hidden layers, and activation as the TGSN, untying the GSN parameters performs similarly with regards to binary cross-entropy as the TGSN on the artificially sequenced MNIST dataset. For the next immediate predicted number, it achieved a binary cross-entropy of 0.2318. For the predicted number six iterations ahead, it achieved a binary cross-entropy of 0.2268. This cross-entropy is lower because six iterations ahead can utilize higher layers of representation in the GSN due to the way the computational graph is formed.

Even though cross-entropy is similar to the TGSN, reconstruction images paint a different picture. Due to untied weights taking longer to train, the next predicted digits appear worse than the averages produced from the TGSN over the same number of iterations. However, as the number of predictions ahead increases, the digits begin to look more like the averages. This could be explained by further predictions ahead utilizing the higher layers of representation based on the way the computational graph is formed.

Figure 6.1: Untied GSN reconstruction of predicted next digits and predicted digits 3 iterations ahead after 300 iterations.

Learning with untied weights is much slower, but still provides evidence that the hidden layers themselves can learn useful representations for complex input sequences. Looking at the generated samples in Figure 6.2 after 300 training iterations, mixing between sequential modes is evident as the samples appear to be generated in the same 0-9 order as the sequenced data. The quality of the images shown here encourages the use of separate parameters to decouple sequential learning from input representation learning.
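The bookkeeping in Algorithm 4 can be sketched as follows: at every timestep the model emits its next k predicted frames into a buffer, and when the real frame arrives it is paired with every earlier prediction of that frame to form training targets. The model object with predict_next_k() and train_step() methods is a hypothetical placeholder for the untied GSN's sampling and backpropagation calls.

```python
# Sketch of the memory buffer for the untied GSN trained as an RNN (Algorithm 4).
# buffer holds, for each recent timestep, a deque of its remaining future predictions;
# the front of every deque is always the prediction for the current timestep.
from collections import deque

def online_training_loop(stream, model, k):
    buffer = deque()
    for x in stream:                              # x is the current real input frame
        # Gather every earlier prediction of this frame and train against it.
        targets = [preds.popleft() for preds in buffer if preds]
        if targets:
            model.train_step(x, targets)          # hypothetical backprop call
        # Predict the next k frames from the current input and remember them.
        buffer.append(deque(model.predict_next_k(x, k)))   # hypothetical sampling call
        if len(buffer) > k:
            buffer.popleft()                      # the oldest entry is now exhausted
```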
This online model loses the ability to use walkback training to reduce spurious modes. However, walkback could be generalized to the sequential case by sampling from possible variations of past hidden representations that could lead to the current input. Intuitively, this idea comes from the method of explaining current inputs with imperfect memory recall of past inputs. By sampling from the representation layer repeatedly, a series of potentially viable past representations that lead to the current input are created and used to train GSN parameters leading to the current input. This method uses past inputs as context to create viable variations of sequences in the representation space, which in turn acts to create more robust mixing between the modes in the sequence space.

Figure 6.2: Untied GSN sampling after 300 iterations.

The general process for creating sequential walkbacks described here is as follows:
Algorithm 5: Walkbacks for Sequential Input
for k walkbacks do
    Given input x, take a backward step with the GSN using transposed weights and negated bias to create the previous hidden representation H;
    Sample from the hidden representation H to form H′;
    Take a forward step with the GSN using H′ to create x′;
    Use this (x′, x) pair as a training example for the GSN parameters;
end

Figure 6.3: Recurrent GSN architecture. H is GSN hiddens, V is RNN hiddens.

The EM model is easier to train and appears to have better mixing in both the input and sequence spaces compared to the online learning model. However, due to the simple regression step, it is unable to represent complex sequences in the representation space. A more general model is necessary to encode complex data in both input and representation spaces.

Ultimately, this model generalizes the TGSN by alternating between finding a good representation for inputs and a good representation for sequences. Instead of a direct encoding H_t → H_{t+1}, this model learns the encoding of P(H_{t+1} | H_t, ..., H_0). This way, the GSNs can optimize specifically for reconstruction or prediction rather than making the hidden representation learn both. Further, by making the sequence prediction GSN layer recurrent over the top layer of the input reconstruction GSN layer, this system can learn complex, nonlinear sequence representations over the modes of the input space, capturing a very large possibility of sequential data distributions. These two specified layers can then be repeated to form deep, generalized representations of sequential data.

This algorithm also alternates between training the reconstruction GSN parameters and the prediction GSN parameters for transitions.
Algorithm 6: Recurrent GSN Algorithm
Input: training data X from a sequential distribution D
Initialize reconstruction GSN parameters Θ_gsn = {List(weights from one layer to the next), List(bias for layer)};
Initialize transition RNN parameters Θ_rnn = {List(weights from one layer to the next higher), List(weights from one layer to the next lower), List(bias for layer)};
while training not converged do
    for each input X do
        Sample from the reconstruction GSN with walkback using X to create (X_recon, X) pairs for training parameters Θ_gsn;
        Compute the RNN using the hidden representations H from the reconstruction GSN on the input X;
        Store the predicted next hidden representations H′ and use them with sampling from the next reconstruction GSN to train the transition parameters Θ_rnn;
    end
end

The RNN-GSN uses the same general training parameters with regards to noise, learning rate, annealing, momentum, epochs, hidden layers, and activation as the TGSN. In addition, it has one recurrent (LSTM) hidden layer of 3000 units, receiving input from layer 1 and layer 3 of the GSN below it. No sequential walkback steps were performed. The RNN-GSN performed worse with regards to binary cross-entropy of the predicted reconstruction than the TGSN (achieving a score of 0.2695, with the current reconstruction achieving a score of 0.1669). However, the reconstruction and predicted reconstruction after 300 training iterations qualitatively look like the model is learning the correct sequence. Further, because of the additional recurrent layer and parameters, this model should take longer to train, and slower progress toward sequence prediction is expected. Further study of this general model should include hyper-parameter optimization and more training epochs.

Figure 6.4: RNN-GSN reconstruction of current digits and predicted next digits after 300 iterations.

Figure 6.5: RNN-GSN sampling after 300 iterations.
Chapter 7: Model 3: Sequence Encoder Network (SEN)

The TGSN and RNN-GSN models have so far shown the idea of decoupling input representation from sequence representation. However, the complexity of the sequences that can be learned is still limited by the RNN's representational capacity over the input latent space. We can generalize this decoupling idea even further by creating an alternating structure with these input representation and sequence representation layers, inspired by convolutional neural networks with alternating convolutional and dimensionality reduction layers [13]. The Sequence Encoder Network (SEN) stacks these input and sequence representational layers to learn combinations of representations for the sequence dynamics across many layers, enabling a much higher capacity for complex inputs.
The SEN algorithm extends the RNN-GSN by continuing to learn representations on top of the sequence representations V:

1. Use a GSN to learn the generative input representation H1 of the input X
2. Use an RNN to learn the sequence representation V1 over H1

3. Use another GSN to learn a representation H2 over the sequence representations V1
4. Use another RNN to learn the sequence representation V2 over H2
5. Repeat for the desired number of representation layers n to get top-level sequence representations Vn

Intuitively, these extra layers enable the network to represent hierarchical sequence dynamics from learning transitions between sequence representation states. This hierarchical property allows for much longer or more complex time series interactions.

The sequence representations can also be interpreted as attractor networks [15] arranged in a hierarchical manner, learning combinations of local sequence states to form global representations. The first GSN over the learned RNN states V forms a localist-attractor module, where the further layered RNN and GSN hidden states reduce the dimensionality and learn increasingly global representations of the sequence state space.

Because we are essentially stacking RNN-GSN layers, the EM approach of training reconstruction and sequences separately would benefit from layerwise pretraining. For the SEN, we combine the forward passes and train both the reconstruction GSN parameters and sequence RNN parameters at the same time to avoid this issue.

While the SEN presented here uses GSN and RNN layers, it can be implemented as any encoder-decoder model that stores hidden state (VAE, convolutional autoencoder, etc.), together with any recurrent model (LSTM, GRU, etc.) to transition between hidden states used by the decoder, as the sketch below illustrates.
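A minimal sketch of this alternating stack is given below. GSNLayer and RNNLayer are hypothetical placeholder interfaces (encode/decode and step/predict_next_hidden respectively); the point is the wiring of one forward pass that produces the (H_i, H′_i) reconstruction pairs and the predicted next hidden states, which Algorithm 7 then turns into reconstruction and prediction losses.

```python
# Sketch of the SEN forward pass: alternate an encoder-decoder layer with a recurrent
# layer at every level, feeding each level's sequence state to the next level's encoder.
class SEN:
    def __init__(self, gsn_layers, rnn_layers):
        # gsn_layers[i] encodes/decodes level i; rnn_layers[i] models its sequence dynamics.
        self.levels = list(zip(gsn_layers, rnn_layers))

    def step(self, x):
        """One forward pass for a single timestep x."""
        recon_pairs, predicted_next = [], []
        inp = x
        for gsn, rnn in self.levels:
            h = gsn.encode(inp)                    # H_i for this level
            h_recon = gsn.encode(gsn.decode(h))    # H_i' used for the reconstruction loss
            recon_pairs.append((h, h_recon))
            v = rnn.step(h)                        # recurrent sequence state V_i
            predicted_next.append(rnn.predict_next_hidden(v))  # expected H_i at t+1
            inp = v                                # the next level learns over the sequence states
        return recon_pairs, predicted_next
```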
Algorithm 7: SEN Algorithm
Input: training data X from a sequential distribution D
Initialize reconstruction GSN parameters Θ_gsn_i = {List(weights from one layer to the next), List(bias for layer)} for each desired layer i = 1, ..., n;
Initialize transition RNN parameters Θ_rnn_i = {List(weights from one layer to the next higher), List(weights from one layer to the next lower), List(bias for layer)} for each desired layer i = 1, ..., n;
while training not converged do
    for each input X do
        Run X through the SEN, creating H1 through Hn, the n (H_i, H′_i) reconstruction pairs, and the expected next hiddens at t+1 from the RNNs V_i;
        Calculate the reconstruction loss for each H and the prediction loss for each V;
    end
end

Chapter 8: Discussion of Results

The models were evaluated on two standard datasets, videos of bouncing balls and motion capture data, and compared to the RNN-RBM and RTRBM models discussed in Chapter 2 (Related Work) as well as an LSTM.
Compared to the samples generated by the RNN-RBM, it is clear that the GSNframework has an easier time mixing between modes of the input data. It alsoappears to form better reconstructions of the input data. This improvement canbe attributed to a deeper representation of the input space, since the RNN-RBMonly had two layers - one for the RBM and one for the RNN.
Bouncing balls videos

This dataset generates videos of 3 balls bouncing and colliding in a box as described in [10]. The videos have a length of 128 frames with 15x15 resolution of pixels in the range of [0, 1]. Training examples are generated artificially at runtime so each sequence seen is unique, which helps reduce overfitting. The LSTM and Untied GSN models were trained with two layers of 500 hidden units.

CMU motion capture data

This dataset is a series of captured human joint angles, translations, and rotations around the base of the spine as in [21]. There are 3826 samples of 49 real-valued inputs, so input sampling was not used for the GSN and the visible layer had a linear activation. The train set was split as the first 80% of each sequence, with the last 20% forming the test set.

The LSTM and Untied GSN models were trained with two layers of 128 hidden units. The Temporal GSN used two layers of 128 units with tied weights, input salt-and-pepper noise of 0.1, hidden Gaussian noise of 0 mean and 0.5 standard deviation, 4 walkback steps, and a history context window of 3 timesteps. The RNN-GSN had two GSN layers of 128 units with tied weights and 4 walkbacks, and a single layer LSTM with 256 hidden units. The SEN had two GSN layers and two LSTM layers, where the GSN had 2 layers of 128 hidden units with tied weights and 4 walkback steps, and the LSTM had 128 hidden units.

All models were trained on subsequences with length 100 using the Adam optimizer with a learning rate of 0.001, beta1 of 0.9, and beta2 of 0.999. Gradients were scaled to clip the batchwise L2 norm at a maximum of 0.25.

            Bouncing Balls    CMU Motion Capture
LSTM        0.11              -
TGSN        5.57              9.27
RNN-GSN     5.28              11.49
SEN         19.0              50.8

Table 8.1: Mean squared prediction error on bouncing balls videos and motion capture data. RTRBM and RNN-RBM numbers from [10] (https://github.com/sidsig/NIPS-2014).

Notably, the baseline LSTM outperformed all other models on the videos of bouncing balls dataset, achieving a mean frame-level square prediction error of 0.11. The Untied GSN had lower error than the RNN-RBM, but the TGSN and RNN-GSN both did much worse. One possible explanation is that the EM algorithm entered bad RNN state transitions as discussed in Section 5.3. This can be seen in the RNN-GSN frame outputs, which diverged from a good state into a bad representation. Another reason the GSN-based models (except the Untied GSN) performed poorly is the injected salt-and-pepper and Gaussian noise remaining relatively high throughout the process. We would like to explore noise scheduling in the future to help training convergence.

Figure 8.2: RNN-GSN good state.

Figure 8.3: RNN-GSN diverged bad state.

For the CMU motion capture dataset, the Untied GSN model had the lowest mean frame-level square prediction error at 6.90. The LSTM and TGSN were similar in error, and all other models except the SEN outperformed the RTRBM and RNN-RBM baselines. This dataset has a much lower input dimensionality, so the optimization is less likely to diverge using the GSN-based models. Further, the added noise for the GSNs was lower in this experiment than in the bouncing balls video dataset.

Ultimately, the SEN in both experiments was not able to converge. We believe learning reconstructions of higher-level sequence representations without those representations being at a relatively stable starting point leads to high training instability. Future work will explore layer-wise pretraining, or dynamically growing the number of layers to encourage representation stability and training convergence.
Further hyperparameter search for learning rate and gradient clipping should also be performed to help stability.

Figure 8.4: SEN frame predictions after 140 epochs.
Future work will focus on studying the Sequence Encoder Network class of architectures and their training stability. We would like to explore convolutional autoencoders for image-based prediction tasks, and sequence-to-sequence models for language tasks, with GRUs as recurrent layers. Future work should also explore using adversarial loss during reconstruction to help stability and avoid mode collapse over sequences.

Chapter 9: Conclusion
This thesis presents three models using GSNs to learn useful representations of complex input data sequences. It corroborates that deep architectures, such as the related work with RBMs, are extremely powerful ways to learn complex sequences, and that GSNs are an equally viable framework that improves upon training and inference of RBMs. Deep architectures derive most of their power from being able to disentangle the underlying factors of variation in the input data, flattening the data manifolds at higher levels of representation to improve mixing between the many modes.

The Temporal GSN, an EM approach, takes advantage of the GSN's ability to reduce the complexity of the input data at higher layers of representation, allowing simple linear regression to learn sequences of representations over time. This model learns to reconstruct both the current input and the next predicted input. This reconstructed predicted input tends to look like an average of the next inputs in the sequence given the current input.

The Recurrent GSN adds a recurrent hidden state to learn a sequential representation between the GSN's latent spaces. This approach allows more complex time series interactions to be learned than with the TGSN.

The Sequence Encoder Network generalizes the idea behind the Recurrent GSN. By alternating layers of encoder-decoder models that learn reconstructions of the input with recurrent layers that learn reconstruction of future predictions, it models hierarchical representations of both the input and sequence spaces. Training is much more difficult as the number of layers increases.

References
[1] Guillaume Alain, Yoshua Bengio, and Salah Rifai. Regularized auto-encoders estimate local statistics. CoRR, abs/1211.4246, 2012.
[2] Galen Andrew and Jeff Bilmes. Sequential deep belief networks. 2012.
[3] Galen Andrew and Jeff Bilmes. Backpropagation in sequential deep neural networks. In NIPS Deep Learning Workshop, 2013.
[4] Francis R. Bach and Michael I. Jordan. A probabilistic interpretation of canonical correlation analysis. Technical Report 688, University of California, Berkeley, April 2005.
[5] Yoshua Bengio. Deep learning of representations: Looking forward. CoRR, abs/1305.0445, 2013.
[6] Yoshua Bengio. Invited tutorial: Deep learning of representations. In Proceedings of the 27th Canadian Conference on Artificial Intelligence, 2014.
[7] Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Unsupervised feature learning and deep learning: A review and new perspectives. CoRR, abs/1206.5538, 2012.
[8] Yoshua Bengio, Eric Thibodeau-Laufer, and Jason Yosinski. Deep generative stochastic networks trainable by backprop. CoRR, abs/1306.1091, 2013.
[9] Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising auto-encoders as generative models. CoRR, abs/1305.6663, 2013.
[10] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 2012.
[11] Geoffrey E. Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, July 2006.
[12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS 2012: Neural Information Processing Systems, 2012.
[13] Yann LeCun, Leon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.
[14] Alan J. Lockett and Risto Miikkulainen. Temporal convolution machines for sequence learning. Technical Report AI-09-04, Department of Computer Sciences, the University of Texas at Austin, 2009.
[15] Donald W. Mathis and Michael C. Mozer. Conscious and unconscious perception: A computational theory. In Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society, pages 324–328. Erlbaum, 1996.
[16] Marc'Aurelio Ranzato and Yann LeCun. A sparse and locally shift invariant feature extractor applied to document images. In International Conference on Document Analysis and Recognition (ICDAR 2007), 2007.
[17] Danilo J. Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, 2014.
[18] Frank Seide, Gang Li, and Dong Yu. Conversational speech transcription using context-dependent deep neural networks. In Interspeech 2011. International Speech Communication Association, August 2011.
[19] Richard Socher, Cliff C. Lin, Andrew Y. Ng, and Christopher D. Manning. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
[20] Ilya Sutskever and Geoffrey Hinton. Learning multilevel distributed representations for high-dimensional sequences. Technical report, University of Toronto, 2006.
[21] Ilya Sutskever, Geoffrey Hinton, and Graham Taylor. The recurrent temporal restricted Boltzmann machine. In Advances in Neural Information Processing Systems 21 (NIPS 2008), 2008.
[22] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[23] Mengqiu Wang and Christopher D. Manning. Effect of non-linear deep architecture in sequence labeling. In