Transformation Autoregressive Networks
Junier B. Oliva, Avinava Dubey, Manzil Zaheer, Barnabás Póczos, Ruslan Salakhutdinov, Eric P. Xing, Jeff Schneider
Abstract
The fundamental task of general density estimation p(x) has been of keen interest to machine learning. In this work, we attempt to systematically characterize methods for density estimation. Broadly speaking, most of the existing methods can be categorized as using either: a) autoregressive models to estimate the conditional factors of the chain rule, p(x_i | x_{i-1}, ...); or b) non-linear transformations of variables of a simple base distribution. Based on the study of the characteristics of these categories, we propose multiple novel methods for each category. For example, we propose RNN based transformations to model non-Markovian dependencies. Further, through a comprehensive study over both real world and synthetic data, we show that jointly leveraging transformations of variables and autoregressive conditional models results in a considerable improvement in performance. We illustrate the use of our models in outlier detection and image modeling. Finally, we introduce a novel data driven framework for learning a family of distributions.
1. Introduction
Density estimation is at the core of a multitude of machine learning applications. However, this fundamental task is difficult in the general setting due to issues like the curse of dimensionality. Furthermore, for general data, unlike spatial/temporal data, we do not have known correlations a priori among covariates that may be exploited. For example, image data has known correlations among neighboring pixels that may be hard-coded into a model through convolutions, whereas one must find such correlations in a data-driven fashion with general data.

Computer Science Department, University of North Carolina, Chapel Hill, NC 27599 (work completed while at CMU). Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213. Correspondence to: Junier Oliva <[email protected]>. Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).
Figure 1. The proposed TAN models for density estimation, which jointly leverage non-linear transformations and autoregressive conditionals, show considerable improvement over other methods across datasets of varying dimensions. The scatter plot shows that only utilizing autoregressive conditionals (ARC) without transformations (e.g. existing works like NADE (Uria et al., 2014) and other variants) or only using non-linear transformations (NLT) with simple restricted conditionals (e.g. existing works like NICE (Dinh et al., 2014) and other variants) is not sufficient for all datasets.
In order to model high dimensional data, the main challenge lies in constructing models that are flexible enough while having tractable learning algorithms. A variety of diverse solutions exploiting different aspects of the problem have been proposed in the literature. A large number of methods have considered autoregressive models to estimate the conditional factors p(x_i | x_{i-1}, ..., x_1), for i ∈ {1, ..., d}, in the chain rule (Larochelle & Murray, 2011; Uria et al., 2013; 2016; Germain et al., 2015; Gregor et al., 2014). While some methods directly model the conditionals p(x_i | x_{i-1}, ...) using sophisticated semiparametric density estimates, other methods apply sophisticated transformations of variables x ↦ z and take the conditionals over z to be a restricted, often independent base distribution p(z_i | z_{i-1}, ...) ≈ f(z_i) (Dinh et al., 2014; 2016). Further related works are discussed in Sec. 3. However, looking across a diverse set of datasets, as in Fig. 1, neither of these approaches has the flexibility required to accurately model real world data.

In this paper we take a step back and start from the basics. If we only model the conditionals, the conditional factors p(x_i | x_{i-1}, ...) may become increasingly complicated as i increases to d. On the other hand, if we use a complex transformation with restricted conditionals, then the transformation has to ensure that the transformed variables are independent. This requirement of independence on the transformed variables can be very restrictive. Now note that the transformed space is homeomorphic to the original space, and a simple relationship between the densities of the two spaces exists through the Jacobian. Thus, we can employ conditional modeling on the transformed variables to alleviate the independence requirement, while being able to recover the density in the original space in a straightforward fashion.
In other words, we propose transformation autoregressive networks (TANs), which compose complex transformations and autoregressive modeling of the conditionals. The composition not only increases the flexibility of the model but also reduces the expressive power needed from each of the individual components. This leads to improved performance, as can be seen from Fig. 1.

In particular, we first propose two flexible autoregressive models for modeling conditional distributions: the linear autoregressive model (LAM), and the recurrent autoregressive model (RAM) (Sec. 2.1). Secondly, we introduce several novel transformations of variables: 1) an efficient method for learning a linear transformation on covariates; 2) an invertible RNN-based transformation that directly acts on covariates; 3) an additive RNN-based transformation (Sec. 2.2). Extensive experiments on both synthetic (Sec. 4.1) and real-world (Sec. 4.2) datasets show the power of TANs for capturing complex dependencies between the covariates. We run an ablation study to demonstrate the contributions of the various components of TANs (Sec. 4.3). Moreover, we show that the learned model can be used for anomaly detection (Sec. 4.4) and learning a family of distributions (Sec. 4.5).
2. Transformation Autoregressive Networks
As mentioned above, TANs are composed of two modules: a) an autoregressive module for modeling conditional factors and b) transformations of variables. We first introduce our two proposed autoregressive models to estimate the conditional distribution of input covariates x ∈ R^d. Later, we show how to use such models over a transformation z = q(x), while renormalizing to obtain density values for x.

Autoregressive models decompose density estimation of a multivariate variable x ∈ R^d into multiple conditional tasks on a growing set of inputs through the chain rule:

p(x_1, ..., x_d) = ∏_{i=1}^d p(x_i | x_{i-1}, ..., x_1).   (1)

That is, autoregressive models will estimate the d conditional distributions p(x_i | x_{i-1}, ...). A class of autoregressive models can be defined by approximating conditional distributions through a mixture model, MM(θ(x_{i-1}, ..., x_1)), with parameters θ depending on x_{i-1}, ..., x_1:

p(x_i | x_{i-1}, ..., x_1) = p(x_i | MM(θ(x_{i-1}, ..., x_1))),   (2)
θ(x_{i-1}, ..., x_1) = f(h_i),   (3)
h_i = g_i(x_{i-1}, ..., x_1),   (4)

where f(·) is a fully connected network that may use an element-wise non-linearity on inputs, and g_i(·) is some general mapping that computes a hidden state of features, h_i ∈ R^p, which helps in modeling the conditional distribution of x_i | x_{i-1}, ..., x_1. One can control the flexibility of the model through g_i. It is important for g_i to be powerful enough to model our covariates while still generalizing. In order to achieve this we propose two methods for modeling g_i.

Linear Autoregressive Model (LAM):
This uses a straightforward linear map as g_i in (4):

g_i(x_{i-1}, ..., x_1) = W^{(i)} x_{<i} + b,   (5)

where W^{(i)} ∈ R^{p×(i-1)}, b ∈ R^p, and x_{<i} = (x_{i-1}, ..., x_1)^T. Notwithstanding the simple form of (5), the resulting model is quite flexible, as it may model consecutive conditional problems p(x_i | x_{i-1}, ..., x_1) and p(x_{i+1} | x_i, ..., x_1) very differently owing to different W^{(i)}'s.

Recurrent Autoregressive Model (RAM):
This features a recurrent relation between the g_i's. As the set of covariates is progressively fed into the g_i's, it is natural to consider a hidden state evolving according to an RNN recurrence relationship:

h_i = g(x_{i-1}, g(x_{i-2}, ..., x_1)) = g(x_{i-1}, h_{i-1}).   (6)

In this case g(x, s) is an RNN function for updating one's state based on an input x and prior state s. In the case of gated RNNs, the model will be able to scan through previously seen dimensions, remembering and forgetting information as needed for conditional densities, without making any strong Markovian assumptions.

Both LAM and RAM are flexible and able to adjust the hidden states h_i in (4) to model the distinct conditional tasks p(x_i | x_{i-1}, ...). There is a trade-off of added flexibility and transferred information between the two models. LAM treats the conditional tasks for p(x_i | x_{i-1}, ...) and p(x_{i+1} | x_i, ...) in a largely independent fashion. This makes for a very flexible model; however, the parameter size is also large and there is no sharing of information among the conditional tasks. On the other hand, RAM provides a framework for transfer learning among the conditional tasks by allowing the hidden state h_i to evolve through the distinct conditional tasks. This leads to fewer parameters and more sharing of information across the respective tasks, but also yields less flexibility since conditional estimates are tied, and may only change in a smooth fashion.

Next we introduce the second module of TANs, i.e. the transformations. When using an invertible transformation of variables z = (q_1(x), ..., q_d(x)) ∈ R^d, one can establish a relationship between the pdfs of x and z as:

p(x_1, ..., x_d) = |det (dq/dx)| ∏_{i=1}^d p(z_i | z_{i-1}, ..., z_1),   (7)

where |det (dq/dx)| is the determinant of the Jacobian of the transformation.
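As a concrete sketch of the autoregressive module in (2)–(6), the snippet below evaluates the log-likelihood of a single vector under mixture-of-Gaussians conditionals whose parameters are produced from a RAM-style recurrent hidden state. The tanh recurrence, the linear read-out for f, and all parameter shapes are illustrative assumptions, not the exact architecture used in the paper.

```python
import numpy as np

def mog_logpdf(x, logits, means, log_stds):
    """log p(x | MM(theta)) for a K-component Gaussian mixture, as in eq. (2)."""
    # numerically stable log-softmax over mixture logits
    logw = logits - np.log(np.exp(logits - logits.max()).sum()) - logits.max()
    comp = -0.5 * ((x - means) / np.exp(log_stds)) ** 2 - log_stds - 0.5 * np.log(2 * np.pi)
    a = logw + comp
    return a.max() + np.log(np.exp(a - a.max()).sum())  # log-sum-exp over components

def ram_loglik(x, Wg, Wf, p=8, K=3):
    """sum_i log p(x_i | h_i), with h_i evolving as h_i = g(x_{i-1}, h_{i-1}) (eq. 6)."""
    h, total = np.zeros(p), 0.0
    for xi in x:
        theta = Wf @ h                        # eq. (3): here f is a plain linear read-out
        logits, means, log_stds = theta[:K], theta[K:2 * K], theta[2 * K:]
        total += mog_logpdf(xi, logits, means, log_stds)
        h = np.tanh(Wg @ np.append(h, xi))    # hypothetical tanh state update g(x_i, h_i)
    return total
```

With the read-out weights set to zero, every conditional reduces to an equal-weight mixture of standard normals, i.e. a standard normal, which gives a quick sanity check of the implementation.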
For analytical and computational considerations, we require transformations to be invertible, efficient to compute and invert, and to have a structured Jacobian matrix. In order to meet these criteria we propose the following transformations.

Linear Transformation:
It is an affine map of the form:

z = Ax + b,   (8)

where we take A to be invertible. Note that even though this linear transformation is simple, it includes permutations, and may also perform a PCA-like transformation, capturing coarse and highly varied features of the data before moving to more fine grained details. In order to not incur a high cost for updates, we wish to compute the determinant of the Jacobian efficiently. Thus, we propose to directly work over an LU decomposition A = LU, where L is a lower triangular matrix with unit diagonal and U is an upper triangular matrix with arbitrary diagonal. As a function of L, U we have that det (dz/dx) = ∏_{i=1}^d U_{ii}; hence we may efficiently optimize the parameters of the linear map. Furthermore, inverting our mapping is also efficient through solving two triangular matrix equations.

Recurrent Transformation:
Recurrent neural networks are also a natural choice for variable transformations. Due to their dependence on only previously seen dimensions, RNN transformations have triangular Jacobians, leading to simple determinants. Furthermore, with an invertible output unit, their inversion is also straightforward. We consider the following form of an RNN transformation:

z_i = r_α(y x_i + w^T s_{i-1} + b),   s_i = r(u x_i + v^T s_{i-1} + a),   (9)

where r_α is a leaky ReLU unit, r_α(t) = I{t < 0} αt + I{t ≥ 0} t, r is a standard ReLU unit, s_i ∈ R^ρ is the hidden state, y, u, b, a are scalars, and w, v ∈ R^ρ are vectors. As compared to the linear transformation, the recurrent transformation is able to transform the input with different dynamics depending on its values. Inverting (9) is a matter of inverting outputs and updating the hidden state (where the initial state s_0 is known and constant):

x_i = (1/y)(r_α^{-1}(z_i) − w^T s_{i-1} − b),   s_i = r(u x_i + v^T s_{i-1} + a).   (10)

Furthermore, the determinant of the Jacobian for (9) is the product of diagonal terms:

det (dz/dx) = y^d ∏_{i=1}^d r′_α(y x_i + w^T s_{i-1} + b),   (11)

where r′_α(t) = I{t > 0} + α I{t < 0}.

Recurrent Shift Transformation:
It is worth noting that the rescaling brought on by the recurrent transformation effectively incurs a penalty through the log of the determinant (11). However, one can still perform a transformation that depends on the values of covariates through a shift operation. In particular, we propose an additive shift based on a recurrent function of prior dimensions:

z_i = x_i + m(s_{i-1}),   s_i = g(x_i, s_{i-1}),   (12)

where g is a recurrent function for updating states, and m is a fully connected network. Inversion proceeds as before:

x_i = z_i − m(s_{i-1}),   s_i = g(x_i, s_{i-1}).   (13)

The Jacobian is again lower triangular; however, due to the additive nature of (12), it has a unit diagonal. Thus, det (dz/dx) = 1. One interpretation of this transformation is that one can shift the value of x_k based on x_{k-1}, x_{k-2}, ... for better conditional density estimation, without any penalty coming from the determinant term in (7).

Composing Transformations:
Lastly, we consider stacking (i.e. composing) several transformations q = q^(1) ∘ ... ∘ q^(T) and renormalizing:

p(x_1, ..., x_d) = ∏_{t=1}^T |det (dq^(t)/dq^(t-1))| ∏_{i=1}^d p(q_i(x) | q_{i-1}(x), ..., q_1(x)),   (14)

where we take q^(0) to be x. We note that composing several transformations together allows one to leverage the respective strengths of each transformation. Moreover, inserting a reversal mapping (x_1, ..., x_d ↦ x_d, ..., x_1) as one of the q^(t)'s yields bidirectional relationships.

We combine the use of both transformations of variables and rich autoregressive models by: 1) writing the density of inputs, p(x), as a normalized density of a transformation, p(q(x)), as in (14); then 2) estimating the conditionals of p(q(x)) using an autoregressive model. That is, to learn our model we minimize the negative log likelihood:

−log p(x_1, ..., x_d) = −∑_{t=1}^T log |det (dq^(t)/dq^(t-1))| − ∑_{i=1}^d log p(q_i(x) | h_i),   (15)

which is obtained by substituting (2) into (14), with h_i as defined in (4).
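A minimal sketch of the composition in (14)–(15): each transformation returns its output together with its log-determinant contribution, and the negative log-likelihood adds the conditional term. The stand-in networks m and g and the choice of conditional model are illustrative assumptions, not the paper's trained components.

```python
import numpy as np

def linear_lu(L, U, b):
    """LU-parameterized linear transformation (8): z = L U x + b, log|det| = sum log|U_ii|."""
    def q(x):
        return L @ (U @ x) + b, np.log(np.abs(np.diag(U))).sum()
    return q

def recurrent_shift(m, g, p=4):
    """Recurrent shift transformation (12); unit Jacobian, so log|det| = 0."""
    def q(x):
        s, z = np.zeros(p), np.empty_like(x)
        for i, xi in enumerate(x):
            z[i] = xi + m(s)     # shift by a function of the state built from x_{<i}
            s = g(xi, s)
        return z, 0.0
    return q

def tan_nll(x, transforms, cond_logpdf):
    """Eq. (15): NLL = -(sum of log-determinants) - (autoregressive conditional term)."""
    z, logdet = x, 0.0
    for q in transforms:
        z, ld = q(z)
        logdet += ld
    return -logdet - cond_logpdf(z)
```

Here cond_logpdf may be any function returning ∑_i log p(z_i | z_{i-1}, ..., z_1), e.g. an autoregressive mixture model; the linear stage's log-determinant can be checked against numpy.linalg.slogdet.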
3. Related Works
Nonparametric density estimation has been a well studied problem in statistics and machine learning (Wasserman, 2007). Unfortunately, nonparametric approaches like kernel density estimation suffer greatly from the curse of dimensionality and do not perform well unless the number of dimensions is small. To alleviate this, several semiparametric approaches have been explored. Such approaches include forest density estimation (Liu et al., 2011), which assumes that the data has a forest (i.e. a collection of trees) structured graph. This assumption leads to a density which factorizes in a first order Markovian fashion through a tree traversal of the graph. Another common semiparametric approach is to use a nonparanormal type model (Liu et al., 2009). This approach uses a Gaussian copula with a rank-based transformation and a sparse precision matrix. While both approaches are well-understood theoretically, their strong assumptions lead to inflexible models.

In order to provide greater flexibility with semiparametric models, recent work has employed deep learning for density estimation. The use of neural networks for density estimation dates back to Bishop (1994) and has seen success in speech (Zen & Senior, 2014; Uria, 2015), music (Boulanger-Lewandowski et al., 2012), etc. Typically such approaches use a network to learn the parameters of a parametric model for data. Recent work has also explored the application of deep learning to build density estimates in image data (Oord et al., 2016; Dinh et al., 2016). However, such approaches are heavily reliant on exploiting structure in neighboring pixels, often subsampling, reshaping or re-ordering data, and using convolutions to take advantage of neighboring correlations. Modern approaches for general density estimation in real-valued data include Uria et al. (2013; 2016); Germain et al. (2015); Gregor et al. (2014); Dinh et al. (2014); Kingma et al. (2016); Papamakarios et al.
(2017).

NADE (Uria et al., 2013) is an RBM-inspired density estimator with a weight-sharing scheme across conditional densities on covariates. It may be written as a special case of LAM (5) with tied weights:

g_i(x_{i-1}, ..., x_1) = W_{<i} x_{<i} + b,   (16)

where W_{<i} ∈ R^{p×(i-1)} is the weight matrix composed of the first i−1 columns of a shared matrix W = (w_1, ..., w_d). We note also that LAM and NADE are both related to fully visible sigmoid belief networks (Frey, 1998; Neal, 1992). Even though the weight-sharing scheme in (16) reduces the number of parameters, it also limits the types of distributions one can model. Roughly speaking, the NADE weight-sharing scheme makes it difficult to adjust conditional distributions when expanding the conditioning set with a covariate that has a small information gain. We illustrate this by considering a simple 3-dimensional distribution: x_1 ∼ N(0, 1), x_2 ∼ N(sign(x_1), ε), x_3 ∼ N(I{|x_1| < C}, ε), where C is the width of a confidence interval of a standard Gaussian distribution, and ε > 0 is some small constant. That is, x_2 and x_3 are marginally distributed as equi-weighted bimodal mixtures of Gaussians with means {−1, 1} and {0, 1}, respectively. Due to NADE's weight-sharing linear model, it will be difficult to adjust h_2 and h_3 jointly to correctly model x_2 and x_3, respectively. However, given their additional flexibility, both LAM and RAM are able to adjust hidden states to remember and transform features as needed.

NICE (Dinh et al., 2014) and its successor Real NVP (Dinh et al., 2016) assume that data is drawn from a latent independent Gaussian space and transformed.
The transformation uses several "additive coupling" layers, shifting the second half of dimensions using the first half of dimensions. For example, NICE's additive coupling proceeds by splitting inputs into halves, x = (x_{<d/2}, x_{≥d/2}), and transforming the second half as an additive function of the first half:

z = (x_{<d/2}, x_{≥d/2} + m(x_{<d/2})),   (17)

where m(·) is the output of a fully connected network. Inversion is simply a matter of subtraction: x = (z_{<d/2}, z_{≥d/2} − m(z_{<d/2})). The full transformation is the result of stacking several of these additive coupling layers together, followed by a final rescaling operation. Furthermore, as with the recurrent shift transformation, the additive nature of (17) yields a simple determinant, det (dz/dx) = 1.

MAF (Papamakarios et al., 2017) identified that Gaussian conditional autoregressive models for density estimation can be seen as transformations. This enabled stacking multiple autoregressive models, which increases flexibility. However, stacking Gaussian conditional autoregressive models amounts to just stacking shift and scale transformations. Unlike MAF, in the TAN framework we not only propose novel and more complex transformations such as the recurrent transformation (Sec. 2.2), but also systematically compose stacks of such transformations with flexible autoregressive models.

There are several methods for obtaining samples from an unknown distribution that bypass density estimation. For instance, generative adversarial networks (GANs) apply a (typically noninvertible) transformation of variables to a base distribution by optimizing a minimax loss (Goodfellow, 2016; Kingma et al., 2016). Samples can also be obtained from methods that compose graphical models with deep networks (Johnson et al., 2016; Al-Shedivat et al., 2017).
Furthermore, one can also obtain samples with only limited information about the density of interest using methods such as Markov chain Monte Carlo (Neal, 1993), Hamiltonian Monte Carlo (Neal, 2010), stochastic variants (Dubey et al., 2016), etc.
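For reference, an additive coupling layer in the style of eq. (17) and its inversion can be sketched in a few lines; the network m here is an arbitrary stand-in, not NICE's actual architecture:

```python
import numpy as np

def coupling_forward(x, m):
    """z = (x1, x2 + m(x1)): shift the second half by a function of the first (eq. 17)."""
    h = len(x) // 2
    return np.concatenate([x[:h], x[h:] + m(x[:h])])

def coupling_inverse(z, m):
    """Exact inverse by subtraction; the Jacobian determinant of the layer is 1."""
    h = len(z) // 2
    return np.concatenate([z[:h], z[h:] - m(z[:h])])
```

Because the first half passes through unchanged, stacking layers with reversal permutations in between (as described above) is what lets every dimension eventually be transformed.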
4. Experiments
We now present empirical studies of our TAN framework in order to establish: (i) the superiority of TANs over one-prong approaches (Sec. 4.1); (ii) that TANs are accurate on real world datasets (Sec. 4.2); (iii) the importance of the various components of TANs (Sec. 4.3); and (iv) that TANs are easily amenable to various tasks (Sec. 4.4), such as learning a parametric family of distributions and being able to generalize over unseen parameter values (Sec. 4.5).
Methods
We study the performance of various instantiations of TANs using different combinations of conditional models p(q_i(x) | h_i) and various transformations q(·). In particular the following conditional models were considered: LAM, RAM, Tied, MultiInd, and
SingleInd. Here,
LAM , RAM , and
Tied are as described in equations (5), (6), and (16), respectively.
MultiInd takes p(q_i(x) | h_i) to be p(q_i(x) | MM(θ_i)); that is, we use d distinct independent mixtures to model the transformed covariates.

Figure 2.
RNN+4xSRNN+Re & RAM model samples. Each plot shows a single sample. We plot the sample values of the unpermuted dimensions y_4, ..., y_d | y_1, y_2, y_3 in blue and the expected value of these dimensions (i.e. without the Markovian noise) in green. One may see that the model is able to correctly capture both the sinusoidal and random walk behavior of our data. Similarly,
SingleInd takes p(q_i(x) | h_i) to be p(q_i(x)), the density of a standard single component. For transformations we considered: None, RNN, 2xRNN, 4xAdd+Re, 4xSRNN+Re, RNN+4xAdd+Re, and
RNN+4xSRNN+Re. None indicates that no transformation of variables was performed.
RNN and 2xRNN perform a single recurrent transformation (9), and two recurrent transformations with a reversal permutation in between, respectively. Following (Dinh et al., 2014), 4xAdd+Re performs four additive coupling transformations (17) with reversal permutations in between, followed by a final element-wise rescaling: x ↦ x ∗ exp(s), where s is a learned variable. Similarly, 4xSRNN+Re instead performs four recurrent shift transformations (12). RNN+4xAdd+Re and
RNN+4xSRNN+Re are as before, but performing an initial recurrent transformation. Furthermore, we also considered performing an initial linear transformation (8). We flag this by prepending an L to the transformation; e.g. L RNN denotes a linear transformation followed by a recurrent transformation.
Implementation
Models were implemented in TensorFlow (Abadi et al., 2016). Both the RAM conditional models and the RNN shift transformation make use of the standard GRUCell implementation. We take the mixture models of conditionals (2) to be mixtures of 40 Gaussians. We optimize all models using the
AdamOptimizer (Kingma & Ba, 2014). Training consisted of 30,000 iterations with mini-batches; the learning rate was decreased by a fixed factor (chosen via a validation set) at regular intervals, and gradient clipping was used. After training, the best iteration according to the validation set loss was used to produce the test set results.

To showcase the strengths of TANs and the shortcomings of using only conditional models or only transformations, we carefully construct two synthetic datasets.
Data Generation
Our first dataset, which has a Markovian structure featuring several exploitable correlations among covariates, is constructed as: y_1, y_2, y_3 ∼ N(0, 1) and y_i | y_{i-1}, ..., y_1 ∼ f(i, y_1, y_2, y_3) + ε_i for i > 3, where ε_i ∼ N(ε_{i-1}, σ²), f(i, y_1, y_2, y_3) = y_1 sin(y_2 g_i + y_3), and the g_i's are equi-spaced points on the unit interval. That is, instances are sampled using random draws of amplitude, frequency, and shift covariates y_1, y_2, y_3, which determine the mean of the other covariates, y_1 sin(y_2 g_i + y_3), stemming from function evaluations on a grid, plus random noise ε_i following a Gaussian random walk. The resulting instances contain many correlations, as visualized in Fig. 2. (See https://github.com/lupalab/tan.)

To further exemplify the importance of employing conditionals and transformations in tandem, we construct a second dataset with much fewer correlations. In particular, we use a star-structured graphical model where fringe nodes are very uninformative of each other, and estimating the distribution of the fringe vertices is difficult without conditioning on all the center nodes. To construct the dataset: divide the covariates into disjoint center and vertex sets C = {1, ..., 4}, V = {5, ..., d}, respectively. For center nodes j ∈ C, y_j ∼ N(0, 1). Then, for j ∈ V, y_j ∼ N(f_j(w_j^T y_C), σ²), where f_j is a fixed step function with 32 intervals, w_j ∈ R^4 is a fixed vector, and y_C = (y_1, y_2, y_3, y_4). In both datasets, to test robustness to correlations from distant (by index) covariates, we observe covariates that are shuffled using a fixed permutation π chosen ahead of time: x = (y_{π_1}, ..., y_{π_d}). We take the number of training instances to be 100,000.

Observations
We detail the mean log-likelihoods on a test set for TANs using various combinations of conditional models and transformations in the Appendix, Tab. 2 and Tab. 3, respectively. We see that both
LAM and
RAM conditionals provide most of the top models. We observe good samples from the best performing model, as shown in Fig. 2. Particularly on the second dataset, simpler conditional methods are unable to model the data well, suggesting that the complicated dependencies need a two-prong TAN approach. We observe a similar pattern when learning over the star data with a larger d (see Appendix, Tab. 4).

We performed several real-world data experiments and compared to several state-of-the-art density estimation methods, finding substantially improved performance for TAN.
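For concreteness, a sampler for the first synthetic dataset can be sketched as follows; the noise scale σ and the exact grid spacing are assumptions, since those constants are not reproduced here:

```python
import numpy as np

def sample_markov_sinusoid(n, d, sigma=0.1, seed=0):
    """Sample n instances: y_1..y_3 set amplitude/frequency/shift; the remaining
    dimensions follow y_1*sin(y_2*g_i + y_3) plus a Gaussian random walk."""
    rng = np.random.default_rng(seed)
    g = np.linspace(0.0, 1.0, d - 3)            # equi-spaced grid on the unit interval
    y = np.empty((n, d))
    y[:, :3] = rng.standard_normal((n, 3))
    eps = np.zeros(n)
    for i in range(3, d):
        eps = rng.normal(eps, sigma)            # epsilon_i ~ N(epsilon_{i-1}, sigma^2)
        y[:, i] = y[:, 0] * np.sin(y[:, 1] * g[i - 3] + y[:, 2]) + eps
    return y
```

A fixed permutation of the columns (not shown) would then be applied to obtain the observed covariates x, as described above.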
Datasets
We carefully followed (Papamakarios et al., 2017) and the accompanying code (MAF Git Repository) to ensure that we operated over the same instances and covariates for each of the datasets considered in (Papamakarios et al., 2017). Specifically, we performed unconditional density estimation on four datasets from the UCI machine learning repository (http://archive.ics.uci.edu/ml/). power:

Table 1.
Average test log-likelihood comparison of TANs with baselines MADE, Real NVP, MAF as reported by (Papamakarios et al.,2017). For TANs the best model is picked using validation dataset and are reported here. Parenthesized numbers indicate number oftransformations used. Standard errors with σ are shown. Largest values per dataset are shown in bold . POWER d=6; N=2,049,280
GAS d=8; N=1,052,065
HEPMASS d=21; N=525,123
MINIBOONE d=43; N=36,488
BSDS300 d=63; N=1,300,000
MADE -3.08 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± . ± .
01 12 . ± . − . ± . − . ± .
48 159 . ± .
5x L+ReLU+SRNN+Re& RAM 5x L+ReLU+SRNN+Re& RAM 5x L+ReLU+SRNN+Re& RAM 4xSRNN + Re& RAM 5x L+ReLU+SRNN+Re& RAM
Containing electric power consumption in a household over 47 months. gas: readings of 16 chemical sensors exposed to gas mixtures. hepmass: describing Monte Carlo simulations for high energy physics experiments. miniboone: containing examples of electron neutrinos and muon neutrinos. We also used BSDS300, which was obtained by extracting random 8 × 8 monochrome patches from the BSDS300 dataset of natural images (Martin et al., 2001). These are multivariate datasets from a varied set of sources meant to provide a broad picture of performance across different domains. Here, we used a batch size of 1024 with 60K training iterations. We saw great performance by using multiple successions of a linear transformation, followed by an element-wise leaky transformation (as in eq. 9), a recurrent shift transformation (12), and an element-wise rescale transformation. Thus, in addition, we used a model with 5 such stacked transformations (5x L+ReLU+SRNN+Re). Further, to demonstrate that our proposed models can even be used to model high dimensional data and produce coherent samples, we consider the image modeling task, treating each image as a flattened vector. We consider 28 × 28 grayscale images of MNIST digits and 32 × 32 natural colored images of CIFAR-10. Following Dinh et al. (2014), we dequantize pixel values by adding noise and rescaling.

Figure 3. Samples from the best TAN model.
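The dequantization step follows the usual convention of adding uniform noise to integer pixel values and rescaling; the constants below are the common choice and are an assumption, not taken from the paper:

```python
import numpy as np

def dequantize(pixels, rng):
    """Map integer pixel values in {0, ..., 255} to continuous values in [0, 1)
    by adding uniform noise and rescaling."""
    return (pixels.astype(float) + rng.uniform(size=pixels.shape)) / 256.0
```

Without this step, a continuous density model could place unbounded density on the discrete pixel values, making log-likelihoods arbitrarily large.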
Metric
We use the average test log-likelihood of the best TAN model selected using a validation set and compare to values reported by (Papamakarios et al., 2017) for the MADE (Germain et al., 2015), Real NVP (Dinh et al., 2016), and MAF (Papamakarios et al., 2017) methods for each dataset. For images, we use a transformed version of the test log-likelihood, called bits per pixel, which is more commonly reported. In order to calculate bits per pixel, we need to convert the densities returned by a model back to image space in the range [0, 255], for which we use the same logit mapping provided in Papamakarios et al. (2017,
Appendix E.2).

Figure 4. Bits per pixel (lower is better) using logit transforms on MNIST (d=784; N=70,000) and CIFAR-10 (d=3,072; N=105,000). MADE, Real NVP, and MAF values are as reported by (Papamakarios et al., 2017). The best achieved value is denoted by *.
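Setting aside the logit-mapping correction (which follows Papamakarios et al. (2017), Appendix E.2), the basic conversion from a log-likelihood in nats on [0, 1]-rescaled pixels to bits per pixel can be sketched as:

```python
import numpy as np

def bits_per_pixel(loglik_nats, d):
    """Convert a per-image log-likelihood (nats, pixels rescaled to [0, 1])
    to bits per pixel on the original [0, 255] integer scale."""
    # p_255(x) = p_unit(x / 256) / 256^d, so log p_255 = loglik_nats - d * log(256)
    return (-loglik_nats / d + np.log(256.0)) / np.log(2.0)
```

As a sanity check, a uniform density on [0, 1]^d (log-likelihood 0) maps to 8 bits per pixel, matching the 8-bit pixel range.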
Observations
Tab. 1 and Fig. 4 show our results on various multivariate datasets and images, respectively, with error bars computed over 5 runs. As can be seen, our TAN models considerably outperform other state-of-the-art methods across all multivariate as well as image datasets, justifying our claim of utilizing both complex transformations and conditionals. Furthermore, we plot samples for the MNIST case in Fig. 3. We see that TAN is able to capture the structure of the digits with very few artifacts in the samples, which is also reflected in the likelihoods.
To study how different components of the models affect the log-likelihood, we perform a comprehensive ablation study across different datasets.
Datasets
We used multiple datasets from the UCI machine learning repository (http://archive.ics.uci.edu/ml/) and the Stony Brook outlier detection datasets collection (ODDS, http://odds.cs.stonybrook.edu) to evaluate log-likelihoods on test data. Broadly, the datasets can be divided into: Particle acceleration: the higgs, hepmass, and susy datasets were generated for high-energy physics experiments using Monte Carlo simulations. Music: the music dataset contains timbre features from the million song dataset of mostly commercial western song tracks from the years 1922 to 2011 (Bertin-Mahieux et al., 2011). Word2Vec: wordvecs consists of 3 million words from a Google News corpus, each word represented as a 300 dimensional vector trained using a word2vec model (https://code.google.com/archive/p/word2vec/). ODDS datasets: we used several ODDS datasets: forest, pendigits, satimage2. These are multivariate datasets from a varied set of sources meant to provide a broad picture of performance across anomaly detection tasks. To not penalize models for low likelihoods on outliers in ODDS, we removed anomalies from test sets. As noted in (Dinh et al., 2014), data degeneracies and other corner cases may lead to arbitrarily low negative log-likelihoods. Thus, we remove discrete features, standardize, and add Gaussian noise to training sets.

Figure 5. Ablation study of the various components of TAN. For each dataset and each conditional model, the top transformation is selected using log-likelihoods on a validation set. The picked transformation is reported within the bars for each conditional. * denotes the best model for each dataset picked by validation. The simple conditional MultiInd always lags behind sophisticated conditionals such as LAM and RAM. Test log-likelihoods, listed as (LAM, RAM, Tied, MultiInd, NADE, NICE) with the best per dataset starred: forest (d=10; N=286,048): 2.39, 2.67*, 0.91, 0.75, −0.65, −0.49; pendigits (d=16; N=6,870): 6.92*, 3.90, 1.44, −5.01, 1.44, −6.49; susy (d=18; N=5,000,000): 17.67, 18.94*, 15.40, 12.16, −5.72, 4.25; higgs (d=28; N=11,000,000): −3.40, −0.34*, −8.05, −8.22, −13.88, −15.14; hepmass (d=28; N=10,500,000): 3.91, 4.93*, −0.24, −5.75, −4.95, −11.39; satimage2 (d=36; N=5,803): −1.72, −0.55*, −2.14, −1.57, −9.30, −17.98; music (d=90; N=515,345): −51.57*, −55.66, −58.88, −69.48, −98.05, −83.52; wordvecs (d=300; N=3,000,000): −247*, −272.37, −273.37, −308.15, −278.79, −374.63.

Observations
We report average test log-likelihoods inFig. 5 for each dataset and conditional model for the toptransformations picked on a validation dataset. The tableswith test log-likelihoods for all combinations of conditionalmodels and transformations for each dataset is in AppendixTab. 6-12. We observe that the best performing models inreal-world datasets are those that incorporate a flexible trans-formation and conditional model. In fact, the best modelin each of the datasets considered always has
LAM or RAM autoregressive components. Each row of these tables showthat using a complex conditional is always better than usingrestricted, independent conditionals. Similarly, each columnof the table shows that for a given conditional, it is bet-ter to pick a complex transformation rather than having notransformation. It is interesting to note that many of these https://code.google.com/archive/p/word2vec/ top models also contain a linear transformation. Of course,linear transformations of variables are common to mostparametric models, however they have been under-exploredin the context of autoregressive density estimation. Ourmethodology for efficiently learning linear transformationscoupled with their strong empirical performance encouragestheir inclusion in autoregressive models for most datasets.Finally, we pick the “overall” winning combination of trans-formations and conditionals. For this we compute the frac-tion of the top likelihood achieved by each transformation t and conditional model m for dataset D : s ( t, m, D ) = exp ( l t,m,D )/ max a,b exp ( l a,b,D ) , where l t,m,D is the testlog-likelihood for t, m on D . We then average S over thedatasets: S ( t, m ) = T ∑ D S ( t, m, D ) , where T is the totalnumber of datasets and reported all these score in AppendixTab. 5. This provides a summary of which models performedbetter over multiple datasets. In other words, the closer thisscore is to 1 for a model means the more datasets for whichthe model is the best performer. We see that RAM conditionalwith
L RNN transformation, and
LAM conditional with
LRNN+4xAdd+Re were the two best performers.
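The summary score above can be computed directly from a table of test log-likelihoods. The sketch below is illustrative (the function and variable names are not from the paper's code); it normalizes against the best likelihood per dataset in log space for numerical stability:

```python
import math

def tan_scores(loglik):
    """Per-dataset score s(t, m, D) = exp(l) / max exp(l), then averaged
    over datasets. `loglik` maps dataset -> {(transformation, conditional):
    test log-likelihood}."""
    totals = {}
    for D, table in loglik.items():
        best = max(table.values())
        for tm, l in table.items():
            # exp(l) / exp(best) computed stably as exp(l - best)
            totals[tm] = totals.get(tm, 0.0) + math.exp(l - best)
    return {tm: s / len(loglik) for tm, s in totals.items()}
```

A model that is the top performer on every dataset scores exactly 1; scores decay exponentially with the log-likelihood gap to the per-dataset winner.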
Figure 6. Qualitative samples obtained from TANs for the task of learning a parametric family of distributions, where we treat each category of objects as a family and each point cloud for an object as a sample set. The top row shows unseen test point clouds and the bottom row shows samples produced by TANs for these inputs. The presence of few artifacts in samples of unseen objects indicates a good fit.

Next, we apply our density estimates to anomaly detection. Typically, anomalies or outliers are data points that are unlikely given a dataset. In terms of density estimation, such a task is framed as identifying which instances in a dataset have a low corresponding density. That is, we label an instance x as an anomaly if p̂(x) ≤ t, where t ≥ 0 is some threshold and p̂ is the density estimate based on training data. Note that this approach is trained in an unsupervised fashion. Density estimates were evaluated on test data with anomaly/non-anomaly labels on instances. We used thresholded log-likelihoods on the test set to compute precision and recall. We use the average-precision metric and show our results in Fig. 7. TAN performs best on all three datasets. Beyond providing another interesting use for our density estimates, the good performance on these outlier detection tasks further demonstrates that our models learn semantically meaningful patterns.
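The thresholding rule and evaluation metric described above are straightforward to implement on top of any density estimate. The sketch below uses hypothetical helper names; the average precision shown is the standard ranking-based formulation, with lower density meaning more anomalous:

```python
import math

def flag_anomalies(log_densities, t):
    # Label x as an anomaly when p_hat(x) <= t, i.e. log p_hat(x) <= log t.
    log_t = math.log(t)
    return [ld <= log_t for ld in log_densities]

def average_precision(log_densities, labels):
    # Rank points by ascending density (most anomalous first); `labels`
    # are 1 for true outliers. Average the precision at each outlier rank.
    order = sorted(range(len(log_densities)), key=lambda i: log_densities[i])
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            ap += hits / rank
    return ap / max(1, sum(labels))
```

Sweeping the threshold t traces out the precision-recall curve that the average-precision metric summarizes.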
Figure 7. Average precision scores on the outlier detection datasets forest (O=2,747 outliers), pendigits (O=156), and satimage2 (O=71), comparing NADE, NICE, and TAN. For each dataset, the best performing TAN model, picked using likelihood on a validation set, is shown.

To further demonstrate the flexibility of TANs, we consider a new task: learning a parametric family of distributions jointly. Suppose we have a family of densities P_θ. We assume the training data contains N sets X_1, ..., X_N, where the n-th set X_n = {x_{n,1}, ..., x_{n,m_n}} consists of m_n i.i.d. samples from the density P_{θ_n}; that is, X_n is a set of sample points with x_{n,j} ∼ P_{θ_n} for j = 1, ..., m_n. We assume that we do not have access to the underlying true parameters θ_n. We want to jointly learn the density estimate and a parameterization of the sets, so as to predict even for sets coming from unseen values of θ. We achieve this with a novel approach that models each set X_i with p(⋅ ∣ φ(X_i)), where p is a shared TAN model for the family of distributions and φ(X_i) is a learned embedding (the parameters) of the i-th set obtained with DeepSets (Zaheer et al., 2017). In particular, we use a permutation-invariant DeepSets network parameterized by W′ to extract the embedding φ(X) for a given sample set X. The embedding is then fed, along with the sample set, to a TAN model parameterized by W. We then optimize the following modified objective:

min_{W, W′} − ∑_{i=1}^{N} ∑_{j=1}^{m_i} log p_W(x_{i,j} ∣ φ_{W′}(X_{i ∖ j})).   (18)

We attempt to model point-cloud representations of objects from ShapeNet (Chang et al., 2015). We produce point clouds with 1000 particles each (x, y, z coordinates) from the mesh representations of objects using the Point Cloud Library's sampling routine (Rusu & Cousins, 2011). We consider each category of objects (e.g. aeroplane, chair, car) as a family and each point cloud for an object in the category as a sample set. We train a TAN and show only samples produced for unseen test sets in Fig. 6, as there are neither baselines for this task nor ground truth for likelihood. From the samples, we see that our model is able to capture the structure of different kinds of unseen aeroplanes and chairs, with very few artifacts, which reflects a good fit. Note that this task is subtly different from conditional density estimation, as we do not have access to class/parameter values during training. We also caution users against using this method when the test sample set is very different from the training sets or comes from a different family of distributions.
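A minimal version of the leave-one-out objective of Eq. (18) can be sketched as follows. This is a simplified stand-in, not the paper's implementation: a mean-pooled DeepSets-style embedding plays the role of φ, a unit-variance Gaussian conditional replaces the shared TAN density, and all names and shapes are illustrative.

```python
import numpy as np

def embed(X, W_emb):
    # Permutation-invariant embedding: transform each point, then mean-pool
    # over the set (a stand-in for the learned phi(X) of the paper).
    return np.tanh(X @ W_emb).mean(axis=0)

def set_nll(X, W_emb):
    # Leave-one-out objective of Eq. (18): for each point x_j, the negative
    # log-density of x_j given the embedding of the rest of the set, with a
    # unit-variance Gaussian conditional standing in for the TAN model.
    d = X.shape[1]
    nll = 0.0
    for j in range(len(X)):
        ctx = embed(np.delete(X, j, axis=0), W_emb)  # phi(X without x_j)
        mu = ctx[:d]                                  # predicted mean
        resid = X[j] - mu
        nll += 0.5 * np.sum(resid ** 2 + np.log(2 * np.pi))
    return nll
```

Training would then minimize this loss jointly over the embedding weights and the density model's parameters, summed across all sample sets.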
5. Conclusion
In this work, we showed that we can significantly improve density estimation for real-valued data by jointly leveraging transformations of variables with autoregressive models, and we proposed novel modules for both. We systematically characterized various modules and evaluated their contributions in a comprehensive ablation study. This exercise not only re-emphasized the benefits of joint modeling, but also revealed some straightforward modules and combinations thereof which are empirically good but were missed earlier, e.g. the untied linear conditionals. Finally, we introduced a novel data-driven framework for learning a family of distributions.
Acknowledgements
This research is partly funded by DOE grant DESC0011114, NSF IIS1563887, NIH R01GM114311, NSF IIS1447676, and the DARPA D3M program.
References
Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Al-Shedivat, M., Dubey, A., and Xing, E. P. Contextual explanation networks. arXiv preprint arXiv:1705.10301, 2017.

Bertin-Mahieux, T., Ellis, D. P., Whitman, B., and Lamere, P. The million song dataset. In ISMIR, volume 2, pp. 10, 2011.

Bishop, C. M. Mixture density networks. Technical Report, 1994.

Boulanger-Lewandowski, N., Bengio, Y., and Vincent, P. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. International Conference on Machine Learning, 2012.

Chang, A. X., Funkhouser, T., Guibas, L., Hanrahan, P., Huang, Q., Li, Z., Savarese, S., Savva, M., Song, S., Su, H., et al. ShapeNet: An information-rich 3D model repository. arXiv preprint arXiv:1512.03012, 2015.

Dinh, L., Krueger, D., and Bengio, Y. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Dubey, K. A., Reddi, S. J., Williamson, S. A., Poczos, B., Smola, A. J., and Xing, E. P. Variance reduction in stochastic gradient Langevin dynamics. In Advances in Neural Information Processing Systems, pp. 1154–1162, 2016.

Frey, B. J. Graphical Models for Machine Learning and Digital Communication. MIT Press, 1998.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked autoencoder for distribution estimation. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pp. 881–889, 2015.

Goodfellow, I. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.

Gregor, K., Danihelka, I., Mnih, A., Blundell, C., and Wierstra, D. Deep autoregressive networks. ICML, 2014.

Johnson, M., Duvenaud, D. K., Wiltschko, A., Adams, R. P., and Datta, S. R. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, pp. 2946–2954, 2016.

Kingma, D. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P., Salimans, T., and Welling, M. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.

Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 29–37, 2011.

Liu, H., Lafferty, J., and Wasserman, L. The nonparanormal: Semiparametric estimation of high dimensional undirected graphs. Journal of Machine Learning Research, 10(Oct):2295–2328, 2009.

Liu, H., Xu, M., Gu, H., Gupta, A., Lafferty, J., and Wasserman, L. Forest density estimation. Journal of Machine Learning Research, 12(Mar):907–951, 2011.

MAF Git Repository. https://github.com/gpapamak/maf. Accessed: 2017-12-03.

Martin, D., Fowlkes, C., Tal, D., and Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pp. 416–423. IEEE, 2001.

Neal, R. M. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.

Neal, R. M. Probabilistic inference using Markov chain Monte Carlo methods. 1993.

Neal, R. M. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 54:113–162, 2010.

Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. arXiv preprint arXiv:1705.07057, 2017.

Rusu, R. B. and Cousins, S. 3D is here: Point Cloud Library (PCL). In IEEE International Conference on Robotics and Automation (ICRA), Shanghai, China, May 9–13, 2011.

Uria, B. Connectionist multivariate density-estimation and its application to speech synthesis. 2015.

Uria, B., Murray, I., and Larochelle, H. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pp. 2175–2183, 2013.

Uria, B., Murray, I., and Larochelle, H. A deep and tractable density estimator. In ICML, pp. 467–475, 2014.

Uria, B., Côté, M.-A., Gregor, K., Murray, I., and Larochelle, H. Neural autoregressive distribution estimation. Journal of Machine Learning Research, 17(205):1–37, 2016.

Wasserman, L. All of nonparametric statistics, 2007.

Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep sets. In Advances in Neural Information Processing Systems, pp. 3394–3404, 2017.

Zen, H. and Senior, A. Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 3844–3848. IEEE, 2014.
A. Appendix
Below we detail the results on several datasets using different combinations of transformations and autoregressive conditional models. Each additive coupling transformation uses a fully connected network with two hidden layers of 256 units. RNN transformations use a hidden state with 16 units. SingleInd conditional models model each dimension's conditional as a standard Gaussian. MultiInd models each dimension's conditional as an independent mixture with 40 components (each with mean, scale, and weight parameters). RAM, LAM, and Tied conditional models each have a hidden state with 120 units that is fed through two fully connected layers, each with 120 units, to produce the parameters of mixtures with 40 components. The RAM hidden state is produced by a GRU with 256 units. LAM and Tied hidden states come from a linear map, as discussed above.
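The mixture parameterization described above can be sketched as a simple output head. The code below is only illustrative (a single linear map rather than the paper's two 120-unit layers, and untrained weights); it shows how one hidden state yields the 40 means, scales, and weights of a dimension's conditional mixture:

```python
import numpy as np

def mixture_params(h, W, K=40):
    # Map a hidden state h to a K-component 1-d Gaussian mixture:
    # K means, K scales (softplus keeps them positive), K weights (softmax).
    raw = h @ W                     # W has shape (len(h), 3 * K)
    mu = raw[:K]
    sigma = np.log1p(np.exp(raw[K:2 * K]))
    w = np.exp(raw[2 * K:] - raw[2 * K:].max())
    w /= w.sum()
    return mu, sigma, w

def mixture_logpdf(x, mu, sigma, w):
    # log p(x) = log sum_k w_k * Normal(x; mu_k, sigma_k)
    comp = w * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return np.log(comp.sum())
```

The per-dimension conditional log-densities produced this way sum over dimensions to give the model's total log-likelihood under the chain rule.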
Tables 2–13 report held-out test log-likelihoods for every combination of transformation and conditional model. All of these tables share the same layout: rows are transformations (None, L None, RNN, L RNN, 2xRNN, L 2xRNN, 4xAdd+Re, L 4xAdd+Re, 4xSRNN+Re, L 4xSRNN+Re, RNN+4xAdd+Re, L RNN+4xAdd+Re, RNN+4xSRNN+Re, L RNN+4xSRNN+Re) and columns are conditional models (LAM, RAM, Tied, MultiInd, SingleInd). Superscripts denote rankings of log-likelihoods on the validation set, and parentheses mark the top-10 picks by validation set. Note that NADE is the Tied conditional with the None transformation, and NICE is the Add+Re transformation with the SingleInd conditional. [Numeric table entries were not recoverable in this version; only the captions are reproduced.]

Table 2. Held-out test log-likelihoods for the Markovian dataset.

Table 3. Held-out test log-likelihoods for the star 32d dataset.

Table 4. Held-out test log-likelihoods for the star 128d dataset.

Table 5. Average performance percentage score for each model across all datasets (including a MAX row and column). Note that this measure is not over a logarithmic space.

Table 6. Held-out test log-likelihoods for the forest dataset.

Table 7. Held-out test log-likelihoods for the pendigits dataset.

Table 8. Held-out test log-likelihoods for the susy dataset.

Table 9. Held-out test log-likelihoods for the higgs dataset.

Table 10. Held-out test log-likelihoods for the hepmass dataset.

Table 11. Held-out test log-likelihoods for the satimage2 dataset.

Table 12. Held-out test log-likelihoods for the music dataset.

Table 13. Held-out test log-likelihoods for the wordvecs dataset. Due to time constraints, only models with linear transformations were trained.