Structured Disentangled Representations
Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, N. Siddharth, Brooks Paige, Dana H. Brooks, Jennifer Dy, Jan-Willem van de Meent
Babak Esmaeili
Northeastern University [email protected]
Hao Wu
Northeastern University [email protected]
Sarthak Jain
Northeastern University [email protected]
Alican Bozkurt
Northeastern University [email protected]
N. Siddharth
University of Oxford [email protected]
Brooks Paige
Alan Turing Institute, University of Cambridge [email protected]
Dana H. Brooks
Northeastern University [email protected]
Jennifer Dy
Northeastern University [email protected]
Jan-Willem van de Meent
Northeastern University [email protected]
Abstract
Deep latent-variable models learn representations of high-dimensional data in an unsupervised manner. A number of recent efforts have focused on learning representations that disentangle statistically independent axes of variation by introducing modifications to the standard objective function. These approaches generally assume a simple diagonal Gaussian prior and as a result are not able to reliably disentangle discrete factors of variation. We propose a two-level hierarchical objective to control the relative degree of statistical independence between blocks of variables and individual variables within blocks. We derive this objective as a generalization of the evidence lower bound, which allows us to explicitly represent the trade-offs between mutual information between data and representation, KL divergence between representation and prior, and coverage of the support of the empirical data distribution. Experiments on a variety of datasets demonstrate that our objective can not only disentangle discrete variables, but that doing so also improves disentanglement of other variables and, importantly, generalization even to unseen combinations of factors.
Preliminary work. Under review by AISTATS 2019. Do not distribute.
Deep generative models represent data x using a low-dimensional latent variable z (sometimes referred to as a code). The relationship between x and z is described by a conditional probability distribution p_θ(x | z) parameterized by a deep neural network. There have been many recent successes in training deep generative models for complex data types such as images [Gatys et al., 2015, Gulrajani et al., 2017], audio [Oord et al., 2016], and language [Bowman et al., 2016]. The latent code z can also serve as a compressed representation for downstream tasks such as text classification [Xu et al., 2017], Bayesian optimization [Gómez-Bombarelli et al., 2018, Kusner et al., 2017], and lossy image compression [Theis et al., 2017]. The setting in which an approximate posterior distribution q_φ(z | x) is learnt simultaneously with the generative model via optimization of the evidence lower bound (ELBO) is known as a variational autoencoder (VAE), where q_φ(z | x) and p_θ(x | z) represent probabilistic encoders and decoders respectively. In contrast to VAEs, inference and generative models can also be learnt jointly in an adversarial setting [Makhzani et al., 2015, Dumoulin et al., 2016, Donahue et al., 2016].

While deep generative models often provide high-fidelity reconstructions, the representation z is generally not directly amenable to human interpretation. In contrast to classical methods such as principal components or factor analysis, individual dimensions of z do not necessarily encode any particular semantically meaningful variation in x. This has motivated a search for ways of learning disentangled representations, where perturbations of an individual dimension of the latent code z perturb the corresponding x in an interpretable manner. Various strategies for weak supervision have been employed, including semi-supervision of latent variables [Kingma et al., 2014, Siddharth et al., 2017], triplet supervision [Karaletsos et al., 2015, Veit et al., 2016], or batch-level factor invariances [Kulkarni et al., 2015, Bouchacourt et al., 2017]. There has also been a concerted effort to develop fully unsupervised approaches that modify the VAE objective to induce disentangled representations. A well-known example is the β-VAE [Higgins et al., 2016]. This has prompted a number of approaches that modify the VAE objective by adding, removing, or altering the weight of individual terms [Kumar et al., 2017, Zhao et al., 2017, Gao et al., 2018, Achille and Soatto, 2018].

In this paper, we introduce hierarchically factorized VAEs (HFVAEs). The HFVAE objective is based on a two-level hierarchical decomposition of the VAE objective, which allows us to control the relative levels of statistical independence between groups of variables and for individual variables in the same group. At each level, we induce statistical independence by minimizing the total correlation (TC), a generalization of the mutual information to more than two variables. A number of related approaches have also considered the TC [Kim and Mnih, 2018, Chen et al., 2018, Gao et al., 2018], but do not employ the two-level decomposition that we consider here. In our derivation, we reinterpret the standard VAE objective as a KL divergence between a generative model and its corresponding inference model. This has the side benefit that it provides a unified perspective on trade-offs in modifications of the VAE objective.

We illustrate the power of this decomposition by disentangling discrete factors of variation from continuous variables, which remains problematic for many existing approaches. We evaluate our methodology on a variety of datasets including dSprites, MNIST, Fashion MNIST (F-MNIST), CelebA and 20NewsGroups. Inspection of the learned representations confirms that our objective uncovers interpretable features in an unsupervised setting, and quantitative metrics demonstrate improvement over related methods. Crucially, we show that the learned representations can recover combinations of latent features that were not present in any examples in the training set, which has long been an implicit goal in learning disentangled representations and is now considered explicitly.

L(θ, φ) = E_{q_φ(z,x)}[ log p_θ(x,z) / (p_θ(x) p(z)) + log (q_φ(z) q(x)) / q_φ(z,x) + log p_θ(x)/q(x) + log p(z)/q_φ(z) ]
        = E_{q_φ(z,x)}[ log p_θ(x|z)/p_θ(x) (①) − log q_φ(z|x)/q_φ(z) (②) ] − KL(q(x) ‖ p_θ(x)) (③) − KL(q_φ(z) ‖ p(z)) (④).

Figure 1: ELBO decomposition. The VAE objective can be defined in terms of the KL divergence between a generative model p_θ(x,z) = p_θ(x|z) p(z) and an inference model q_φ(z,x) = q_φ(z|x) q(x). We can decompose this objective into 4 terms. Term ①, which can be intuitively thought of as the uniqueness of the reconstruction, is regularized by the mutual information ②, which represents the uniqueness of the encoding. Minimizing the KL in term ③ is equivalent to maximizing the marginal likelihood E_{q(x)}[log p_θ(x)]. Combined maximization of ① + ③ is equivalent to maximizing E_{q_φ(z,x)}[log p_θ(x|z)]. Term ④ matches the inference marginal q_φ(z) to the prior p(z), which in turn ensures realistic samples x ∼ p_θ(x) from the generative model.

Variational autoencoders jointly optimize two models. The generative model p_θ(x,z) defines a distribution on a set of latent variables z and observed data x in terms of a prior p(z) and a likelihood p_θ(x|z), which is often referred to as the decoder model. This distribution is estimated in tandem with an encoder, a conditional distribution q_φ(z|x) that performs approximate inference in this model.
The encoder and decoder together define a probabilistic autoencoder. The VAE objective is traditionally defined as a sum over datapoints x_n of the expected value of the per-datapoint ELBO, or alternatively as an expectation over an empirical distribution q(x) that approximates an unknown data distribution with a finite set of data points,

L_VAE(θ, φ) := E_{q(x)}[ E_{q_φ(z|x)}[ log p_θ(x,z) / q_φ(z|x) ] ],   q(x) := (1/N) Σ_{n=1}^{N} δ_{x_n}(x).   (1)

To better understand the various modifications of the VAE objective, which have often been introduced in an ad hoc manner, we here consider an alternate but equivalent definition of the VAE objective as a KL divergence between the generative model p_θ(x,z) and the inference model q_φ(z,x) = q_φ(z|x) q(x),

L(θ, φ) := −KL(q_φ(z,x) ‖ p_θ(x,z)) = E_{q_φ(z,x)}[ log p_θ(x,z) / q_φ(z|x) ] − E_{q(x)}[log q(x)].   (2)

This definition differs from the expression in Equation (1) only by a constant term log N, which is the entropy of the empirical data distribution q(x).

Figure 2: Illustration of the role of each term in the decomposition from Figure 1. A: Shows the default objective, whereas each subsequent panel shows the effect of removing one term from the objective. B: Removing ① means that we no longer require a unique z for each x_n. Term ② will then minimize I(x; z), which means that each x_n is mapped to the prior. C: Removing ② eliminates the constraint that I(x; z) must be small under the inference model, causing each x_n to be mapped to a unique region in z space. D: Removing ③ eliminates the constraint that p_θ(x) must match q(x). E: Removing ④ eliminates the constraint that q_φ(z) must match p(z).

The advantage of this interpretation as a KL divergence is that it becomes more apparent what it means to optimize the objective with respect to the generative model parameters θ and the inference model parameters φ. In particular, it is clear that the KL is minimized when p_θ(x,z) = q_φ(z,x), which in turn implies that the marginal distributions on the data, p_θ(x) = q(x), and on the latent code, q_φ(z) = p(z), must also match. We will refer to q_φ(z) as the inference marginal, which is the average over the data of the encoder distribution q_φ(z) = ∫ q_φ(z, x) dx = (1/N) Σ_{n=1}^{N} q_φ(z | x_n).

To more explicitly represent the trade-offs that are implicit in optimizing the VAE objective, we perform a decomposition (Figure 1) similar to the one obtained by Hoffman and Johnson [2016]. This decomposition yields 4 terms. Terms ③ and ④ enforce consistency between the marginal distributions over x and z. Minimizing the KL in term ③ maximizes the marginal likelihood E_{q(x)}[log p_θ(x)], whereas minimizing ④ ensures that the inference marginal q_φ(z) approximates the prior p(z). Terms ① and ② enforce consistency between the conditional distributions. Intuitively speaking, term ① maximizes the identifiability of the values z that generate each x_n; when we sample z ∼ q_φ(z | x_n), then the likelihood p_θ(x_n | z) under the generative model should be higher than the marginal likelihood p_θ(x_n). Term ② regularizes term ① by minimizing the mutual information I(z; x) in the inference model, which means that q_φ(z | x_n) maps each x_n to less identifiable values.
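To make the objective in Equation (1) concrete, the following is a minimal sketch (not the authors' implementation) of a single-sample Monte Carlo estimate for a diagonal-Gaussian encoder and a Bernoulli decoder; the module names and layer sizes are placeholders.

```python
import torch
from torch import nn
from torch.distributions import Normal, Bernoulli

class Encoder(nn.Module):
    """q_phi(z | x): a diagonal Gaussian whose parameters are produced by an MLP."""
    def __init__(self, x_dim=784, z_dim=10, h_dim=400):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.loc = nn.Linear(h_dim, z_dim)
        self.log_scale = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.hidden(x)
        return Normal(self.loc(h), self.log_scale(h).exp())

class Decoder(nn.Module):
    """p_theta(x | z): a Bernoulli likelihood over (binarized) pixels."""
    def __init__(self, x_dim=784, z_dim=10, h_dim=400):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, z):
        return Bernoulli(logits=self.net(z))

def elbo(x, encoder, decoder, prior):
    """Single-sample Monte Carlo estimate of Equation (1), averaged over a batch.
    x: binarized images in {0, 1}, flattened to x_dim."""
    q_z_x = encoder(x)
    z = q_z_x.rsample()                           # reparameterized sample z ~ q(z|x)
    log_p_x_z = decoder(z).log_prob(x).sum(-1)    # log p_theta(x | z)
    log_p_z = prior.log_prob(z).sum(-1)           # log p(z)
    log_q_z_x = q_z_x.log_prob(z).sum(-1)         # log q_phi(z | x)
    return (log_p_x_z + log_p_z - log_q_z_x).mean()

# usage: prior = Normal(torch.zeros(10), torch.ones(10))
#        loss = -elbo(x_batch, Encoder(), Decoder(), prior)
```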
Note that term ③ is intractable in practice, since we are not able to pointwise evaluate p_θ(x). We can circumvent this intractability by combining ① + ③ into a single term, which recovers the likelihood:

argmax_{θ,φ} E_{q_φ(z,x)}[ log p_θ(x|z)/p_θ(x) + log p_θ(x)/q(x) ] = argmax_{θ,φ} E_{q_φ(z,x)}[ log p_θ(x|z) ].

To build intuition for the impact of each of these terms, Figure 2 shows the effect of removing each term from the objective. When we remove ③ or ④, we can learn models in which p_θ(x) deviates from q(x), or q_φ(z) deviates from p(z). When we remove ①, we eliminate the requirement that p_θ(x_n | z) should be higher when z ∼ q_φ(z | x_n) than when z ∼ p(z). Provided the decoder model is sufficiently expressive, we would then learn a generative model that ignores the latent code z. This undesirable type of solution does in fact arise in certain cases, even when ① is included in the objective, particularly when using auto-regressive decoder architectures [Chen et al., 2016b].

When we remove ②, we learn a model that minimizes the overlap between q_φ(z | x_n) for different data points x_n, in order to maximize ①. This maximizes the mutual information I(x; z), which is upper-bounded by log N. In practice ② often saturates to log N, even when it is included in the objective, which suggests that maximizing ① outweighs this cost, at least for the encoder/decoder architectures that are commonly considered in present-day models.

In this paper, we are interested in defining an objective that will encourage statistical independence between features. The β-VAE objective [Higgins et al., 2016] aims to achieve this goal by defining the objective

L_β-VAE(θ, φ) = E_{q(x)}[ E_{q_φ(z|x)}[log p_θ(x|z)] − β KL(q_φ(z|x) ‖ p(z)) ].

We can express this objective in the terms of Figure 1 as ① + ③ + β(② + ④). In order to induce disentangled representations, the authors set β > 1. This works well in certain cases, but it has the drawback that it also increases the strength of ②, which means that the encoder model may discard more information about x in order to minimize the mutual information I(x; z).

−KL(q_φ(z) ‖ p(z)) = −E_{q_φ(z)}[ log q_φ(z)/∏_d q_φ(z_d) + log ∏_d q_φ(z_d)/∏_d p(z_d) + log ∏_d p(z_d)/p(z) ]
                   = E_{q_φ(z)}[ log p(z)/∏_d p(z_d) − log q_φ(z)/∏_d q_φ(z_d) ] (A) − Σ_d KL(q_φ(z_d) ‖ p(z_d)) (B),

and, for each group z_d appearing in B,

−KL(q_φ(z_d) ‖ p(z_d)) = E_{q_φ(z_d)}[ log p(z_d)/∏_e p(z_{d,e}) − log q_φ(z_d)/∏_e q_φ(z_{d,e}) ] (i) − Σ_e KL(q_φ(z_{d,e}) ‖ p(z_{d,e})) (ii).

Figure 3: Hierarchical KL decomposition. We can decompose term ④ into subcomponents A and B. Term A matches the total correlation between variables in the inference model relative to the total correlation under the generative model. Term B minimizes the KL divergence between the inference marginal and the prior marginal for each variable z_d. When the variable z_d contains sub-variables z_{d,e}, we can recursively decompose the KL on the marginals of z_d into term i, which matches the total correlation, and term ii, which minimizes the per-dimension KL divergence.

Looking at the β-VAE objective, it seems intuitive that increasing the weight of term ④ is likely to aid disentanglement. One notion of disentanglement is that there should be a low degree of correlation between different latent variables z_d. If we choose a mean-field prior p(z) = ∏_d p(z_d), then minimizing the KL term ④ should induce an inference marginal q_φ(z) = ∏_d q_φ(z_d) in which the z_d are also independent. However, in addition to being sensitive to correlations, the KL will also be sensitive to discrepancies in the shape of the distribution. When our primary interest is to disentangle representations, we may wish to relax the constraint that the shape of the distribution matches the prior in favor of enforcing statistical independence.

To make this intuition explicit, we decompose ④ into two terms A and B (Figure 3). Like the combination of terms ① and ②, term A consists of two components. The second of these takes the form of a total correlation, which is the generalization of the mutual information to more than two variables,

TC(z) = E_{q_φ(z)}[ log q_φ(z) / ∏_d q_φ(z_d) ] = KL( q_φ(z) ‖ ∏_d q_φ(z_d) ).   (3)

Minimizing the total correlation yields a q_φ(z) in which different z_d are statistically independent, thereby providing a possible mechanism for inducing disentanglement. In cases where z_d itself represents a group of variables, rather than a single variable, we can continue to decompose B into another set of terms i and ii, which match the total correlation for z_d and the KL divergences for the constituent variables z_{d,e}. This provides an opportunity to induce hierarchies of disentangled features. We can in principle continue this decomposition for any number of levels to define an HFVAE objective. We here restrict ourselves to the two-level case, which corresponds to an objective of the form

L_HFVAE(θ, φ) = ① + ③ + ii + α ② + β A + γ i.   (4)

In this objective, α controls the I(x; z) regularization, β controls the TC regularization between groups of variables, and γ controls the TC regularization within groups. This objective is similar to, but more general than, the ones recently proposed by Kim and Mnih [2018] and Chen et al. [2018]. Our objective admits these objectives as a special case corresponding to a non-hierarchical decomposition in which β = γ. The first component of A is not present in these objectives, which implicitly assume that p(z) = ∏_d p(z_d). In the more general case where p(z) ≠ ∏_d p(z_d), maximizing A with respect to φ will match the total correlation in q_φ(z) to that in p(z).
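As a concrete illustration of Equation (4), the sketch below assembles the objective from per-sample log-density values. It is a sketch under the assumption that Monte Carlo estimates of the marginal terms log q_φ(z), log q_φ(z_d), and log q_φ(z_{d,e}) are available (they can be obtained with the minibatch estimator described below); it is not the authors' reference implementation, and the function name is ours.

```python
def hfvae_objective(log_px_z, log_qz_x, log_qz, log_qz_d, log_qz_de,
                    log_pz, log_pz_d, log_pz_de, alpha=1.0, beta=1.0, gamma=1.0):
    """Monte Carlo estimate of Equation (4) for one batch of samples z ~ q(z|x).

    Shapes (torch tensors; B = batch size, D = number of groups, E = dims per group):
      log_px_z, log_qz_x, log_qz, log_pz : [B]
      log_qz_d, log_pz_d                 : [B, D]    (group marginals)
      log_qz_de, log_pz_de               : [B, D, E] (per-dimension marginals)
    """
    term_13 = log_px_z                              # terms (1) + (3): reconstruction
    term_2 = -(log_qz_x - log_qz)                   # term (2): minus the index-code MI
    term_A = (log_pz - log_pz_d.sum(-1)) - (log_qz - log_qz_d.sum(-1))
    term_i = ((log_pz_d - log_pz_de.sum(-1))
              - (log_qz_d - log_qz_de.sum(-1))).sum(-1)
    term_ii = (log_pz_de - log_qz_de).sum(-1).sum(-1)
    objective = term_13 + alpha * term_2 + beta * term_A + gamma * term_i + term_ii
    return objective.mean()                          # quantity to maximize
```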
In order to optimize this objective, we need to approximate the inference marginals q_φ(z), q_φ(z_d), and q_φ(z_{d,e}). Computing these quantities exactly requires a full pass over the dataset, since q_φ(z) is a mixture over all data points in the training set. We approximate q_φ(z) with a Monte Carlo estimate q̂_φ(z) over the same batch of samples that we use to approximate all other terms in the objective L_HFVAE(θ, φ). For simplicity we will consider the term

E_{q_φ(z,x)}[log q_φ(z)] ≃ (1/B) Σ_{b=1}^{B} log q_φ(z^b),   z^b ∼ q_φ(z | x^b),  b ∼ Uniform(1, ..., N).   (5)

We define the estimate q̂_φ(z^b) as (see Appendix A.1)

q̂_φ(z^b) := (1/N) q_φ(z^b | x^b) + (N−1)/(N(B−1)) Σ_{b'≠b} q_φ(z^b | x^{b'}).   (6)

This estimator differs from the one in Kim and Mnih [2018], which is based on adversarial-style estimation of the density ratio, and is also distinct from the estimators in Chen et al. [2018], who employ different approximations. We can think of this approximation as a partially stratified sample, in which we deterministically include the term x_n = x^b and compute a Monte Carlo estimate over the remaining terms, treating the indices b' ≠ b as samples from the distribution q(x | x ≠ x^b). We now substitute log q̂_φ(z) for log q_φ(z) in Equation (5). By Jensen's inequality this yields a lower bound on the original expectation. While this induces a bias, the estimator is consistent. In practice, the bias is likely to be small given the batch sizes (512–1024) needed to approximate the inference marginal.
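The following sketch implements the estimator in Equation (6) in log space for a factorized Gaussian encoder. The argument layout (a distribution object whose batch dimension ranges over the encoder parameters for the current minibatch) is an assumption of this sketch, not the authors' API.

```python
import math
import torch

def log_inference_marginal(z, q_z_x, N):
    """log q_hat(z^b) from Equation (6) for every sample in a batch.

    z     : [B, D] samples with z[b] ~ q(z | x^b)
    q_z_x : a torch.distributions.Normal with loc/scale of shape [B, D],
            so its log_prob can evaluate log q(z^b | x^{b'}) for all pairs
    N     : number of training points
    """
    B = z.shape[0]
    # log q(z^b | x^{b'}) for every pair (b, b'): shape [B, B]
    log_q_pairs = q_z_x.log_prob(z.unsqueeze(1)).sum(-1)
    log_diag = log_q_pairs.diagonal() - math.log(N)         # b' = b term, weight 1/N
    mask = torch.diag(torch.full((B,), float('-inf')))       # exclude b' = b below
    log_rest = (torch.logsumexp(log_q_pairs + mask, dim=1)
                + math.log((N - 1) / (N * (B - 1))))         # remaining terms
    return torch.logaddexp(log_diag, log_rest)
```

Substituting these values for log q_φ(z^b) in Equation (5) gives the lower-bound estimate discussed above; the same construction applies to the group and per-dimension marginals q_φ(z_d) and q_φ(z_{d,e}).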
Figure 4: Interpretable factors in CelebA (orientation, smiling, sunglasses) for an HFVAE (β = 5, γ = 3) and a β-VAE (β = 8).

In addition to the work of Kim and Mnih [2018] and Chen et al. [2018], our objective is also related to, and generalizes, a number of recently proposed modifications of the VAE objective (see Table 1 for an overview). Zhao et al. [2017] consider an objective that eliminates the mutual information in ② entirely and assigns an additional weight to the KL divergence in ④. Kumar et al. [2017] approximate the KL divergence in ④ by matching the covariance of q_φ(z) and p(z). Recent work by Gao et al. [2018] connects VAEs to the principle of correlation explanation, and defines an objective that reduces the mutual information regularization in ② for a subset of "anchor" variables z_a. Achille and Soatto [2018] interpret VAEs from an information-bottleneck perspective and introduce an additional TC term into the objective. In addition to VAEs, generative adversarial networks (GANs) have also been used to learn disentangled representations. The InfoGAN [Chen et al., 2016a] achieves disentanglement by maximizing the mutual information between individual features and the data under the generative model.

Paper                                               Objective
Kingma and Welling [2013], Rezende et al. [2014]    ① + ② + ③ + ④
Higgins et al. [2016]                               ① + ③ + β(② + ④)
Kumar et al. [2017]                                 ① + ② + ③ + λ④
Zhao et al. [2017]                                  ① + ③ + λ④
Alemi et al. [2018], Burgess et al. [2018]          ① + ③ + γ|(② + ④) − C|
Gao et al. [2018]                                   ① + ② + ③ + ④ − λ②_a
Achille and Soatto [2018]                           ① + ③ + β② + γA*
Kim and Mnih [2018], Chen et al. [2018]             ① + ② + ③ + B + βA*
HFVAE (this paper)                                  ① + ③ + ii + α② + βA + γi

Table 1: Comparison of objectives in autoencoding deep generative models. The asterisk in A* indicates that the prior factorizes, i.e. p(z) = ∏_d p(z_d). The notation ②_a refers to restriction of the mutual information to a subset of "anchor" variables z_a.

In settings where we are not primarily interested in inducing disentangled representations, the β-VAE objective has also been used with β < 1 in order to improve the quality of reconstructions [Alemi et al., 2016, Engel et al., 2017, Liang et al., 2018]. While this also decreases the relative weight of ②, in practice it does not influence the learned representation in cases where I(x; z) saturates anyway. The trade-off between likelihood and the KL term, and the influence of penalizing the KL term on the mutual information, have been studied in more depth by Alemi et al. [2018] and Burgess et al. [2018]. In other recent work, Dupont [2018] considered models containing both Concrete and Gaussian variables. However, the objective was not decomposed to obtain A, but was based on the objective proposed by Burgess et al. [2018].

To assess the quality of the disentangled representations that the HFVAE induces, we evaluate a number of tasks and datasets. We consider
CelebA [Liu et al., 2015] and dSprites [Higgins et al., 2016] as exemplars of datasets that are typically used to demonstrate general-purpose disentangling. As specific examples of datasets that require a discrete variable, we consider MNIST [LeCun et al., 2010] and F-MNIST [Xiao et al., 2017]. Finally, we consider an example that extends beyond image-based domains by using the HFVAE objective to train neural topic models on the 20NewsGroups [Lang, 2007] dataset. We compare a number of objectives and priors on these datasets, including the standard VAE objective [Kingma and Welling, 2013, Rezende et al., 2014], the β-VAE objective [Higgins et al., 2016], the β-TCVAE objective [Chen et al., 2018], and our HFVAE objective.

Figure 5: Left: MNIST and F-MNIST reconstructions for z_d varied over (−3, 3). Rows contain both different samples from the data and different dimensions d. Right: The mutual information for each individual dimension I(x; z_d), ranked in ascending order, with the Concrete variable shown last. The HFVAE prunes the dimensions that are not used, for which the mutual information drops to zero.

Figure 6: Learned topics in the 20NewsGroups dataset using the HFVAE and the VAE objective. The middle column shows frequent words for the 3 most informative topics of the VAE (jesus, scripture, god, christ, bible, sin, christian, doctrine, faith, church; team, game, player, score, league, play, leafs, hockey, season, nhl; scsi, ide, controller, drive, bus, subject, lines, organization, card, problem) and the 3 most correlated topics in the HFVAE (armenian, turk, civilian, turkish, soldier, kill, extermination, armenia, israel, jew; jesus, belief, god, christian, christ, faith, scripture, moral, truth, sin; gun, crime, people, law, defense, government, criminal, shoot, assault, fire). The left column lists their corresponding mutual information with x and the topic coherence (NPMI) score. The right column shows the correlations between topic activations. The HFVAE learns 2 groups of topics that are internally uncorrelated (top-left and bottom-right quadrants), whilst uncovering sparse correlations between groups (top-right and bottom-left quadrants).

A crucial feature of our experiments is the realization that the HFVAE objective serves to induce representations in which correlations in the inference marginal q_φ(z) match correlations in the prior p(z). The two-level decomposition allows us to control the level at which we induce independence. For example, when considering appropriate priors for the MNIST and F-MNIST data, we can take into account the fact that they each contain 10 explicit classes; likewise for dSprites, which contains 3 explicit shape classes. We model these distinct classes using a Concrete distribution [Maddison et al., 2017, Jang et al., 2017] of appropriate dimension. We can also assume that these datasets have a multidimensional continuous style variable. The HFVAE allows us to induce a stronger independence between the style and the class, while allowing some correlation between the individual style dimensions. We can also use the two-level decomposition to induce correlations between variables by decreasing the strength of A. As an example, we consider the task of uncovering correlations between topics.

A full list of priors employed is given in Table 2, and the associated model architectures are described in Appendix B. Note that in connection to Figure 3, the subscript 'd' refers to the class or style variables, represented by Concrete and normal distributions respectively, while the subscript 'e' refers to an individual dimension within the normal variable. In all models, we use a single implementation of the objective based on the Probabilistic Torch library for deep generative models [Siddharth et al., 2017] (https://github.com/probtorch/probtorch).

We begin with a qualitative evaluation of the features that are identified when training with the HFVAE objective. Figure 5 shows results for the MNIST and F-MNIST datasets. For the MNIST dataset the representation recovers 7 distinct interpretable features (slant, width, height, openness, stroke, thickness, and roundness), while choosing to ignore the remaining dimensions available.
            VAE       HFVAE
            Normal    Normal    Concrete
MNIST       10        10        10
F-MNIST     10        10        10
dSprites    10        10        3
CelebA      20        20        2

Table 2: Dimensionality of latent variables.
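As an illustration of the priors in Table 2, the following sketch builds the MNIST prior from a 10-dimensional Normal style variable and a 10-way Concrete class variable using standard PyTorch distributions; the relaxation temperature is an assumption rather than a setting reported in the paper.

```python
import torch
from torch.distributions import Normal, RelaxedOneHotCategorical

style_prior = Normal(torch.zeros(10), torch.ones(10))    # z_c: continuous style group
class_prior = RelaxedOneHotCategorical(                   # z_d: Concrete class group
    temperature=torch.tensor(0.66),                        # assumed temperature
    probs=torch.ones(10) / 10)

z_c = style_prior.rsample()
z_d = class_prior.rsample()
z = torch.cat([z_c, z_d])      # concatenated code passed to the decoder
```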
Figure 7: Left: Manipulation of the thickness variable over the range −3 to 3 for the β-VAE (β = 4) and the HFVAE (β = 12, γ = 4). The β-VAE is not able to maintain digit identity as we vary thickness. The HFVAE, which incorporates a discrete variable into the prior, is able to maintain the digit identity across the entire range. Right: Mutual information I(y; z) between the label y and individual dimensions of z for the β-VAE (β = 4) and the HFVAE (β = 12, γ = 4).
Figure 8: On the left, we plot I(x; {z, c}) vs TC({z, c}) for 50 restarts with seven values of γ = β. On the right, we keep γ = 3 fixed and vary β, plotting the mutual information gap (as proposed by Chen et al. [2018]) between the Concrete variable and the largest continuous variable. While there is generally no "best" value for γ, TC({z, c}) decreases by 5 nats over the range of γ = β values considered, while I(x; {z, c}) decreases by 0.5 nats. Moreover, increasing β relative to γ also increases the mutual information gap.

Observing the mutual information I(x; z_d) between the data x and individual dimensions of the latent space z_d confirms the separation of the latent space into useful and ignored subspaces. As can be seen from Figure 5 (right), the mutual information drops to zero for the unused dimensions. We observe a similar trend for the F-MNIST and CelebA datasets, recovering distinct interpretable factors. For F-MNIST (Figure 5, middle), these include width, length, brightness, etc. For CelebA (Figure 4), we uncover interpretable features such as the orientation of the face, variation from smiling to non-smiling, and the presence of sunglasses.

As a quantitative assessment of the quality of the learned representations, we evaluate the metrics proposed by Kim and Mnih [2018] and Eastwood and Williams [2018] on the dSprites dataset with 10 random restarts for each model. For the Eastwood and Williams metric, we used a random forest to regress from features to the (known) ground-truth factors for the data. In Table 3, we list these metrics on the dSprites dataset for each of the model types and objectives defined above, noting that the HFVAE performs similarly to other approaches.

Research on disentangled representations has thus far almost exclusively considered visual (image) data. Exploration of disentangled representations for text is still in its infancy, and generally relies on weak supervision [Jain et al., 2018]. To assess whether the HFVAE objective can aid interpretability in textual domains, we consider documents in the 20NewsGroups dataset, containing e-mail messages from a number of internet newsgroups; some groups are more closely related (e.g. religion vs politics) whereas others are not related at all (e.g. science vs sports). We analyze this data using ProdLDA [Srivastava and Sutton, 2017] and the neural variational document model (NVDM) [Miao et al., 2016], both of which are autoencoding neural topic models. ProdLDA approximates a Dirichlet prior using samples from a Gaussian that are normalized with a softmax function, and then combines this prior with a log-linear likelihood model for the words. The NVDM includes an MLP encoder and a log-linear decoder with the assumption of a Gaussian prior.
Model        Kim          Eastwood
VAE          0.63 ± –     – ± –
β-VAE        – ± –        – ± –
β-TCVAE      – ± –        – ± –
HFVAE        – ± –        – ± –

Table 3: Disentanglement scores (± one standard deviation) for the dSprites dataset using the metrics proposed by Kim and Mnih [2018] and Eastwood and Williams [2018].
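For reference, the sketch below shows one way to construct the feature-importance matrix that underlies the Eastwood and Williams [2018] metric, using a random forest regressor as in our experiments. It is a simplified sketch (the full metric also reports completeness and informativeness, which are omitted here), and the function names are our own.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def importance_matrix(codes, factors, n_estimators=10):
    """R[d, k]: importance of latent dimension d for ground-truth factor k.

    codes   : [N, D] latent codes for N examples
    factors : [N, K] ground-truth factor values (e.g. dSprites generative factors)
    """
    D, K = codes.shape[1], factors.shape[1]
    R = np.zeros((D, K))
    for k in range(K):
        forest = RandomForestRegressor(n_estimators=n_estimators)
        forest.fit(codes, factors[:, k])
        R[:, k] = forest.feature_importances_
    return R

def disentanglement_per_dimension(R, eps=1e-12):
    """1 minus the normalized entropy of each row: how exclusively a code
    dimension captures a single factor (higher is better)."""
    P = R / (R.sum(axis=1, keepdims=True) + eps)
    H = -(P * np.log(P + eps)).sum(axis=1) / np.log(R.shape[1])
    return 1.0 - H
```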
Figure 9: Generalization to unseen combinations of factors. An HFVAE is trained on the full dataset, and then retrained after a subset of the data is pruned. We then test generalization on the removed portion of the data.

To evaluate whether HFVAEs are able to uncover these correlations between topics, we compare a normal ProdLDA model with 50 topics, which we train with a standard VAE objective, to an HFVAE implementation in which we assume two latent variables with 25 dimensions each. We use a value of β smaller than 1 and γ = 4, which means that we relax the constraint on the TC at the group level. In other words, the HFVAE model should learn two groups of 25 topics in a manner that allows correlations between topics in different groups, and prevents correlations within a group.

Figure 6 shows that this approach indeed works as expected. The HFVAE learns correlations between topics that are distinct yet not unrelated, such as religion and politics in the Middle East, whereas the VAE does not uncover any significant correlations between topics. As with the MNIST and F-MNIST examples, there is also a significant degree of pruning. The HFVAE learns topics with a significant mutual information I(x; z_d), whereas the VAE has a nonzero mutual information for all 50 topics.

Additionally, we train an NVDM with a 50-dimensional latent variable using the normal VAE objective, comparing this baseline to an HFVAE with two 25-dimensional latent variables, trained with β = 7 and γ = 4, allowing correlations within a group but preventing correlations across groups. The details of this experiment are given in Appendix D.2 and Figure 12. We note that the latent dimensions of the HFVAE achieve a higher degree of disentanglement.

An interesting question relating to the use of discrete latent variables in our framework is how effective these variables are at improving disentanglement. In general, a discrete variable over K dimensions expresses a sparsity constraint over those dimensions, as any choice made from that variable is constrained to lie on the vertices of the K-dimensional simplex it represents. This interpretation broadly carries over even in the case of continuous relaxations such as the Concrete distribution that we employ to enable reparameterization for gradient-based methods.

The ability of our framework to disentangle discrete factors of variation using the discrete variables then depends on how separable the underlying classes or identities actually are. For example, in the case of F-MNIST, a number of classes (i.e., shoes, trousers, dresses, etc.) are visually distinctive enough that they can be captured faithfully by the discrete latent variable. However, in the case of dSprites, while the shapes (squares, ellipses, and hearts) are conceptually distinct, visually they can often be difficult to distinguish, with actual differences on the scale of a few pixels. Here, our approach is less effective at disentangling the relevant factors with the discrete latent variable. The MNIST dataset lies in the middle of these two extremes, allowing for clear separation in terms of the digits themselves, but also partially blurring the lines with how similar some of them may appear: a 9 can appear quite similar to a 4, for example, under some small perturbation.

Figure 7 showcases the ability of the HFVAE to disentangle discrete latent variables from continuous factors of variation in an unsupervised manner. Here, we have sampled a single data point for each digit and vary the "thickness"-encoding dimension for each of the β-VAE and HFVAE models. Clearly, the HFVAE does a better job at disentangling digit vs. thickness. To quantify this ability, as before, we measure the mutual information I(y; z_d) between the label and each latent dimension, and confirm that the HFVAE learns to encode the information about digits in the discrete variable. In the case of the β-VAE, this information is less clearly captured, and is spread out across the available dimensions.

Here, we analyze the influence of γ and β on the trade-off between total correlation and mutual information. We also show why setting γ ≠ β is necessary for better disentanglement, in comparison to previous objectives [Chen et al., 2018, Kim and Mnih, 2018] where γ = β. We investigate both of these facets on the MNIST dataset. Figure 8 (left) shows the results of running our model for 50 restarts, for each of seven different values of γ = β, indicating a positive correlation between the mutual information I(x; z, c) and the total correlation TC(z, c). In this figure, the ideal is the lower-right corner, corresponding to high mutual information and low total correlation. Figure 8 (right) shows how the mutual information gap (MIG), defined by Chen et al. [2018] to be I(y; c) − max_i I(y; z_i), varies as a function of β, for a fixed γ = 3. We observe that higher values of β result in the Concrete variable better capturing the label information.

A particular feature of disentangled representations is their utility, with evidence from human cognition [Lake et al., 2017] suggesting that learning independent factors can aid in generalization to previously unseen combinations of factors. For example, one can imagine a pink elephant even if one has (sadly) not encountered such an entity previously. To evaluate whether the representations learned using HFVAEs exhibit such properties, we introduce a novel measure of disentanglement quality. Having first trained a model with the chosen data and objective, here MNIST and the HFVAE, we prune the dataset, removing data containing some particular combinations of factors, say images depicting a thick number 7, or a narrow 0. We then re-train with the modified dataset, using the pruned data as unseen test data.

Figure 9 shows the results of this experiment. As can be seen, the model trained on pruned data is able to successfully reconstruct digits with values for the stroke and character width that were never encountered during training. The histograms for the feature values show the ability of the HFVAE to correctly encode features from previously unseen examples.
Much of the work on learning disentangled representations thus far has focused on cases where the factors of variation are uncorrelated scalar variables. As we begin to apply these techniques to real-world datasets, we are likely to encounter correlations between latent variables, particularly when there are causal dependencies between them. This work is a first step towards the learning of more structured disentangled representations. By enforcing statistical independence between groups of variables, or relaxing this constraint, we now have the capability to disentangle variables that have higher-dimensional representations. An avenue of future work is to develop datasets that allow us to more rigorously test our ability to characterize correlations between higher-dimensional variables.
References
A. Achille and S. Soatto. Information Dropout: Learning Optimal Representations Through Noisy Computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, PP(99):1–1, 2018.
Alexander Alemi, Ben Poole, Ian Fischer, Joshua Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken ELBO. In International Conference on Machine Learning, pages 159–168, 2018.
Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep Variational Information Bottleneck. arXiv:1612.00410 [cs, math], December 2016.
Diane Bouchacourt, Ryota Tomioka, and Sebastian Nowozin. Multi-level variational autoencoder: Learning disentangled representations from grouped observations. arXiv preprint arXiv:1705.08841, 2017.
Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. CoNLL 2016, page 10, 2016.
Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.
Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.
Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016a.
Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016b.
Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
Emilien Dupont. Joint-VAE: Learning disentangled joint continuous and discrete representations. arXiv preprint arXiv:1804.00104, 2018.
Cian Eastwood and Christopher K. I. Williams. A Framework for the Quantitative Evaluation of Disentangled Representations. In International Conference on Learning Representations, February 2018.
Jesse Engel, Matthew Hoffman, and Adam Roberts. Latent Constraints: Learning to Generate Conditionally from Unconditional Generative Models. arXiv:1711.05772 [cs, stat], November 2017.
Shuyang Gao, Rob Brekelmans, Greg Ver Steeg, and Aram Galstyan. Auto-encoding total correlation explanation. arXiv preprint arXiv:1802.05822, 2018.
Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
Rafael Gómez-Bombarelli, Jennifer N Wei, David Duvenaud, José Miguel Hernández-Lobato, Benjamín Sánchez-Lengeling, Dennis Sheberla, Jorge Aguilera-Iparraguirre, Timothy D Hirzel, Ryan P Adams, and Alán Aspuru-Guzik. Automatic chemical design using a data-driven continuous representation of molecules. ACS Central Science, 2018.
Ishaan Gulrajani, Kundan Kumar, Faruk Ahmed, Adrien Ali Taiga, Francesco Visin, David Vazquez, and Aaron Courville. PixelVAE: A latent variable model for natural images. In International Conference on Machine Learning, 2017.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.
Matthew D. Hoffman and Matthew J. Johnson. ELBO surgery: Yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, 2016.
Sarthak Jain, Edward Banner, Jan-Willem van de Meent, Iain J. Marshall, and Byron C. Wallace. Learning Disentangled Representations of Texts with Application to Biomedical Abstracts. arXiv:1804.07212 [cs], April 2018.
Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2017.
Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. Bayesian representation learning with oracle constraints. arXiv preprint arXiv:1506.05011, 2015.
Hyunjik Kim and Andriy Mnih. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.
Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2013.
Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589, 2014.
Tejas D Kulkarni, William F Whitney, Pushmeet Kohli, and Josh Tenenbaum. Deep convolutional inverse graphics network. In Advances in Neural Information Processing Systems, pages 2539–2547, 2015.
Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2017.
Matt J Kusner, Brooks Paige, and José Miguel Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning, 2017.
Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
Ken Lang. 20 newsgroups data set, 2007. [Online; accessed 18-May-2018].
Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten digit database. AT&T Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
Dawen Liang, Rahul G. Krishnan, Matthew D. Hoffman, and Tony Jebara. Variational Autoencoders for Collaborative Filtering. arXiv:1802.05814 [cs, stat], February 2018.
Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
Chris J Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2017.
Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
Yishu Miao, Lei Yu, and Phil Blunsom. Neural variational inference for text processing. In International Conference on Machine Learning, pages 1727–1736, 2016.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pages 1278–1286, 2014.
N. Siddharth, Brooks Paige, Jan-Willem van de Meent, Alban Desmaison, Frank Wood, Noah D Goodman, Pushmeet Kohli, and Philip HS Torr. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, 2017.
Akash Srivastava and Charles Sutton. Autoencoding variational inference for topic models. arXiv preprint arXiv:1703.01488, 2017.
Lucas Theis, Wenzhe Shi, Andrew Cunningham, and Ferenc Huszár. Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395, 2017.
Andreas Veit, Serge Belongie, and Theofanis Karaletsos. Disentangling Nonlinear Perceptual Embeddings With Multi-Query Triplet Networks. arXiv preprint arXiv:1603.07810, 2016.
Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. Variational autoencoder for semi-supervised text classification. In AAAI, pages 3358–3364, 2017.
Shengjia Zhao, Jiaming Song, and Stefano Ermon. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017.
A Appendix
A.1 Approximating the Inference Marginal
We will here derive a Monte Carlo estimator for the entropy of the marginal q_φ(z) of the inference model,

H_φ[z] = −E_{q_φ(z)}[log q_φ(z)].   (7)

As with other terms in the objective, we can approximate this expectation by sampling z^b ∼ q_φ(z) using

x^b ∼ q(x),  b = 1, ..., B,   (8)
z^b ∼ q_φ(z | x^b).   (9)

We now additionally need to approximate the values

log q_φ(z^b) = log[ (1/N) Σ_{n=1}^{N} q_φ(z^b | x_n) ].   (10)

We will do so by pulling the term for which x_n = x^b out of the sum,

q_φ(z^b) = (1/N) q_φ(z^b | x^b) + (1/N) Σ_{x_n ≠ x^b} q_φ(z^b | x_n).

As also noted by Chen et al. [2018], the intuition behind this decomposition is that q_φ(z^b | x^b) will in general be much larger than q_φ(z^b | x_n).

We can approximate the second term using a Monte Carlo estimate from samples x^{(b,c)} ∼ q(x | x ≠ x^b),

(1/(N−1)) Σ_{x_n ≠ x^b} q_φ(z^b | x_n) ≃ (1/C) Σ_{c=1}^{C} q_φ(z^b | x^{(b,c)}).

Note here that we have written 1/(N−1) instead of 1/N in order to ensure that the sum defines an expected value over the distribution q(x | x ≠ x^b).

In practice, we can replace the samples x^{(b,c)} with the samples b' ≠ b from the original batch, which yields an estimator over C = B − 1 samples,

q̂_φ(z^b) = (1/N) q_φ(z^b | x^b) + (N−1)/(N(B−1)) Σ_{b'≠b} q_φ(z^b | x^{b'}).

Note that this estimator is unbiased, which is to say that

E[q̂_φ(z^b)] = q_φ(z^b).   (11)

In order to compute the entropy, we now define an estimator Ĥ_φ[z], which defines an upper bound on H_φ[z],

Ĥ_φ[z] ≃ −(1/B) Σ_{b=1}^{B} log q̂_φ(z^b) ≥ H_φ[z].   (12)

The upper-bound relationship follows from Jensen's inequality, which states that

E[log q̂_φ(z)] ≤ log E[q̂_φ(z)] = log q_φ(z).   (13)

A.2 Mutual Information between label y and representation z

We quantize each individual dimension z_d into 10 bins based on the CDF of the empirical distribution. In other words, each dimension z_d is divided such that each bin contains 10% of the training data. We then compute the mutual information I(z_d; y) as:

I(z ∈ bin_i, y = k) = q(z ∈ bin_i, y = k) [ log q(z ∈ bin_i, y = k) / (q(z ∈ bin_i) q(y = k)) ]

For the case where z is a Concrete variable, we use the following formulation:

I(z = l, y = k) = q(z = l, y = k) [ log q(z = l, y = k) / (q(z = l) q(y = k)) ]

q(z = l, y = k) = q(y = k) q(z = l | y = k) = (N_k / N) q(z = l | y = k)

q(z = l | y = k) = Σ_x q(z = l, x | y = k) = Σ_x q(z = l | x, y = k) q(x | y = k) = (1/N_k) Σ_x q(z = l | x, y = k)

q(z = l) = Σ_x q(z = l, x) = Σ_x q(z = l | x) q(x) = (1/N) Σ_x q(z = l | x)

Finally, for the overall mutual information we have:

I(z, y) = Σ_l Σ_k I(z = l, y = k)
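A small sketch of the quantization scheme described above, assuming z_d is a vector holding the values of one latent dimension over the training set and y holds integer class labels; the helper name is ours.

```python
import numpy as np

def binned_mutual_information(z_d, y, n_bins=10, eps=1e-12):
    """I(z_d; y) with z_d quantized into n_bins equal-mass bins (10% of the data each)."""
    edges = np.quantile(z_d, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.digitize(z_d, edges)                      # bin index for every example
    joint = np.zeros((n_bins, int(y.max()) + 1))
    for b, k in zip(bins, y):
        joint[b, k] += 1.0
    joint /= joint.sum()                                # q(z in bin_i, y = k)
    p_bin = joint.sum(axis=1, keepdims=True)            # q(z in bin_i)
    p_lab = joint.sum(axis=0, keepdims=True)            # q(y = k)
    return float((joint * (np.log(joint + eps) - np.log(p_bin @ p_lab + eps))).sum())
```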
B Model Architectures

We considered 4 datasets:

- dSprites [Higgins et al., 2016]: 737,280 binary 64 × 64 images of 2D shapes with ground-truth factors,
- MNIST [LeCun et al., 2010]: 60,000 gray-scale 28 × 28 images of handwritten digits,
- F-MNIST [Xiao et al., 2017]: 60,000 gray-scale 28 × 28 images of clothing items divided into 10 classes,
- CelebA [Liu et al., 2015]: 202,599 RGB images of celebrity faces.

As mentioned in the main text, we used two latent variables for each of the datasets: one modeled as a normal distribution representing continuous factors (denoted z_c), and one modeled as a Concrete distribution to detect categories (denoted z_d). We used the Adam optimizer with a learning rate of 1e-3 and the default settings.

Table 4: Encoder and decoder architecture for the MNIST and F-MNIST datasets.

Encoder                               Decoder
Input 28 × 28 grayscale image         Input z = Concat[z_c ∈ R^10, z_d ∈ (0,1)^10]
FC 400, ReLU                          FC 200, ReLU
FC ReLU; FC 10 (z_d)                  FC 400, ReLU
FC 2 × 10 (z_c)                       FC 28 × 28, Sigmoid

Table 5: Encoder and decoder architecture for the dSprites data.

Encoder                               Decoder
Input 64 × 64 binary image            Input z = Concat[z_c ∈ R^10, z_d ∈ (0,1)^3]
FC 1200, ReLU                         FC 400, Tanh
FC 1200, ReLU                         FC 1200, Tanh
FC ReLU; FC 3 (z_d)                   FC 1200, Tanh
FC 2 × 10 (z_c)                       FC 64 × 64, Sigmoid

Table 6: Encoder and decoder architecture for the CelebA data.

Encoder                                      Decoder
Input RGB image                              Input z = Concat[z_c ∈ R^20, z_d ∈ {0,1}^2]
conv 32, BatchNorm, ReLU, stride 2           FC 256, ReLU
conv 32, BatchNorm, ReLU, stride 2           FC, Tanh
conv 64, BatchNorm, ReLU, stride 2           upconv 64, BatchNorm, ReLU, stride 2
conv 64, BatchNorm, ReLU, stride 2           upconv 32, BatchNorm, ReLU, stride 2
FC ReLU; FC 2 (z_d)                          upconv 32, BatchNorm, ReLU, stride 2
FC ReLU (z_c)                                upconv 3, stride 2
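The following is a sketch of the MNIST/F-MNIST architecture in Table 4 in PyTorch; the sizes of layers whose dimensions were not recoverable from the table are assumptions, and the class names are ours.

```python
import torch
from torch import nn

class MnistEncoder(nn.Module):
    def __init__(self, z_c_dim=10, z_d_dim=10, h_dim=400):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(784, h_dim), nn.ReLU())
        self.class_logits = nn.Linear(h_dim, z_d_dim)        # parameters of the Concrete z_d
        self.style_loc = nn.Linear(h_dim, z_c_dim)            # mean of the Normal z_c
        self.style_log_scale = nn.Linear(h_dim, z_c_dim)      # log std of the Normal z_c

    def forward(self, x):
        h = self.trunk(x.view(x.size(0), -1))
        return self.class_logits(h), self.style_loc(h), self.style_log_scale(h)

class MnistDecoder(nn.Module):
    def __init__(self, z_c_dim=10, z_d_dim=10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_c_dim + z_d_dim, 200), nn.ReLU(),
                                 nn.Linear(200, 400), nn.ReLU(),
                                 nn.Linear(400, 784), nn.Sigmoid())

    def forward(self, z_c, z_d):
        return self.net(torch.cat([z_c, z_d], dim=-1))        # Bernoulli means over pixels
```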
C Latent Traversals

Figure 10: Qualitative results for disentanglement on the MNIST dataset. In each case, one particular z_d is varied from −3 to 3 while the others are fixed at 0. For this particular set of traversals, we used 10% supervision in order to extract the digit more reliably, thereby visualizing all 'style' features present in MNIST.

Figure 11: Qualitative results for disentanglement on the F-MNIST dataset. In each case, one particular z_d is varied from −3 to 3 while the others are fixed at 0.

Figure 12: Learned topics in the 20NewsGroups dataset using the HFVAE objective and the VAE objective. The middle column shows frequent words for the 3 most informative dimensions of the latent space. The left column lists their corresponding mutual information with x. The right column shows the mutual information between the latent code and binary indicator variables for the document category.

D Disentangled Representation for Text
D.1 Model Architectures
We consider the following dataset:

- 20NewsGroups [Lang, 2007]: 11,314 newsgroup documents which are partitioned into 20 categories. We used a bag-of-words representation with a vocabulary size of 2000, after removing stopwords using the Mallet stopwords list.

With the HFVAE objective, we used two latent variables (denoted z_c1 and z_c2) with 25 dimensions each. In ProdLDA, we used the Adam optimizer with a learning rate of 1e-3; in NVDM, we used the Adam optimizer with a learning rate of 1e-5 and default settings.
Table 7: Encoder and decoder architecture in ProdLDA.

Encoder                               Decoder
Input 2000-dimensional document       Input z_c1 ∈ R^25, z_c2 ∈ R^25
FC 100, Softplus                      Softmax, Dropout
FC 100, Softplus, Dropout             FC 2000, BatchNorm, Softmax
FC 2 × 25, BatchNorm (z_c1)
FC 2 × 25, BatchNorm (z_c2)

Table 8: Encoder and decoder architecture in NVDM.

Encoder                               Decoder
Input 2000-dimensional document       Input z_c1 ∈ R^25, z_c2 ∈ R^25
FC 500, ReLU                          FC 2000, Softmax
FC 2 × 25 (z_c1)
FC 2 × 25 (z_c2)
D.2 Neural variational document model
We train a standard NVDM with a 50-dimensional latent variable using the normal VAE objective. We compare this baseline to an HFVAE with two 25-dimensional latent variables, trained with β = 7 and γ = 4, allowing correlations within a group but preventing correlations across groups.

Figure 12 (right) shows the mutual information between the latent code and binary indicator variables for the document category. We see that the latent dimensions of the HFVAE (columns) achieve a higher degree of disentanglement, as is evident from the fact that the indicator labels (shown as rows) generally correlate with only one latent feature (shown as columns). Note that a single feature can capture two distinct topics in this model (of which only one is shown), which correspond to negative and positive weights in the likelihood model.

D.3 Binary Indicator Variables for Document Category