Unsupervised Model Selection for Variational Disentangled Representation Learning
Sunny Duan, Loic Matthey, Andre Saraiva, Nicholas Watters, Christopher P. Burgess, Alexander Lerchner, Irina Higgins
Published as a conference paper at ICLR 2020
DeepMind

Abstract
Disentangled representations have recently been shown to improve fairness, data efficiency and generalisation in simple supervised and reinforcement learning tasks. To extend the benefits of disentangled representations to more complex domains and practical applications, it is important to enable hyperparameter tuning and model selection of existing unsupervised approaches without requiring access to ground truth attribute labels, which are not available for most datasets. This paper addresses this problem by introducing a simple yet robust and reliable method for unsupervised disentangled model selection. Our approach, Unsupervised Disentanglement Ranking (UDR), leverages the recent theoretical results that explain why variational autoencoders disentangle (Rolinek et al., 2019) to quantify the quality of disentanglement by performing pairwise comparisons between trained model representations. We show that our approach performs comparably to the existing supervised alternatives across 5400 models from six state of the art unsupervised disentangled representation learning model classes. Furthermore, we show that the ranking produced by our approach correlates well with the final task performance on two different domains.
1 Introduction

Happy families are all alike; every unhappy family is unhappy in its own way. (Leo Tolstoy, Anna Karenina)

Despite the success of deep learning in recent years (Hu et al., 2018; Espeholt et al., 2018; Silver et al., 2018; Lample et al., 2018; Hessel et al., 2017; Oord et al., 2016), the majority of state of the art approaches are still missing many basic yet important properties, such as fairness, data efficient learning, strong generalisation beyond the training data distribution, or the ability to transfer knowledge between tasks (Lake et al., 2016; Garnelo et al., 2016; Marcus, 2018). The idea that a good representation can help with such shortcomings is not new, and recently a number of papers have demonstrated that models with disentangled representations show improvements in terms of these shortcomings (Higgins et al., 2017b; 2018b; Achille et al., 2018; Steenbrugge et al., 2018; Nair et al., 2018; Laversanne-Finot et al., 2018; van Steenkiste et al., 2019; Locatello et al., 2019). A common intuitive way to think about disentangled representations is that they should reflect the compositional

* Equal contribution. We have released the code for our method as part of disentanglement_lib.
Figure 1: Latent traversals for one of the best (UDR = 0.806) and one of the worst (UDR = 0.204) ranked trained β-VAE models using the Unsupervised Disentanglement Ranking (UDR_L) method on the 3D Cars dataset. For each seed image we fix all latents z_i to the inferred value, then vary the value of one latent at a time to visualise its effect on the reconstructions. The high scoring model (left 3 blocks) appears well disentangled, since individual latents have consistent semantic meaning across seeds. The low scoring model (right block) is highly entangled, since the latent traversals are not easily interpretable.

structure of the world. For example, to describe an object we often use words pertaining to its colour, position, shape and size. We can use different words to describe these properties because they relate to independent factors of variation in our world, i.e. properties which can be compositionally recombined. Hence a disentangled representation of objects should reflect this by factorising into dimensions which correspond to those properties (Bengio et al., 2013; Higgins et al., 2018a).

The ability to automatically discover the compositional factors of complex real datasets can be of great importance in many practical applications of machine learning and data science. However, it is important to be able to learn such representations in an unsupervised manner, since most interesting datasets do not have their generative factors fully labelled. For a long time scalable unsupervised disentangled representation learning was impossible, until recently a new class of models based on Variational Autoencoders (VAEs) (Kingma & Welling, 2014; Rezende et al., 2014) was developed. These approaches (Higgins et al., 2017a; Burgess et al., 2017; Chen et al., 2018; Kumar et al., 2017; Kim & Mnih, 2018) scale reasonably well and are the current state of the art in unsupervised disentangled representation learning.
However, so far the benefits of these techniques have not been widely exploited because of two major shortcomings. First, the quality of the achieved disentangling is sensitive to the choice of hyperparameters, yet model selection is currently impossible without having access to the ground truth generative process and/or attribute labels, which are required by all the currently existing disentanglement metrics (Higgins et al., 2017a; Kim & Mnih, 2018; Chen et al., 2018; Eastwood & Williams, 2018; Ridgeway & Mozer, 2018). Second, even if one could apply any of the existing disentanglement metrics for model selection, the scores produced by these metrics can vary a lot even for models with the same hyperparameters and trained on the same data (Locatello et al., 2018). While a lot of this variance is explained by the actual quality of the learnt representations, some of it is introduced by the metrics themselves. In particular, all of the existing supervised disentanglement metrics assume a single "canonical" factorisation of the generative factors, any deviation from which is penalised. Such a "canonical" factorisation, however, is not chosen in a principled manner. Indeed, for the majority of datasets, apart from the simplest ones, multiple equally valid disentangled representations may be possible (see Higgins et al. (2018a) for a discussion). For example, the intuitive way that humans reason about colour is in terms of hue and saturation. However, colour may also be represented in RGB, YUV, HSV, HSL, or CIELAB. Any of these representations is as valid as the others, yet only one of them is allowed to be "canonical" by the supervised metrics. Hence, a model that learns to represent colour in a subspace aligned with HSV will be penalised by a supervised metric which assumes that the canonical disentangled representation of colour should be in RGB.
This is despite the fact that both representations are equal in terms of preserving the compositional property at the core of what makes disentangled representations useful (Higgins et al., 2018a). Hence, the field finds itself in a predicament. On the one hand, there exists a set of approaches capable of reasonably scalable unsupervised disentangled representation learning. On the other hand, these models are hard to use in practice, because there is no easy way to do a hyperparameter search and model selection. This paper attempts to bridge this gap. We propose a simple yet effective method for unsupervised model selection for the class of current state-of-the-art VAE-based unsupervised disentangled representation learning methods. Our approach, Unsupervised Disentanglement Ranking (UDR), leverages the recent theoretical results that explain why variational autoencoders disentangle (Rolinek et al., 2019) to quantify the quality of disentanglement by performing pairwise comparisons between trained model representations. We evaluate the validity of our unsupervised model selection metric against the four best existing supervised alternatives reported in the large scale study by Locatello et al. (2018): the β-VAE metric (Higgins et al., 2017a), the FactorVAE metric (Kim & Mnih, 2018), Mutual Information Gap (MIG) (Chen et al., 2018) and DCI Disentanglement scores (Eastwood & Williams, 2018). We do so for all existing state of the art disentangled representation learning approaches: β-VAE (Higgins et al., 2017a), CCI-VAE (Burgess et al., 2017), FactorVAE (Kim & Mnih, 2018), TC-VAE (Chen et al., 2018) and two versions of DIP-VAE (Kumar et al., 2017).
We validate our proposed method on two datasets with fully known generative processes commonly used to evaluate the quality of disentangled representations: dSprites (Matthey et al., 2017) and 3D Shapes (Burgess & Kim, 2018), and show that our unsupervised model selection method is able to match the supervised baselines in terms of guiding a hyperparameter search and picking the most disentangled trained models, both quantitatively and qualitatively. We also apply our approach to the 3D Cars dataset (Reed et al., 2014), where the full set of ground truth attribute labels is not available, and confirm through visual inspection that the ranking produced by our method is meaningful (Fig. 1). Overall we evaluate 6 different model classes, with 6 separate hyperparameter settings and 50 seeds on 3 separate datasets, totalling 5400 models, and show that our method is both accurate and consistent across models and datasets. Finally, we validate that the model ranking produced by our approach correlates well with the final task performance on two recently reported tasks: a classification fairness task (Locatello et al., 2019) and a model-based reinforcement learning (RL) task (Watters et al., 2019). Indeed, on the former our approach outperformed the reported supervised baseline scores.
2 Operational Definition of Disentangling

Given a dataset of observations X = {x_1, ..., x_N}, we assume that there exist a number of plausible generative processes g_i that produce the observations from a small set of corresponding K_i independent generative factors c_i. For each choice of i, g_i : c_n ↦ x_n, where p(c_n) = ∏_{j=1}^{K_i} p(c_n^j). For example, a dataset containing images of an object, which can be of a particular shape and colour, and which can be in particular vertical and horizontal positions, may be created by a generative process that assumes a ground truth disentangled factorisation into shape x colour x position, or shape x hue x saturation x position X x position Y. We operationalise a model as having learnt a disentangled representation if it learns to invert one of the generative processes g_i and recover a latent representation z ∈ R^L, so that it best explains the observed data, p(z, x) ≈ p(c_i, x), and factorises the same way as the corresponding data generative factors c_i. The choice of the generative process can be determined by the interaction between the model class and the observed data distribution p(x), as discussed next in Sec. 3.1.
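The independence assumption p(c_n) = ∏_j p(c_n^j) can be made concrete with a toy generative process. The factor names and the symbolic renderer g below are purely illustrative stand-ins, not part of any dataset used in the paper:

```python
import random

# Hypothetical toy generative process: K = 3 independent factors,
# each drawn from its own marginal, then rendered by g into an
# "observation" x. All names here are illustrative.
FACTORS = {
    "shape": ["square", "ellipse", "heart"],
    "colour": ["red", "green", "blue"],
    "position": [0, 1, 2, 3],
}

def sample_factors(rng):
    # p(c) = prod_j p(c^j): each factor is sampled independently
    return {name: rng.choice(vals) for name, vals in FACTORS.items()}

def g(c):
    # stand-in renderer mapping factors to a (symbolic) observation
    return f"{c['colour']} {c['shape']} at x={c['position']}"

rng = random.Random(0)
c = sample_factors(rng)
x = g(c)
```

A disentangled model would invert g, recovering one latent dimension per factor (up to permutation and sign).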
3 Variational Unsupervised Disentangling

The current state of the art approaches to unsupervised disentangled representation learning are based on the Variational Autoencoder (VAE) framework (Rezende et al., 2014; Kingma & Welling, 2014). VAEs attempt to estimate a lower bound on the joint distribution of the data and the latent factors p(x, z) by optimising the following objective:

L_VAE = E_{p(x)} [ E_{q_φ(z|x)} [log p_θ(x|z)] − KL(q_φ(z|x) || p(z)) ]    (1)

where, in the usual case, the prior p(z) is chosen to be an isotropic unit Gaussian. In order to encourage disentangling, different approaches decompose the objective in Eq. 1 into various terms and change their relative weighting. In this paper we consider six state of the art approaches to unsupervised disentangled representation learning that can be grouped into three broad classes based on how they modify the objective in Eq. 1: 1) β-VAE (Higgins et al., 2017a) and CCI-VAE (Burgess et al., 2017) upweight the KL term; 2) FactorVAE (Kim & Mnih, 2018) and TC-VAE (Chen et al., 2018) introduce a total correlation penalty; and 3) two different implementations of DIP-VAE (-I and -II) (Kumar et al., 2017) penalise the deviation of the marginal posterior from a factorised prior (see Sec. A.4.1 in Supplementary Material for details).
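For a diagonal-Gaussian posterior and a unit-Gaussian prior, the KL term in Eq. 1 has a closed form, which makes the objective easy to sketch. The snippet below shows the negated objective (a loss to minimise) with the β-VAE-style reweighting of the KL term; the function names and arguments are illustrative, and the reconstruction term is assumed to be estimated elsewhere:

```python
import math

def kl_to_unit_gaussian(mu, log_var):
    # Closed-form KL(q(z|x) || N(0, I)) for a diagonal-Gaussian
    # posterior with mean mu and log-variance log_var (per dimension):
    # 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2)
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def beta_vae_loss(recon_log_lik, mu, log_var, beta=1.0):
    # Negative of Eq. 1 with the beta-VAE reweighting of the KL term;
    # beta = 1 recovers the plain VAE objective. recon_log_lik stands
    # in for a Monte Carlo estimate of E_q[log p(x|z)].
    return -recon_log_lik + beta * kl_to_unit_gaussian(mu, log_var)
```

With mu = 0 and log_var = 0 the posterior equals the prior, so the KL term vanishes and only the reconstruction term remains; this is exactly the "switched off" latent behaviour discussed in Sec. 3.1.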
3.1 Why do VAEs disentangle?

In order to understand the reasoning behind our proposed unsupervised disentangled model selection method, it is first important to understand why VAEs disentangle. The objective in Eq. 1 does not in itself encourage disentangling, as discussed in Rolinek et al. (2019) and Locatello et al. (2018). Indeed, any rotationally invariant prior makes disentangled representations learnt in an unsupervised setting unidentifiable when optimising Eq. 1. This theoretical result is not surprising and has been known for a while in the ICA literature (Comon, 1994); what is surprising is that disentangling VAEs appear to work in practice. The question of what makes VAEs disentangle was answered by Rolinek et al. (2019), who were able to show that it is the peculiarities of the VAE implementation choices that allow disentangling to emerge (see also the discussion in Burgess et al. (2017); Mathieu et al. (2019)). In particular, the interactions between the reconstruction objective (the first term in Eq. 1) and the enhanced pressure to match a diagonal prior created by the modified objectives of the disentangling VAEs force the decoder to approximate PCA-like behaviour locally around the data samples. Rolinek et al. (2019) demonstrated that during training VAEs often enter the so-called "polarised regime", where many of the latent dimensions of the posterior are effectively switched off by being reduced to the prior, q_φ(z_j) = p(z_j) (this behaviour is often further encouraged by the extra disentangling terms added to the ELBO). When trained in such a regime, a linear approximation of the Jacobian of the decoder around a data sample x_i, J_i = ∂Dec_θ(µ_φ(x_i)) / ∂µ_φ(x_i), is forced to have orthogonal columns, and hence to separate the generative factors based on the amount of reconstruction variance they induce. Given that the transformations induced by different generative factors will typically have different effects on the pixel space (e.g.
changing the position of a sprite will typically affect more pixels than changing its size), such local orthogonalisation of the decoder induces an identifiable disentangled latent space for each particular dataset. An equivalent statement is that for a well disentangled VAE, the SVD decomposition J = UΣV^T of the Jacobian J calculated as above results in a trivial V, which is a signed permutation matrix.
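This check can be sketched numerically: approximate J_i by finite differences around a latent mean, take its SVD, and test whether |V| has a single dominant entry per row and column. The linear decoder used below is a stand-in, and the tolerance is an arbitrary choice, not a value from the paper:

```python
import numpy as np

def local_jacobian(decoder, mu, eps=1e-4):
    # Finite-difference approximation of J_i = d Dec(mu) / d mu
    mu = np.asarray(mu, dtype=float)
    base = decoder(mu)
    cols = []
    for j in range(mu.size):
        d = np.zeros_like(mu)
        d[j] = eps
        cols.append((decoder(mu + d) - base) / eps)
    return np.stack(cols, axis=1)  # shape: (pixels, latents)

def v_is_signed_permutation(J, tol=0.1):
    # For a well-disentangled VAE, J = U S V^T should have V close to
    # a signed permutation: one entry of magnitude ~1 per row/column.
    _, _, vt = np.linalg.svd(J, full_matrices=False)
    absv = np.abs(vt)
    return bool(np.all(absv.max(axis=0) > 1 - tol)
                and np.all((absv > tol).sum(axis=0) == 1))
```

A linear decoder with orthogonal columns of unequal norm (axis-aligned factors of different reconstruction variance) passes the check; a shearing decoder that mixes latents does not.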
4 Unsupervised Disentangled Model Selection

We now describe how the insights from Sec. 3.1 motivate the development of our proposed Unsupervised Disentanglement Ranking (UDR) method. Our method relies on the assumption that, for a particular dataset and a VAE-based unsupervised disentangled representation learning model class, disentangled representations are all alike, while every entangled representation is entangled in its own way, to rephrase Tolstoy. We justify this assumption next.
Disentangled representations are similar
According to Rolinek et al. (2019), for a given non-adversarial dataset a disentangling VAE will likely keep converging to the same disentangled representation (up to permutation and sign inverse). Note that this representation will correspond to a single plausible disentangled generative process g_i, using the notation we introduced in Sec. 2. This is because any two different disentangled representations z_a and z_b learnt by a VAE-based model will only differ in terms of the corresponding signed permutation matrices V_a and V_b of the SVD decompositions of the locally linear approximations of the Jacobians of their decoders.

Entangled representations are different
Unfortunately the field of machine learning has little theoretical understanding of the nature and learning dynamics of internal representations in neural networks. The few pieces of research that have looked into the nature of model representations (Raghu et al., 2017; Li et al., 2016; Wang et al., 2018; Morcos et al., 2018) have been empirical rather than theoretical in nature. All of them suggest that neural networks tend to converge to different hidden representations despite being trained on the same task with the same hyperparameters and architecture and reaching similar levels of task performance. Furthermore, the theoretical analysis and the empirical demonstrations in Rolinek et al. (2019) suggest that entangled VAEs learn representations that are different at least up to a rotation induced by a non-degenerate matrix V in the SVD decomposition of the local linear approximation of the decoder Jacobian J_i.

The justifications presented above rely on the theoretical work of Rolinek et al. (2019), which was empirically verified only for the β-VAE. We have reasons to believe that the theory also holds for the other model classes presented in this paper, apart from DIP-VAE-I. We empirically verify that this is the case in Sec. A.10 in Supplementary Materials. Furthermore, in Sec. 5 we show that our proposed method works well in practice across all model classes, including DIP-VAE-I.
Unsupervised Disentanglement Ranking

Our proposed UDR method consists of four steps (illustrated in Fig. 4 in Supplementary Material):

1. Train M = H × S models, where H is the number of different hyperparameter settings, and S is the number of different initial model weight configurations (seeds).
2. For each trained model i ∈ {1, ..., M}, sample without replacement P ≤ S other trained models with the same hyperparameters but different seeds.
3. Perform P pairwise comparisons per trained model and calculate the respective UDR_ij scores, where i ∈ {1, ..., M} is the model index, and j ∈ {1, ..., P} is its unique pairwise match from Step 2.
4. Aggregate the UDR_ij scores for each model i to report the final UDR_i = avg_j(UDR_ij) scores, where avg_j(·) is the median over the P scores.

The key part of the UDR method is Step 3, where we calculate the UDR_ij score that summarises how similar the representations of the two models i and j are. As per the justifications above, two latent representations z_i and z_j should be scored as highly similar if they axis align with each other up to permutation (the same ground truth factor c_k may be encoded by different latent dimensions within the two models, z_{i,a} and z_{j,b} where a ≠ b), sign inverse (the two models may learn to encode the values of the generative factor in the opposite order compared to each other, z_{i,a} = −z_{j,b}), and subsetting (one model may learn a subset of the factors that the other model has learnt if the relevant disentangling hyperparameters encourage a different number of latents to be switched off in the two models). In order for the UDR to be invariant to the first scenario, we propose calculating a full L × L similarity matrix R_ij between the individual dimensions of z_i ∈ R^L and z_j ∈ R^L (see Fig. 5 in Supplementary Material).
In order to address the second point, we take the absolute value of the similarity matrix, |R_ij|. Finally, to address the third point, we divide the UDR score by the average number of informative latents discovered by the two models. Note that even though disentangling often happens when the VAEs enter the "polarised regime", where many of the latent dimensions are switched off, the rankings produced by UDR are not affected by whether the model operates in such a regime or not.

To populate the similarity matrix R_ij we calculate each matrix element as the similarity between two vectors z_{i,a} and z_{j,b}, where z_{i,a} is the response of a single latent dimension z_a of model i over the entire ordered dataset, or over a fixed number of ordered mini-batches if the former is computationally restrictive (see Sec. A.5 in Supplementary Material for details). We considered two versions of the UDR score based on the method used for calculating the vector similarity: the non-parametric UDR_S, using Spearman's correlation; and the parametric UDR_L, using Lasso regression following past work on evaluating representations (Eastwood & Williams, 2018; Li et al., 2016). In practice the Lasso regression version worked slightly better, so the remainder of the paper is restricted to UDR_L (we use UDR_L and UDR interchangeably to refer to this version), while UDR_S is discussed in the Supplementary Materials.

Given a similarity matrix R_ij, we want to find a one-to-one correspondence between all the informative latent dimensions within the chosen pair of models. Hence, we want to see at most a single strong correlation in each row and column of the similarity matrix. To that end, we step through the matrix R = |R_ij|, one column and row at a time, looking for the strongest correlation, and weighting it by the proportional weight it has within its respective column or row.
We then average all such weighted scores over all the informative row and column latents to calculate the final UDR_ij score:

UDR_ij = 1/(d_a + d_b) [ Σ_b (r_b^2 · I_KL(b)) / Σ_a R(a, b) + Σ_a (r_a^2 · I_KL(a)) / Σ_b R(a, b) ]    (2)

where r_b = max_a R(a, b) and r_a = max_b R(a, b). I_KL indicates an "informative" latent within a model and d is the number of such latents: d_a = Σ_a I_KL(a) and d_b = Σ_b I_KL(b). We define a latent dimension as "informative" if it has learnt a latent posterior which diverges from the prior:

I_KL(a) = 1 if KL(q_φ(z_a|x) || p(z_a)) > δ, and 0 otherwise,    (3)

where δ is a small fixed threshold.
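Eqs. 2 and 3 can be sketched as follows, assuming the absolute similarity matrix R and the per-dimension KL values have been computed elsewhere. The function name and the default kl_threshold are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def udr_pair_score(R, kl_i, kl_j, kl_threshold=0.01):
    # Sketch of the pairwise UDR_ij score (Eqs. 2-3). R is the L x L
    # similarity matrix between latents of model i (rows) and model j
    # (columns); kl_i / kl_j hold KL(q(z_a|x) || p(z_a)) per dimension.
    R = np.abs(np.asarray(R, dtype=float))
    informative_i = np.asarray(kl_i) > kl_threshold  # I_KL for rows
    informative_j = np.asarray(kl_j) > kl_threshold  # I_KL for columns
    d_i, d_j = informative_i.sum(), informative_j.sum()
    eps = 1e-12  # guard against all-zero rows/columns
    # strongest correlation per column, weighted by its share of the column
    col_term = ((R.max(axis=0) ** 2 / (R.sum(axis=0) + eps))
                * informative_j).sum()
    # strongest correlation per row, weighted by its share of the row
    row_term = ((R.max(axis=1) ** 2 / (R.sum(axis=1) + eps))
                * informative_i).sum()
    return float((col_term + row_term) / (d_i + d_j))
```

An identity-like similarity matrix (one-to-one latent alignment) scores near 1, while a uniformly spread matrix (each latent correlated with many) scores much lower, matching the intended behaviour of Eq. 2.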
UDR variations

We explored whether doing all-to-all pairwise comparisons, with the models in Step 2 sampled from the set of all M models rather than the subset of S models with the same hyperparameters, would produce more accurate results (we refer to this variant as UDR-A2A). Additionally, we investigated the effect of choosing different numbers of models P for pairwise comparisons by sampling P ∼ U[5, 45].
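Step 2 and its all-to-all variant amount to drawing comparison partners without replacement from different pools. A minimal sketch, where model_ids and the hyper_of mapping are hypothetical bookkeeping rather than anything prescribed by the method:

```python
import random

def sample_pairs(model_ids, hyper_of, i, P, all_to_all=False, seed=0):
    # Step 2 of UDR: for model i, draw P comparison partners without
    # replacement. Plain UDR restricts partners to models sharing i's
    # hyperparameter setting (different seeds); UDR-A2A draws from all
    # trained models. hyper_of maps model id -> hyperparameter setting.
    rng = random.Random(seed)
    if all_to_all:
        pool = [m for m in model_ids if m != i]
    else:
        pool = [m for m in model_ids
                if m != i and hyper_of[m] == hyper_of[i]]
    return rng.sample(pool, min(P, len(pool)))
```

Each returned partner j then contributes one UDR_ij score, and the per-model score is the median over the P comparisons (Step 4).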
Figure 2: Hyperparameter search results for six unsupervised disentangling model classes evaluated using the unsupervised UDR and the supervised β-VAE, FactorVAE, MIG and DCI Disentangling metrics, trained on either dSprites (top) or 3D Shapes (bottom) datasets. "Hyper" corresponds to the particular hyperparameter setting considered (see Tbl. 5 in Supplementary Materials for the particular values). The box and whisker plots for each hyperparameter setting summarise the scores for 50 different model seeds. Higher median values indicate better hyperparameters. The ranking of hyperparameters tends to be similar between the different metrics, including UDR.
UDR assumptions and limitations

Note that our approach is aimed at the current state of the art disentangling VAEs, for which the assumptions of our metric have been demonstrated to hold (Rolinek et al., 2019). It may be applied to other model classes, however the following assumptions and limitations need to be considered:
1. Disentangled representations produced by two models from the same class trained on the same dataset are likely to be more similar than entangled representations – this holds for disentangling VAEs (Rolinek et al., 2019), but may not hold more broadly.
2. Continuous, monotonic and scalar factors – UDR assumes that these properties hold for the data generative factors and their representations. This is true for the disentangling approaches described in Sec. 3, but may not hold more generally. It is likely that UDR can be adapted to work with other kinds of generative factors (e.g. factors with special or no geometry) by exchanging the similarity calculations in Step 3 with an appropriate measure; however, we leave this for future work.
Model class | dSprites Lasso (Hyper / All-2-all) | dSprites Spearman (Hyper / All-2-all) | 3D Shapes Lasso (Hyper / All-2-all) | 3D Shapes Spearman (Hyper / All-2-all)
β-VAE       | 0.60 / 0.76 | 0.54 / 0.72 | 0.71 / 0.68 | 0.70 / 0.71
TC-VAE      | 0.40 / 0.67 | 0.37 / 0.60 | 0.81 / 0.79 | 0.81 / 0.75
DIP-VAE     | 0.61 / 0.69 | 0.65 / 0.72 | 0.75 / 0.74 | 0.75 / 0.78

Table 1: Rank correlations between MIG and different versions of UDR across two datasets and three model classes. The performance is comparable across datasets, UDR versions and model classes. See Fig. 6 in Supplementary Materials for comparisons with other supervised metrics.
3. Herd effect – since UDR detects disentangled representations through pairwise comparisons, the score it assigns to each individual model will depend on the nature of the other models involved in these comparisons. This means that UDR is unable to detect a single disentangled model within a hyperparameter sweep. It also means that when models are only compared within a single hyperparameter setting, individual model scores may be over- or under-estimated, as they tend to be drawn towards the mean of the scores of the other models within a hyperparameter group. Thus, it is preferable to perform UDR-A2A during model selection and UDR during hyperparameter selection.
4. Explicitness bias – UDR does not penalise models that learn a subset of the data generative factors. In fact, such models often score higher than those that learn the full set of generative factors, because the current state of the art disentangling approaches tend to trade off the number of discovered factors for cleaner disentangling. As discussed in Sec. 2, we provide the practitioner with the ability to choose the most disentangled model per number of factors discovered, approximating this with the d score in Eq. 2.

5. Computational cost – UDR requires training a number of seeds per hyperparameter setting and M × P pairwise comparisons per hyperparameter search, which may be computationally expensive. That said, training multiple seeds per hyperparameter setting is good research practice that produces more robust results, and the UDR computations are highly parallelisable.

To summarise, UDR relies on a number of assumptions and has certain limitations that we hope to relax in future work. However, it offers improvements over the existing supervised metrics. Apart from being the only method that does not rely on supervised attribute labels, its scores are often more representative of the true disentanglement quality (e.g. see Fig. 3 and Fig. 9 in Supplementary Materials), and it does not assume a single "canonical" disentangled factorisation per dataset. Hence, we believe that UDR can be a useful method for unlocking the power of unsupervised disentangled representation learning for real-life practical applications, at least in the near future.
5 Experiments

Our hope was to develop a method for unsupervised disentangled model selection with the following properties: it should 1) help with hyperparameter tuning by producing an aggregate score that can be used to guide evolutionary or Bayesian methods (Jaderberg et al., 2018; Snoek et al., 2012; Thornton et al., 2012; Bergstra et al., 2011; Hutter et al., 2011; Miikkulainen et al., 2017); 2) rank individual trained models based on their disentanglement quality; and 3) correlate well with final task performance. In this section we evaluate our proposed UDR against these qualities. For the reported experiments we use the trained model checkpoints and supervised scores from Locatello et al. (2018) to evaluate β-VAE, CCI-VAE, FactorVAE, TC-VAE, DIP-VAE-I and DIP-VAE-II on two benchmark datasets: dSprites (Matthey et al., 2017) and 3D Shapes (Burgess & Kim, 2018) (see Sec. A.3 for details). Each model is trained with H = 6 different hyperparameter settings (detailed in Sec. A.4.1 in Supplementary Material), with S = 50 seeds per setting, and P = 50 pairwise comparisons.

UDR correlates well with the supervised metrics.
To validate UDR, we calculate Spearman's correlation between its model ranking and that produced by the four existing supervised disentanglement metrics found to be the most meaningful in the large scale comparison study by Locatello et al. (2018): the original β-VAE metric (Higgins et al., 2017a), the FactorVAE metric (Kim & Mnih, 2018), Mutual Information Gap (MIG) (Chen et al., 2018) and DCI Disentanglement (Eastwood & Williams, 2018) (see Sec. A.6 in Supplementary Material for metric details).

Figure 3: Latent traversals of the top ranked trained DIP-VAE-I, TC-VAE, CCI-VAE and β-VAE according to the UDR method. At the top of each plot the two presented scores are UDR/FactorVAE metric. Note that the FactorVAE metric scores visually entangled models very highly. d is the number of informative latents. The uninformative latents are greyed out.

The average correlation for UDR is . ± . and for UDR-A2A is . ± . . This is comparable to the average Spearman's correlation between the model rankings produced by the different supervised metrics: . ± . . The variance in rankings produced by the different metrics is explained by the fact that the metrics capture different aspects of disentangling (see Sec. A.2 in Supplementary Materials for a discussion of how UDR relates to other representation comparison methods). Tbl. 1 provides a breakdown of correlation scores between MIG and the different versions of UDR for different model classes and datasets. It is clear that the different versions of UDR perform similarly to each other, and this holds across datasets and model classes. Note that unlike the supervised metrics, UDR does not assume a "canonical" disentangled representation. Instead, it allows any one of the many equivalent possible ground truth generative processes to become the "canonical" one for each particular dataset and model class, as per the theoretical results by Rolinek et al. (2019) summarised in Sec. 3.1.
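This kind of validation reduces to computing Spearman's ρ between two score vectors over the same set of trained models (UDR scores versus a supervised metric's scores). A dependency-free sketch is below; in practice scipy.stats.spearmanr would typically be used instead:

```python
def rankdata(xs):
    # average ranks (1-based), with ties sharing their mean rank
    order = sorted(range(len(xs)), key=lambda k: xs[k])
    ranks = [0.0] * len(xs)
    k = 0
    while k < len(order):
        j = k
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[k]]:
            j += 1
        avg = (k + j) / 2.0 + 1.0
        for t in range(k, j + 1):
            ranks[order[t]] = avg
        k = j + 1
    return ranks

def spearman(a, b):
    # Spearman's rho = Pearson correlation of the rank vectors
    ra, rb = rankdata(a), rankdata(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)
```

Because it operates on ranks, any monotonic relationship between the two metrics' scores yields ρ = 1, which is the right notion of agreement for model rankings.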
UDR is useful for hyperparameter selection.
Fig. 2 compares the scores produced by UDR and the four supervised metrics for 3600 trained models, split over six model classes, two datasets and six hyperparameter settings. We consider the median score profiles across the six hyperparameter settings to evaluate whether a particular setting is better than others. It can be seen that UDR broadly agrees with the supervised metrics on which hyperparameters are more promising for disentangling. This holds across datasets and model classes. Hence, UDR may be useful for evaluating model fitness for disentangled representation learning as part of an evolutionary algorithm or Bayesian hyperparameter tuning.
UDR is useful for model selection.
Fig. 2 can also be used to examine whether a particular trained model has learnt a good disentangled representation. We see that some models reach high UDR scores. For example, more models score highly as the value of the β hyperparameter is increased in the β-VAE model class. This is in line with previously reported results (Higgins et al., 2017a). Note that the 0th hyperparameter setting in this case corresponds to β = 1, which is equivalent to the standard VAE objective (Kingma & Welling, 2014; Rezende et al., 2014). As expected, these models score low in terms of disentangling. We also see that for some model classes (e.g. DIP-VAE-I, DIP-VAE-II and FactorVAE on dSprites) no trained model scores highly according to UDR. This suggests that none of the hyperparameter choices explored were good for this particular dataset, and that no instance of the model class learnt to disentangle well. To test this, we use latent traversals to qualitatively evaluate the level of disentanglement achieved by the models, ranked by their UDR scores. This is a common technique to qualitatively evaluate the level of disentanglement on simple visual datasets where no ground truth attribute labels are available. Such traversals involve changing the value of one latent dimension at a time and evaluating its effect on the resulting reconstructions, to understand whether the latent has learnt to represent anything semantically meaningful. Fig. 3 demonstrates that the qualitative disentanglement quality is reflected well in the UDR scores. The figure also highlights that the supervised metric scores can sometimes be overoptimistic. For example, compare the TC-VAE and β-VAE traversals in Fig. 3. These are scored similarly by the supervised metric (0.774 and 0.751) but differently by UDR (0.444 and 0.607). Qualitative evaluation of the traversals clearly shows that β-VAE learnt a more disentangled representation than TC-VAE, which is captured by UDR but not by the supervised metric. Fig.
9 in Supplementary Material pro-vides more examples. We also evaluated how well UDR ranks models trained on more complex datasets,CelebA and ImageNet, and found that it performs well (see Sec. A.9 in Supplementary Materials).8ublished as a conference paper at ICLR 2020 S AMPLE P ) 5 10 15 20 25 30 35 40 45C ORRELATION . ± .
07 0 . ± .
03 0 . ± .
05 0 . ± .
03 0 . ± .
03 0 . ± .
02 0 . ± .
02 0 . ± .
01 0 . ± . Table 2: Rank correlations of the UDR score with the β -VAE metric on the dSprites dataset for a β -VAEhyperparameter search as the number of pairwise comparisons P per model were changed. UDR works well even with five pairwise comparisons.
We test the effect of the number of pairwise comparisons P on the variance and accuracy of the UDR scores. Tbl. 2 reports the changes in the rank correlation with the β-VAE metric on the dSprites dataset as P is varied between 5 and 45. We see that the correlation between UDR and the β-VAE metric becomes higher and the variance decreases as the number of comparisons is increased. However, even with P = 5 the correlation is reasonable.

UDR generalises to a dataset with no attribute labels.
We investigate whether UDR can be useful for selecting well disentangled models trained on the 3D Cars (Reed et al., 2014) dataset, whose poorly labelled attributes make it a bad fit for supervised disentanglement metrics. Fig. 1 shows that a highly ranked model according to UDR appears disentangled, while a poorly ranked one appears entangled. Fig. 10 in Supplementary Material provides more examples of high and low scoring models according to the UDR method.
UDR predicts final task performance.
We developed UDR to help practitioners use disentangled representations to better solve subsequent tasks. Hence, we evaluate whether the model ranking produced by UDR correlates with task performance in two different domains: fairness on the classification task introduced by Locatello et al. (2018), and data efficiency on a clustering task for the model-based reinforcement learning agent introduced by Watters et al. (2019) (see Sec. A.8 in Supplementary Materials for more details). We found that UDR had an average Spearman correlation of 0.8 with the fairness scores, which is higher than the average correlation of 0.72 between fairness and the supervised scores reported by Locatello et al. (2018). We also found that UDR scores had a 0.56 Spearman correlation with the data efficiency of the COBRA agent. The difference between the best and the worst models according to UDR amounted to around a 66% reduction in the number of steps needed to reach a 90% success rate on the task.
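For reference, the Spearman rank correlations quoted above can be computed by ranking both score lists and taking the Pearson correlation of the ranks. A minimal pure-Python sketch (the function and variable names are ours, purely for illustration):

```python
# Spearman rank correlation: Pearson correlation of the rank-transformed data.
def spearman(x, y):
    def rank(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        i = 0
        while i < len(v):
            j = i
            # group ties and assign them the average rank
            while j + 1 < len(v) and v[order[j + 1]] == v[order[i]]:
                j += 1
            avg = (i + j) / 2 + 1
            for k in range(i, j + 1):
                r[order[k]] = avg
            i = j + 1
        return r

    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return cov / var
```

For example, `spearman([1, 2, 3], [1, 4, 9])` is 1.0: the relationship is monotonic even though it is not linear, which is exactly why rank correlations are the natural choice for comparing model rankings with task performance.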
CONCLUSION
We have introduced UDR, the first method for unsupervised model selection for variational disentangled representation learning. We have validated our approach on 5400 models covering all six state of the art VAE-based unsupervised disentangled representation learning model classes. We compared UDR to four existing supervised disentanglement metrics both quantitatively and qualitatively, and demonstrated that our approach works reliably well across three different datasets, often ranking models more accurately than the supervised alternatives. Moreover, UDR avoids one of the big shortcomings of the supervised disentangling metrics, namely the arbitrary choice of a "canonical" disentangled factorisation, instead allowing any of the equally valid disentangled generative processes to be accepted. Finally, we also demonstrated that UDR is useful for predicting final task performance in two different domains. Hence, we hope that UDR can be a step towards unlocking the power of unsupervised disentangled representation learning for real-life applications.

ACKNOWLEDGEMENTS
We thank Olivier Bachem and Francesco Locatello for helping us re-use their code and model checkpoints, and Neil Rabinowitz, Avraham Ruderman and Tatjana Chavdarova for useful feedback.

REFERENCES
Alessandro Achille, Tom Eccles, Loic Matthey, Christopher P Burgess, Nick Watters, Alexander Lerchner, and Irina Higgins. Life-long disentangled representation learning with cross-domain latent homologies. NIPS, 2018.
Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
James Bergstra, Remi Bardenet, Yoshua Bengio, and Balazs Kegl. Algorithms for hyper-parameter optimization. NIPS, 2011.
Chris Burgess and Hyunjik Kim. 3D Shapes dataset. https://github.com/deepmind/3dshapes-dataset/, 2018.
Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE. NIPS Workshop on Learning Disentangled Features, 2017.
Christopher P Burgess, Loic Matthey, Nick Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick, and Alexander Lerchner. MONet: Unsupervised scene decomposition and representation. arXiv preprint, January 2019.
Tian Qi Chen, Xuechen Li, Roger Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. NIPS, 2018.
Taco Cohen and Max Welling. Group equivariant convolutional networks. ICML, 2016.
Pierre Comon. Independent component analysis, a new concept? Signal Processing, 36:287–314, 1994.
Cian Eastwood and Christopher K. I. Williams. A framework for the quantitative evaluation of disentangled representations. ICLR, 2018.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymyr Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, Shane Legg, and Koray Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. arXiv, 2018.
Marta Garnelo, Kai Arulkumaran, and Murray Shanahan. Towards deep symbolic reinforcement learning. arXiv preprint arXiv:1609.05518, 2016.
Robert Gens and Pedro M. Domingos. Deep symmetry networks. NIPS, 2014.
David R. Hardoon, Sandor Szedmak, and John Shawe-Taylor. Canonical correlation analysis: an overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv, 2017.
Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. β-VAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017a.
Irina Higgins, Arka Pal, Andrei Rusu, Loic Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. ICML, 2017b.
Irina Higgins, David Amos, David Pfau, Sebastien Racaniere, Loic Matthey, Danilo Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. arXiv, 2018a.
Irina Higgins, Nicolas Sonnerat, Loic Matthey, Arka Pal, Christopher Burgess, Matko Bosnjak, Murray Shanahan, Matthew Botvinick, Demis Hassabis, and Alexander Lerchner. SCAN: Learning hierarchical compositional visual concepts. ICLR, 2018b.
Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. CVPR, 2018.
Frank Hutter, Holger Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. Learning and Intelligent Optimization, 2011.
Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M. Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, Chrisantha Fernando, and Koray Kavukcuoglu. Population based training of neural networks. arXiv, 2018.
Hyunjik Kim and Andriy Mnih. Disentangling by factorising. ICLR, 2018.
Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.
Nikolaus Kriegeskorte, Marieke Mur, and Peter Bandettini. Representational similarity analysis – connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience, 4(2), 2008.
Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. arXiv, 2017.
Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, pp. 1–101, 2016.
Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato. Phrase-based & neural unsupervised machine translation. arXiv, 2018.
Adrien Laversanne-Finot, Alexandre Péré, and Pierre-Yves Oudeyer. Curiosity driven exploration of learned disentangled goal spaces. arXiv, 2018.
Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John Hopcroft. Convergent learning: Do different neural networks learn the same representations? ICLR, 2016.
Francesco Locatello, Stefan Bauer, Mario Lucic, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. ICML, 97:4114–4124, 2018.
Francesco Locatello, Gabriele Abbati, Tom Rainforth, Stefan Bauer, Bernhard Schölkopf, and Olivier Bachem. On the fairness of disentangled representations. arXiv, 2019.
Gary Marcus. Deep learning: A critical appraisal. arXiv, 2018.
Emile Mathieu, Tom Rainforth, N. Siddharth, and Yee Whye Teh. Disentangling disentanglement in variational autoencoders. ICML, 2019.
Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset, 2017. URL https://github.com/deepmind/dsprites-dataset/.
Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju, Hormoz Shahrzad, Arshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. arXiv, 2017.
Ari S. Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in neural networks with canonical correlation. NIPS, 2018.
Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, and Sergey Levine. Visual reinforcement learning with imagined goals. arXiv, 2018.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. NIPS, 2017.
Scott Reed, Kihyuk Sohn, Yuting Zhang, and Honglak Lee. Learning to disentangle factors of variation with manifold interaction. ICML, 2014.
Danilo J Rezende and Fabio Viola. Generalized ELBO with constrained optimization, GECO. Workshop on Bayesian Deep Learning, NeurIPS, 2018.
Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 32(2):1278–1286, 2014.
Karl Ridgeway and Michael C Mozer. Learning deep disentangled embeddings with the F-statistic loss. NIPS, 2018.
Michal Rolinek, Dominik Zietlow, and Georg Martius. Variational autoencoders pursue PCA directions (by accident). CVPR, 2019.
Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–869, 1992.
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419):1140–1144, 2018. doi: 10.1126/science.aar6404.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. arXiv, 2012.
Stefano Soatto. Steps toward a theory of visual information. Technical Report UCLA-CSD100028, 2010.
Xander Steenbrugge, Sam Leroux, Tim Verbelen, and Bart Dhoedt. Improving generalization for abstract reasoning tasks using disentangled feature representations. arXiv, 2018.
Raphael Suter, Dorde Miladinovic, Stefan Bauer, and Bernhard Scholkopf. Interventional robustness of deep latent variable models. arXiv, 2018.
C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Auto-WEKA: Combined selection and hyperparameter optimization of classification algorithms. arXiv, 2012.
Sjoerd van Steenkiste, Francesco Locatello, Jürgen Schmidhuber, and Olivier Bachem. Are disentangled representations helpful for abstract visual reasoning? arXiv, 2019.
Liwei Wang, Lunjia Hu, Jiayuan Gu, Yue Wu, Zhiqiang Hu, Kun He, and John Hopcroft. Towards understanding learning representations: To what extent do different neural networks learn the same representation. NeurIPS, 2018.
Nicholas Watters, Loic Matthey, Matko Bosnjak, Christopher P. Burgess, and Alexander Lerchner. COBRA: Data-efficient model-based RL through unsupervised object discovery and curiosity-driven exploration. arXiv, 2019.
A SUPPLEMENTARY MATERIAL

A.1 USEFUL PROPERTIES OF DISENTANGLED REPRESENTATIONS
Disentangled representations are particularly useful because they re-represent the information contained in the data in a way that enables semantically meaningful compositionality. For example, having discovered that the data is generated using two factors, colour and shape, such a model would be able to support meaningful reasoning about fictitious objects, like pink elephants, despite having never seen one during training (Higgins et al., 2017b; 2018b). This opens up opportunities for counterfactual reasoning, more robust and interpretable inference, and model-based planning (Higgins et al., 2018a; Suter et al., 2018). Furthermore, such a representation would support more data efficient learning for subsequent tasks, like a classification objective for differentiating elephants from cats. This could be achieved by ignoring the nuisance variables irrelevant for the task, e.g. the colour variations, by simply masking out the disentangled subspaces that learnt to represent such nuisances, while only paying attention to the task-relevant subspaces, e.g. the units that learnt to represent shape (Cohen & Welling, 2016; Gens & Domingos, 2014; Soatto, 2010). Hence, the semantically meaningful compositional nature of disentangled representations is perhaps the most sought after aspect of disentangling, due to its strong implications for generalisation, data efficiency and interpretability (Schmidhuber, 1992; Bengio et al., 2013; Higgins et al., 2018a).
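The masking idea above can be illustrated with a toy sketch (the function, dimensions and indices are ours, purely for illustration; a real pipeline would determine the task-relevant subspaces by inspecting the trained model):

```python
# Keep only the task-relevant subspaces of a disentangled code, zeroing the rest.
def mask_nuisance(z, relevant_dims):
    return [v if i in relevant_dims else 0.0 for i, v in enumerate(z)]

# Hypothetical 4-dimensional disentangled code: [colour, shape, position x, position y].
z = [0.3, -1.2, 0.7, 2.1]
shape_only = mask_nuisance(z, {1})  # a shape classifier can ignore colour and position
```

Because each semantic factor lives in its own subspace, this masking is a trivial index selection; with an entangled code, no such simple operation could remove the nuisance information.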
A.2 ASPECTS OF DISENTANGLEMENT MEASURED BY DIFFERENT METRICS
Methods for evaluating and comparing representations have been proposed in the past. The most similar approaches to ours are the DCI Disentanglement score from Eastwood & Williams (2018) and the axis alignment comparison of representations in trained classifiers proposed in Li et al. (2016). The former is not directly applicable for unsupervised disentangled model selection, since it requires access to the ground truth attribute labels. Even when adapted to compare two latent representations, our preliminary experiments suggested that the entropy based aggregation proposed in Eastwood & Williams (2018) is inferior to our aggregation in Eq. 2 when used in the UDR setup. The approach by Li et al. (2016) shares the similarity matrix calculation step with ours; however, they never go beyond that quantitatively, opting for qualitative evaluations of model representations instead. Hence, their approach is not directly applicable to quantitative unsupervised disentangled model ranking. Other related approaches worth mentioning are Canonical Correlation Analysis (CCA) and its modifications (Hardoon et al., 2004; Raghu et al., 2017; Morcos et al., 2018). These approaches, however, tend to be invariant to invertible affine transformations and therefore to the axis alignment of individual neurons, which makes them unsuitable for evaluating disentangling quality. Finally, the Representation Similarity Matrix (RSM) (Kriegeskorte et al., 2008) is a commonly used method in neuroscience for comparing the representations of different systems to the same set of stimuli. This technique, however, is not applicable for measuring disentangling, because it ignores dimension-wise response properties.

When talking about disentangled representations, three properties are generally considered: modularity, compactness and explicitness (Ridgeway & Mozer, 2018).
Modularity measures whether each latent dimension encodes only one data generative factor, compactness measures whether each data generative factor is encoded by a single latent dimension, and explicitness measures whether all the information about the data generative factors can be decoded from the latent representation. We believe that modularity is the key aspect of disentangling, since it measures whether the representation is compositional, which gives disentangled representations the majority of their beneficial properties (see Sec. A.1 in Supplementary Materials for more details). Compactness, on the other hand, may not always be desirable. For example, according to a recent definition of disentangled representations (Higgins et al., 2018a), it is theoretically impossible to represent 3D rotation in a single dimension (see also Ridgeway & Mozer (2018)). Finally, while explicitness is clearly desirable for preserving information about the data that may be useful for subsequent tasks, in practice models often fail to discover and represent the full set of the data generative factors due to restrictions on both the observed data distribution and the model capacity (Mathieu et al., 2019). Hence, we suggest noting the explicitness of a representation, but not necessarily punishing its disentanglement ranking if it is not fully explicit. Instead, we suggest that the practitioner should have the choice to select the most disentangled model given a particular number of discovered generative factors. Hence, in the rest of the paper we will use the term "disentanglement" to refer to the compositional property of a representation related to the modularity measure. Tbl. 3 provides a summary of how the different metrics considered in the paper compare in terms of modularity, compactness and explicitness.
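As a deliberately simplified illustration of the modularity notion, one can score a latent-by-factor importance matrix by how concentrated each latent's importance is on a single factor. This is an illustrative proxy of ours, not the Ridgeway & Mozer (2018) definition:

```python
def modularity_score(importance):
    """Rough modularity proxy for an importance matrix (rows = latents,
    columns = generative factors): the fraction of each latent's total
    absolute importance carried by its single strongest factor, averaged
    over latents. 1.0 = perfectly modular; 1/K = maximally mixed over K factors."""
    scores = []
    for row in importance:
        total = sum(abs(v) for v in row)
        if total > 0:  # skip dead latents with no importance at all
            scores.append(max(abs(v) for v in row) / total)
    return sum(scores) / len(scores)
```

A representation where each latent loads on exactly one factor scores 1.0, while one where every latent loads equally on two factors scores 0.5, matching the intuition that modularity is about one-factor-per-latent rather than one-latent-per-factor.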
A.3 DATASET DETAILS

dSprites    A commonly used unit test for evaluating disentangling is the dSprites dataset (Matthey et al., 2017). This dataset consists of images of a single binary sprite pasted on a blank background and can be fully described by five generative factors: shape (3 values), position x (32 values), position y (32 values), size (6 values) and rotation (40 values). All the generative factors are sampled from a uniform distribution. Rotation is sampled from the full 360 degree range. The generative process for this dataset is fully deterministic, resulting in 737,280 total images produced from the Cartesian product of the generative factors.

(Footnote: similar properties have also been referred to as disentanglement, completeness and informativeness respectively in the independent yet concurrent paper by Eastwood & Williams (2018).)

Table 3: Properties measured by each metric (M = modularity, C = compactness, E = explicitness).

METRIC                M   C   E
β-VAE                 ✓   ×   ✓
FactorVAE             ✓   ✓   ✓
MIG                   ✓   ✓   ✓
DCI Disentanglement   ✓   ×   ×
UDR                   ✓   ×   ×
3D Shapes
A more complex dataset for evaluating disentangling is the 3D Shapes dataset (Burgess & Kim, 2018). This dataset consists of images of a single 3D object in a room and is fully specified by six generative factors: floor colour (10 values), wall colour (10 values), object colour (10 values), size (8 values), shape (4 values) and rotation (15 values). All the generative factors are sampled from a uniform distribution. Colours are sampled from the circular hue space. Rotation is sampled from the [-30, 30] degree angle range.
3D Cars
This dataset was adapted from Reed et al. (2014). The full data generative process for this dataset is unknown. The labelled factors include 199 car models and 24 rotations sampled from the full 360 degree out-of-plane rotation range. An example of an unlabelled generative factor is the colour of the car, which varies across the dataset.
A.4 UNSUPERVISED DISENTANGLED REPRESENTATION LEARNING MODELS
As mentioned in Sec. 3, current state of the art approaches to unsupervised disentangled representation learning are based on the VAE (Kingma & Welling, 2014; Rezende et al., 2014) objective presented in Eq. 1. These approaches decompose the objective in Eq. 1 into various terms and change their relative weighting to exploit the trade-off between the capacity of the latent information bottleneck with independent sources of noise, and the quality of the resulting reconstruction, in order to learn a disentangled representation. The first such modification was introduced by Higgins et al. (2017a) in their β-VAE framework:

E_p(x) [ E_{q_φ(z|x)} [log p_θ(x|z)] − β KL(q_φ(z|x) || p(z)) ]   (4)

In order to achieve disentangling in β-VAE, the KL term in Eq. 4 is typically up-weighted by setting β > 1. This implicitly reduces the latent bottleneck capacity and, through the interaction with the reconstruction term, encourages the generative factors c_k with different reconstruction profiles to be encoded by different independent noisy channels z_l in the latent bottleneck. Building on the β-VAE ideas, CCI-VAE (Burgess et al., 2017) suggested slowly increasing the bottleneck capacity during training, thus improving the final disentanglement and reconstruction quality:

E_p(x) [ E_{q_φ(z|x)} [log p_θ(x|z)] − γ |KL(q_φ(z|x) || p(z)) − C| ]   (5)

Later approaches (Kim & Mnih, 2018; Chen et al., 2018; Kumar et al., 2017) showed that the KL term in Eq. 1 can be further decomposed according to:

E_p(x) [ KL(q_φ(z|x) || p(z)) ] = I(x; z) + KL(q_φ(z) || p(z))   (6)

Hence, penalising the full KL term as in Eqs. 4-5 is not optimal, since it unnecessarily penalises the mutual information between the latents and the data. To remove this undesirable side effect, different authors suggested instead adding more targeted penalty terms to the VAE objective function.
These include different implementations of the total correlation penalty (FactorVAE by Kim & Mnih (2018) and TC-VAE by Chen et al. (2018)):

L_VAE − γ KL(q_φ(z) || ∏_{j=1}^{M} q_φ(z_j))   (7)

and different implementations of the penalty that pushes the marginal posterior towards a factorised prior (DIP-VAE by Kumar et al. (2017)):

L_VAE − γ KL(q_φ(z) || p(z))   (8)

A.4.1 MODEL IMPLEMENTATION DETAILS
We re-used the trained checkpoints from Locatello et al. (2018), so we refer readers to the original paper for model implementation details. Briefly, the following architecture and optimiser were used.
Table 4: Encoder and decoder architectures. (The per-layer details were garbled in extraction; recoverable elements: encoder input of size 64 × 64 × number of channels, decoder input in R^10, 4 × 4 convolutions with ReLU activations, and a 10-dimensional latent output.)

Table 5: Hyperparameters used for each model class.

Model        Parameter             Values
β-VAE        β                     [1, 2, 4, 6, 8, 16]
CCI-VAE      c_max                 [5, 10, 25, 50, 75, 100]
             iteration threshold   100000
             γ                     (value lost in extraction)
FactorVAE    γ                     [10, 20, 30, 40, 50, 100]
DIP-VAE-I    λ_od                  [1, 2, 5, 10, 20, 50]
             λ_d                   (tied to λ_od; multiplier lost in extraction)
DIP-VAE-II   λ_od                  [1, 2, 5, 10, 20, 50]
             λ_d                   (tied to λ_od; multiplier lost in extraction)
TC-VAE       β                     [1, 2, 4, 6, 8, 10]

Table 6: Miscellaneous model details.

(a) Common hyperparameters across all models: batch size 64; latent space dimension 10; Adam optimizer (beta1 = 0.9, beta2 = 0.999, epsilon = 1e-8, learning rate 0.0001); Bernoulli decoder.

(b) FactorVAE discriminator architecture: six FC layers of 1000 units each with leaky ReLU activations, followed by an FC layer with 2 outputs.

(c) FactorVAE discriminator parameters: batch size 64; Adam optimizer (beta1 = 0.5, beta2 = 0.9, epsilon = 1e-8, learning rate 0.0001).
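The regularisation sweeps in Table 5 can be summarised as a simple configuration. Assuming, as the surrounding text implies, that every configuration was trained with 50 seeds on each of the three datasets, the counts recover the 5400 models quoted in the abstract (the dictionary below is our own summary, not the released configuration format; the DIP-VAE λ_d couplings are omitted since they are tied to λ_od rather than swept independently):

```python
# Swept regularisation strengths per model class (from Table 5).
sweep = {
    "beta_vae":   ("beta",      [1, 2, 4, 6, 8, 16]),
    "cci_vae":    ("c_max",     [5, 10, 25, 50, 75, 100]),
    "factor_vae": ("gamma",     [10, 20, 30, 40, 50, 100]),
    "dip_vae_i":  ("lambda_od", [1, 2, 5, 10, 20, 50]),
    "dip_vae_ii": ("lambda_od", [1, 2, 5, 10, 20, 50]),
    "tc_vae":     ("beta",      [1, 2, 4, 6, 8, 10]),
}

seeds, datasets = 50, 3  # 50 seeds per setting; dSprites, 3D Shapes, 3D Cars
n_models = sum(len(values) for _, values in sweep.values()) * seeds * datasets
assert n_models == 5400  # matches the total model count quoted in the abstract
```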
For consistency, all the models were trained using the same architecture, optimiser, and hyperparameters. All of the methods use a deep neural network to encode and decode the latent embedding, and the parameters of the latent factors are predicted using a Gaussian encoder whose architecture is specified in Table 4. All of the models predict a latent vector with 10 factors. Each model was also trained with 6 different levels of regularisation strength, specified in Table 5. The hyperparameter ranges for the various levels of regularisation were chosen to show a diversity of performance on different datasets without relying on pre-existing intuition about good hyperparameters; however, the ranges were based on hyperparameters used previously in the literature. For each of the model classes outlined above, we tried 6 hyperparameter values with 50 seeds each.

β-VAE    The β-VAE (Higgins et al., 2017a) model is similar to the vanilla VAE model but with an additional hyperparameter β that modifies the strength of the KL regulariser:

E_p(x) [ E_{q_φ(z|x)} [log p_θ(x|z)] − β KL(q_φ(z|x) || p(z)) ]   (9)

where a β value of 1 corresponds to the vanilla VAE model. Increasing β enforces a stronger prior on the latent distribution and encourages the representation to be independent.

CCI-VAE
The CCI-VAE model (Burgess et al., 2017) is a variant of the β-VAE where the KL divergence is encouraged to match a controlled value C, which is increased gradually throughout training. This yields the CCI-VAE objective function:

E_p(x) [ E_{q_φ(z|x)} [log p_θ(x|z)] − β |KL(q_φ(z|x) || p(z)) − C| ]   (10)

FactorVAE
FactorVAE (Kim & Mnih, 2018) specifically penalises the dependencies between the latent dimensions, targeting the "Total Correlation" term and yielding a modified version of the β-VAE objective:

E_p(x) [ E_{q_φ(z|x)} [log p_θ(x|z)] − KL(q_φ(z|x) || p(z)) ] − β KL(q(z) || ∏_j q(z_j))   (11)

The "Total Correlation" term is intractable in this case, so FactorVAE uses samples from both q(z|x) and q(z), together with the density-ratio trick, to compute an estimate of it. FactorVAE uses an additional discriminator network to approximate the density ratio in the KL divergence. The implementation details for the discriminator network and its hyperparameters can be found in Tables 5(b) and 5(c).

TC-VAE
The TC-VAE model (Chen et al., 2018), developed independently of FactorVAE, has a similar objective whose KL regulariser contains a "Total Correlation" term. In the case of TC-VAE, the "Total Correlation" term is estimated using a biased Monte-Carlo estimate.
Figure 4: Schematic illustration of the UDR method: (1) train M = H × S models (H hyperparameter settings × S seeds); (2) select P models for pairwise comparison with each of the M models; (3) create M × P similarity matrices and calculate a UDR_ij score for each; (4) aggregate the UDR_ij scores for each of the M models. See details in text.
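The four steps illustrated in Fig. 4 can be sketched in a few lines of Python. The three callables below (train_model, similarity_matrix, udr_from_similarity) are hypothetical placeholders for the heavy lifting, not the released disentanglement_lib API:

```python
import random

def udr_scores(hypers, n_seeds, P, train_model, similarity_matrix, udr_from_similarity):
    """Sketch of the four steps in Fig. 4. The callables stand in for:
    training one model, computing the latent similarity matrix R_ij of two
    trained models, and turning R_ij into a pairwise UDR_ij score."""
    # (1) Train M = H x S models.
    models = [train_model(h, s) for h in hypers for s in range(n_seeds)]
    scores = []
    for i, m_i in enumerate(models):
        # (2) Select P other models for pairwise comparison with model i.
        others = random.sample([m for j, m in enumerate(models) if j != i], P)
        # (3) One similarity matrix and one UDR_ij score per pair.
        pair_scores = [udr_from_similarity(similarity_matrix(m_i, m_j)) for m_j in others]
        # (4) Aggregate the P pairwise scores into a single UDR score for model i.
        scores.append(sum(pair_scores) / P)
    return scores
```

The key design point the sketch exposes is that every trained model is scored only against other trained models: no ground truth labels enter at any step, which is what makes the ranking unsupervised.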
DIP-VAE
The DIP-VAE model also adds regularisation to the aggregated posterior, in the form of an additional loss term that encourages it to match the factorised prior. Since this KL divergence is intractable, other measures of divergence are used instead.
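One such divergence substitutes moment matching on the covariance of the mean encodings. A NumPy sketch in the spirit of the DIP-VAE-I penalty below, with off-diagonal covariance entries pushed towards 0 and diagonal entries towards 1 (an illustration with our own function signature, not the authors' exact implementation):

```python
import numpy as np

def dip_vae_i_penalty(mu, lambda_od, lambda_d):
    """Covariance penalty on the mean encodings mu (shape: batch x latents):
    off-diagonal entries of Cov[mu] are driven to 0 (decorrelation) and
    diagonal entries to 1 (unit variance, matching the factorised prior)."""
    cov = np.cov(mu, rowvar=False)            # latents x latents covariance
    off_diag = cov - np.diag(np.diag(cov))    # zero out the diagonal
    return (lambda_od * np.sum(off_diag ** 2)
            + lambda_d * np.sum((np.diag(cov) - 1.0) ** 2))
```

Because the penalty only needs the batch of posterior means, it is cheap to compute and fully differentiable, which is what makes it a practical stand-in for the intractable KL term.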
Cov_{p(x)}[μ_φ(x)] can be used, yielding the DIP-VAE-I objective:

E_p(x) [ E_{q_φ(z|x)} [log p_θ(x|z)] − KL(q_φ(z|x) || p(z)) ] − λ_od Σ_{i≠j} [Cov_{p(x)}[μ_φ(x)]]²_{ij} − λ_d Σ_i ([Cov_{p(x)}[μ_φ(x)]]_{ii} − 1)²   (12)

or Cov_{q_φ}[z] is used instead, yielding the DIP-VAE-II objective:

E_p(x) [ E_{q_φ(z|x)} [log p_θ(x|z)] − KL(q_φ(z|x) || p(z)) ] − λ_od Σ_{i≠j} [Cov_{q_φ}[z]]²_{ij} − λ_d Σ_i ([Cov_{q_φ}[z]]_{ii} − 1)²   (13)

A.5 UDR IMPLEMENTATION DETAILS
Similarity matrix
To compute the similarity matrix R_ij we follow the approach of Li et al. (2016) and Morcos et al. (2018). For a given dataset X = {x_1, x_2, ..., x_N} and a neuron a ∈ {1, ..., L} of model i (denoted as z_{i,a}), we define z_{i,a} to be the vector of mean inferred posteriors q_i(z_i|x) across the full dataset: z_{i,a} = (z_{i,a}(x_1), ..., z_{i,a}(x_N)) ∈ R^N. Note that this is different from the often considered notion of a "latent representation vector". Here z_{i,a} is the response of a single latent dimension over the entire dataset, not an entire latent response for a single input. We then calculate the similarity between each two such vectors z_{i,a} and z_{j,b} using either Lasso regression or Spearman's correlation.

Lasso regression (UDR_L)    We trained L lasso regressors to predict each of the latent responses z_{i,a} from z_j using the dataset of latent encodings Z_{i,a} = {(z_{j,1}, z_{i,a,1}), ..., (z_{j,N}, z_{i,a,N})}. Each row in R_ij(a) is then filled in using the weights of the trained Lasso regressor for z_{i,a}. The lasso regressors were trained using the default Scikit-learn multi-task lasso objective min_W (1 / (2 n_samples)) ||XW − Y||²_Fro + λ ||W||_21, where Fro is the Frobenius norm, ||A||_Fro = sqrt(Σ_ij a²_ij), and the ℓ21 norm is computed as ||A||_21 = Σ_i sqrt(Σ_j a²_ij). λ is chosen using cross-validation and the lasso is trained until convergence, i.e. until either 1000 iterations have been run or the updates fall below a tolerance of 0.0001. Lasso regressors were trained on a dataset of 10000 latents from each model, and training was performed using coordinate descent over the entire dataset. R_nm is then computed by extracting the weights of the trained lasso regressor and taking their absolute values (Eastwood & Williams, 2018). It is important that the representations are normalised per-latent, such that the relative importances computed per latent are scaled to reflect their contribution to the output.
Normalising our latents also ensures that the computed weights roughly lie in the interval [−1, 1].

Spearman's based similarity matrix (UDR_S)    We calculate each entry in the similarity matrix according to R_ij(a, b) = Corr(z_{i,a}, z_{j,b}), where Corr stands for Spearman's correlation. We use Spearman's correlation to measure the similarity between z_{i,a} and z_{j,b} because we do not want to assume a linear relationship between the two latent encodings: the geometry of the representational space is not crucial for measuring whether a representation is disentangled (see Sec. 2), but we do hope to find a monotonic dependence between them. Spearman correlation coefficients were computed by extracting 1000 samples from each model and computing the Spearman correlation over all the samples on a per-latent basis.

All-to-all calculations
To make all-to-all comparisons, we picked 10 random seeds per hyperparameter setting and limited all the calculations to those models. Hence we made a maximum of (60 choose 2) pairwise model comparisons when calculating UDR-A2A.
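A minimal NumPy sketch of the UDR_S similarity-matrix computation described above, using a rank transform followed by Pearson correlation (without tie averaging; an illustration of ours, not the released implementation):

```python
import numpy as np

def spearman_similarity(z_i, z_j):
    """z_i, z_j: (N, L) arrays of mean latent responses of two models over
    the same N inputs. Returns the L x L matrix of absolute Spearman
    correlations |R_ij(a, b)| between every pair of latent dimensions."""
    def ranks(z):
        # argsort twice gives 0-based ranks per column (ties broken by order).
        return np.argsort(np.argsort(z, axis=0), axis=0).astype(float)

    r_i, r_j = ranks(z_i), ranks(z_j)
    r_i -= r_i.mean(axis=0)
    r_j -= r_j.mean(axis=0)
    num = r_i.T @ r_j                                      # (L, L) cross products
    denom = np.outer(np.linalg.norm(r_i, axis=0), np.linalg.norm(r_j, axis=0))
    return np.abs(num / denom)
```

Taking the absolute value matters: a latent that encodes a factor with the opposite sign is still a perfectly disentangled encoding of that factor, so only the magnitude of the monotonic dependence is informative.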
Informative latent thresholding
Uninformative latents typically have a KL much smaller than the chosen threshold, while informative latents have a KL much larger than it, so the KL threshold in Eq. 3, though somewhat arbitrary, cleanly picks out the informative latents z.

Sample reduction experiments
We randomly sampled, without replacement, 20 different sets of P models for pairwise comparison from the original set of 50 models with the same hyperparameter setting for UDR, or 60 models with different seeds and hyperparameters for UDR-A2A.

A.6 SUPERVISED METRIC IMPLEMENTATION DETAILS
Original β-VAE metric.    First proposed in Higgins et al. (2017a), this metric suggests sampling two batches of observations x where, in both batches, the same single data generative factor is fixed to a particular value while the other factors are sampled randomly from the underlying distribution. These two batches are encoded into the corresponding latent representations q_φ(z|x), and the pairwise differences between the corresponding mean latent values from the two batches are taken. Disentanglement is measured as the ability of a linear classifier to predict the index of the data generative factor that was fixed when generating x.

We compute the β-VAE score by first randomly picking a single factor of variation and fixing it to a randomly sampled value. We then generate two batches of 64 images where all the other factors are sampled randomly, and take the mean of the differences between the latent mean responses in the two batches to generate a training point. This process is repeated 10000 times to generate a training set, using the fixed factor of variation as the label. We then train a logistic regression on the data using Scikit-learn and report the evaluation accuracy on a test set of 5000 as the disentanglement score.

FactorVAE metric.
Kim & Mnih (2018) proposed a modification of the β-VAE metric which makes the classifier non-parametric (a majority vote based on the index of the latent dimension with the least variance after the pairwise difference step). This makes the FactorVAE metric more robust, since the classifier does not need to be optimised. Furthermore, the FactorVAE metric is more accurate than the β-VAE one, since the β-VAE metric often over-estimates the level of disentanglement, reporting 100% disentanglement even when only K − 1 factors were disentangled.

The FactorVAE score is computed similarly to the β-VAE metric but with a few modifications. First we draw a set of 10000 random samples from the dataset and estimate the variance of the mean latent responses of the model. Latents with a variance of less than 0.05 are discarded. Then batches of 64 samples are generated from a random set of generative factors with a single fixed generative factor. The variances of all the latent responses over the 64 samples are computed and divided by the latent variances computed in the first step. These normalised variances are used to generate a single training point, with the fixed factor of variation as the label. 10000 such training points are generated as the training set. A majority vote classifier is trained to pick out the fixed generative factor, and the evaluation accuracy on a test set of 5000 points is reported as the disentanglement score.
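As a toy illustration of this voting scheme: the sketch below uses a hypothetical, perfectly disentangled "encoder" standing in for a trained model, and smaller sample counts than in the text; it is an assumption-laden sketch, not the paper's implementation.

```python
import random
from collections import Counter, defaultdict
from statistics import pvariance

random.seed(0)
N_FACTORS, BATCH = 3, 64

def encode(factors):
    """Toy disentangled 'encoder': latent i mirrors factor i plus small noise."""
    return [f + random.gauss(0.0, 0.01) for f in factors]

def sample_factors():
    return [random.uniform(-1.0, 1.0) for _ in range(N_FACTORS)]

# Step 1: global variance of each latent over random samples; discard dead latents.
codes = [encode(sample_factors()) for _ in range(1000)]
global_var = [pvariance([c[i] for c in codes]) for i in range(N_FACTORS)]
kept = [i for i, v in enumerate(global_var) if v >= 0.05]

def training_point():
    """Fix one factor, return (latent with least normalised variance, factor label)."""
    k = random.randrange(N_FACTORS)
    fixed = random.uniform(-1.0, 1.0)
    batch = []
    for _ in range(BATCH):
        f = sample_factors()
        f[k] = fixed
        batch.append(encode(f))
    norm_var = {i: pvariance([b[i] for b in batch]) / global_var[i] for i in kept}
    return min(norm_var, key=norm_var.get), k

# Majority-vote classifier: each latent votes for its most common factor label.
votes = defaultdict(Counter)
for _ in range(500):
    j, k = training_point()
    votes[j][k] += 1
vote = {j: c.most_common(1)[0][0] for j, c in votes.items()}

test = [training_point() for _ in range(200)]
accuracy = sum(vote.get(j) == k for j, k in test) / len(test)
print(round(accuracy, 2))  # high (close to 1.0) for this disentangled toy encoder
```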
Figure 5: A: Schematic illustration of the pairwise model comparison. Two trained models i and j are sampled for pairwise comparison. Both models learnt a perfectly disentangled representation, learning to represent two (positions x/y) and three (positions x/y, and size) generative factors respectively. Similarity matrix R_ij: white indicates high similarity between latent dimensions, black low. B: Similarity matrix R_ij for the same pair of models, calculated using either Spearman correlation or Lasso regression. The latter is often cleaner. C: Examples of Lasso similarity matrices of an entangled vs a disentangled model.
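A minimal sketch of the Spearman version of the similarity matrix R_ij from Fig. 5 (the Lasso version would instead fit a lasso regression from one model's latents to the other's and use the absolute weights). The toy data below, with two latents per model, is an illustrative assumption:

```python
import random
from statistics import mean

def ranks(xs):
    """Simple rank transform (assumes no tied values)."""
    order = sorted(range(len(xs)), key=xs.__getitem__)
    r = [0.0] * len(xs)
    for rank, idx in enumerate(order):
        r[idx] = float(rank)
    return r

def pearson(a, b):
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def spearman(a, b):
    return pearson(ranks(a), ranks(b))

def similarity_matrix(z_i, z_j):
    """R[a][b] = |Spearman correlation| between latent a of model i and
    latent b of model j, with responses collected over the same inputs."""
    return [[abs(spearman(za, zb)) for zb in z_j] for za in z_i]

# Toy example: model j's latents are a permuted, sign-flipped copy of model i's.
random.seed(1)
x = [random.random() for _ in range(50)]   # stand-in for a "position x" latent
y = [random.random() for _ in range(50)]   # stand-in for a "position y" latent
z_i = [x, y]
z_j = [y[:], [-v for v in x]]              # permutation + sign flip
R = similarity_matrix(z_i, z_j)
print([[round(v, 2) for v in row] for row in R])  # off-diagonal entries are 1.0
```

Disentangled pairs yield a matrix with one strong entry per row (up to permutation and sign), which is exactly the structure scored by UDR.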
Mutual Information Gap (MIG).
The MIG metric proposed in Chen et al. (2018) estimates the mutual information (MI) between each data generative factor and each latent dimension. For each factor, the two latent dimensions with the highest MI scores are considered. It is assumed that in a disentangled representation only one latent dimension will have high MI with a single data generative factor, and hence the difference between these two MI scores will be large. The MIG score is therefore calculated as the average normalised difference between such pairs of MI scores for each data generative factor. Chen et al. (2018) suggest that the MIG score is more general and unbiased than the β-VAE and FactorVAE metrics.

We compute the Mutual Information Gap by discretising the mean representation of 10000 samples into 20 bins. The disentanglement score is then derived by computing, per generative factor, the difference between the MI of the top two latents with the greatest mutual information with that factor, and taking the mean.
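The computation just described can be sketched end-to-end with plug-in (histogram) MI estimates; the binning choice follows the text, while the toy data below is an illustrative assumption:

```python
import math
from collections import Counter

def discretise(values, bins=20):
    """Histogram-discretise continuous latent means into equal-width bins."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / w), bins - 1) for v in values]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def mutual_information(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def mig(latents, factors):
    """latents: per-dimension lists of latent means (discretised internally);
    factors: per-factor lists of discrete ground-truth labels."""
    gaps = []
    for v in factors:
        mi = sorted((mutual_information(discretise(z), v) for z in latents),
                    reverse=True)
        gaps.append((mi[0] - mi[1]) / entropy(v))  # normalised top-2 gap
    return sum(gaps) / len(gaps)

# Toy: latent 0 encodes factor 0 perfectly, latent 1 is unrelated.
factor0 = [t % 4 for t in range(200)]
latent0 = [float(f) for f in factor0]
latent1 = [float((7 * t) % 13) for t in range(200)]
score = mig([latent0, latent1], [factor0])
print(round(score, 2))  # close to 1.0: latent 0 captures all of factor 0
```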
[Table 7: mean ± s.d. rank correlations of the UDR Lasso and UDR Spearman scores with the supervised metrics, for hyperparameter-matched (Hyper.) and all-to-all pairings; the numerical entries did not survive extraction.]
$$\mathrm{MIG} = \frac{1}{K}\sum_{k=1}^{K} \frac{1}{H(v_k)} \Big( I(z_{j^{(k)}}; v_k) - \max_{j \neq j^{(k)}} I(z_j; v_k) \Big) \qquad (14)$$

where K is the number of generative factors, v_k is a single generative factor, z_j is the mean representation of latent j, $j^{(k)} = \arg\max_j I_n(z_j; v_k)$ is the latent representation with the greatest mutual information with the generative factor, and $H(v_k)$ is the computed entropy of the generative factor.

DCI Disentanglement.
This is the disentanglement part of the three-part metric proposed by Eastwood & Williams (2018). The DCI disentanglement metric is somewhat similar to our unsupervised metric: the authors train a random forest classifier to predict the ground truth factors from the corresponding latent encodings q(z|x). They then use the resulting M × N matrix of feature importance weights to calculate the difference between the entropy of the probability that a latent dimension is important for predicting a particular ground truth factor, weighted by the relative importance of each dimension.

Our DCI disentanglement metric is an implementation of the disentanglement metric as described in Eastwood & Williams (2018) using a gradient boosted tree. It was computed by first extracting the relative importance of each latent mean representation as a predictor for each generative factor, by training a gradient boosted tree (the default Scikit-learn model) on 10000 training and 1000 test points and extracting the importance weights. The weights are summarised into an importance matrix R, with the number of rows equal to the number of generative factors and the number of columns equal to the number of latents, so that $R_{ki}$ is the importance of latent i for predicting factor k. The disentanglement score for each latent i is $D_i = 1 - H_K(P_i)$, where $H_K(P_i) = -\sum_{k=0}^{K-1} P_{ik} \log_K P_{ik}$ denotes the entropy and $P_{ik} = R_{ki} / \sum_{k=0}^{K-1} R_{ki}$ is the probability of latent i being important for predicting factor k. The final score is the weighted mean of the per-latent scores, using the relative predictive importance of each latent as the weight: $D = \sum_i p_i D_i$, where $p_i = \sum_k R_{ki} / \sum_{ik} R_{ki}$.

A.7 ADDITIONAL RESULTS
We evaluated four UDR versions, which differ in whether Spearman- or Lasso-based similarity matrices R_ij are used (subscripts S and L respectively), and whether the models for pairwise similarity comparison are picked from the pool of different seeds trained with the same hyperparameters or from the pool of all models (the latter indicated by the A2A suffix). The A2A correlations in Tbl. 7 are on average slightly higher; however, these scores are more computationally expensive due to the higher number of total pairwise similarity calculations. For that reason, the scores presented in the table are calculated using only 20% of all the trained models. The results presented in the main text of the paper are hence computed using the UDR_L score, which allowed us to evaluate all 5400 models and performed slightly better than the UDR_S score. Figs. 6-8 provide more details on the performance of the different UDR versions.

To qualitatively validate that the UDR method ranks models well, we look in more detail at the β-VAE model ranking when evaluated with the DCI disentanglement metric on the dSprites dataset. This scenario resulted in the worst disagreement between UDR and the supervised metric, as shown in Fig. 6. We consider the UDR_L version of our method, since it appears to give the best trade-off between overall correlation with the supervised metrics and hyperparameter selection accuracy. Fig. 9 demonstrates that the poor correlation between UDR_L and DCI Disentanglement is due to the supervised metric. Models ranked highly by UDR_L but poorly by DCI Disentanglement appear to be qualitatively disentangled through visual inspection of latent traversals. Conversely, models scored highly by DCI Disentanglement but poorly by UDR_L appear entangled.

A.8 UDR CORRELATION WITH FINAL TASK PERFORMANCE
To illustrate the usefulness of UDR for selecting disentangled models, we ran two experiments: we computed the correlation of UDR with fairness scores, and with data efficiency on a model-based RL task.
Fairness scores.
Fig. 11 (left) demonstrates that UDR correlates well with the classification fairness scores introduced by Locatello et al. (2019). We adopted a setup similar to that described in Locatello et al. (2019) to compute fairness, using a gradient boosting classifier over 10000 labelled examples. The fairness score was computed by taking the mean of the fairness scores across all targets and all sensitive variables, where the individual fairness scores are computed by measuring the total variation after intervening on the sensitive variable. The fairness scores were compared against the Lasso regression version of UDR, where models were paired only within the same hyperparameter setting.

Figure 6: Rank correlation between different versions of UDR and different supervised metrics across two datasets and three model classes. We see that the UDR_L approaches slightly outperform the UDR_S ones.

Model-based RL data efficiency.
We reproduced the results from the COBRA agent (Watters et al., 2019) to observe whether UDR correlates with final task performance when using VAEs as state representations. More precisely, we look at training data efficiency, reported as the number of steps needed to achieve good performance on the Clustering tasks (see Watters et al. (2019) for details), while using differently disentangled models. The agent is provided with a pre-trained MONet (Burgess et al., 2019), an exploration policy and a transition model, and has to learn a good reward predictor for the task in a dense reward setting. It uses Model Predictive Control in order to plan and solve the task, where sprites have to be clustered by colour (e.g. two blue sprites and two red sprites). In COBRA, the authors obtain a MONet with a disentangled representation by using a high value of β. When pre-training MONet, we swept over three values of β in order to introduce entanglement in the representations without compromising reconstruction accuracy, and pre-trained multiple seeds for each value of β. We use 5 random initialisations of the reward predictor for each possible MONet model, and train them to perform the clustering task as explained in Watters et al. (2019). We report the number of steps to reach 90% success, averaged across the initialisations. The UDR score is computed by feeding images with a single sprite to obtain an associated unique representation, and proceeding as described in the main text.

As can be seen in Figure 11 (right), we find that the UDR scores correlate with this final data efficiency (linear regression shown; the Spearman correlation ρ is positive). This indicates that one could leverage the UDR score as a metric to select representations for further tasks. In this analysis we used the version of UDR that uses Spearman correlations and within-hyperparameter model comparisons.

A.9 EVALUATING UDR ON MORE COMPLEX DATASETS
We evaluated whether UDR is useful for model selection on more complex datasets. In particular, we chose CelebA and ImageNet. While disentangling VAEs have been shown to perform well on CelebA in the past (e.g. Higgins et al. (2018b)), ImageNet is notoriously too complex for even vanilla VAEs to model. However, we still wanted to verify whether the coarse representations of VAEs on ImageNet could be disentangled, and if so, whether UDR would be useful for model selection. To this end, we ran a hyperparameter sweep for the β-VAE and ranked its representations using UDR. Fig. 12 shows that the UDR scores clearly differ across values of the β hyperparameter. It is also clear that the models were able to learn about CelebA and produce reasonable reconstructions, but on ImageNet even the vanilla VAEs struggled to represent anything but the coarsest information. Figs. 13-14 plot latent traversals for three randomly chosen models with high (>0.6) and low (<0.3) UDR scores. The latents are sorted by their informativeness, as approximated by their batch-averaged per-dimension KL with the prior as per Eq. 3. It is clear that for both datasets the models that are ranked high by the UDR have both more interpretable and more similar representations than the models that are ranked low.

Figure 7: The range of scores for each hyperparameter setting on the dSprites and 3D Shapes datasets for various models and metrics. We see that the different versions of the UDR method broadly agree with each other.

A.10 QUALITATIVE EVALUATION OF MODEL REPRESENTATIONS RANKED BY UDR SCORES
In this section we attempt to qualitatively verify our assumption that, to rephrase Tolstoy, "for a particular dataset and a VAE-based unsupervised disentangled representation learning model class, disentangled representations are all alike, while every entangled representation is entangled in its own way". The theoretical justification of the proposed UDR hinges on the work by Rolinek et al. (2019). However, that work only empirically evaluated its analysis on the β-VAE model class. Even though we have reasons to believe that their theoretical results would also hold for the other disentangling VAEs evaluated in this paper, in this section we empirically evaluate whether this is true.

First, we check whether all model classes operate in the so-called "polarised regime", which was highlighted by Rolinek et al. (2019) as being important for pushing VAEs towards disentanglement. It is known that even vanilla VAEs (Kingma & Welling, 2014; Rezende et al., 2014) enter the "polarised regime", which is often cited as one of their shortcomings (e.g. see Rezende & Viola (2018)). All of the disentangling VAEs considered in this paper augment the original ELBO objective with extra terms. None of these extra terms penalise entering the "polarised regime", apart from that of DIP-VAE-I. We tested empirically whether the different model classes entered the "polarised regime" during our hyperparameter sweeps. We did this by counting the number of latents that were "switched off" in each of the 5400 models considered in our paper, using Eq. 3. We found that all models apart from DIP-VAE-I entered the polarised regime during the hyperparameter sweep, having on average 2.95/10 latents "switched off" (with a standard deviation of 1.97).

Second, we check whether the models scored highly by the UDR do indeed have similar representations, and whether models that are scored low have dissimilar representations.
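The "switched off" count above relies on the batch-averaged per-dimension KL with the prior from Eq. 3. A minimal sketch for diagonal-Gaussian posteriors follows; the 0.01 cut-off here is our illustrative assumption standing in for the threshold of Eq. 3:

```python
import math

def kl_to_unit_gaussian(mu: float, var: float) -> float:
    """KL( N(mu, var) || N(0, 1) ) for one latent dimension."""
    return 0.5 * (mu * mu + var - 1.0 - math.log(var))

def n_switched_off(posteriors, threshold=0.01):
    """Count latents whose batch-averaged KL to the prior falls below the
    informativeness threshold (the threshold value is an assumption here)."""
    off = 0
    for dims in posteriors:  # dims: per-example (mu, var) pairs for one latent
        mean_kl = sum(kl_to_unit_gaussian(m, v) for m, v in dims) / len(dims)
        if mean_kl < threshold:
            off += 1
    return off

# Toy batch: latent 0 collapses to the prior ("switched off" in the polarised
# regime), latent 1 stays informative.
latent0 = [(0.0, 1.0)] * 8                  # q == p, so KL = 0
latent1 = [(2.0, 0.1), (-2.0, 0.1)] * 4     # posterior far from the prior
print(n_switched_off([latent0, latent1]))   # 1
```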
We do this qualitatively by plotting latent traversals for three randomly chosen models within each of the six disentangling VAE model classes considered in this paper. We group these plots by UDR score into three bands: high (UDR > 0.4), medium (0.3 < UDR < 0.4) and low (UDR < 0.3).

Figure 8: Rank correlations of the different versions of the UDR score with the β-VAE metric on the dSprites dataset for a β-VAE hyperparameter search, as the number of pairwise comparisons per model is changed. A higher number of comparisons leads to more accurate and more stable rankings; however, these are still decent even with 5 pairwise comparisons per model.

Figure 9: Example latent traversals of some of the best and worst ranked β-VAE models using the UDR_L (ordinate) and DCI Disentanglement (abscissa) metrics, coloured either by hyperparameter value (top) or final informative latent number (bottom). Uninformative units are greyed out. The models ranked highly by UDR_L do appear to be well disentangled, despite being ranked poorly by DCI Disentanglement (1, 2, 4). On the other hand, models ranked well by DCI Disentanglement but poorly by UDR_L look quite entangled (5, 6). Finally, models ranked poorly by both metrics do appear entangled (3).

Figure 10: Example latent traversals of some of the best and worst ranked β-VAE models using the UDR_L scores. Uninformative latents are greyed out.

Figure 11: Left: Spearman correlation between UDR scores and classification fairness scores introduced by Locatello et al. (2019), across sixty models trained per each of the three different model classes (rows) and over two datasets (columns).
Right: Spearman correlation between UDR scores and data efficiency for learning a clustering task by the COBRA agent introduced by Watters et al. (2019). A lower step number is better.

Figure 12: Distribution of UDR_S scores (P=50) for 300 β-VAE models trained with 6 settings of the β hyperparameter and 50 seeds on the CelebA and ImageNet datasets. The reconstructions shown are for the vanilla VAE (β = 1). ImageNet is a complex dataset that VAEs struggle to model well.

Figure 13: Latent traversals for the four most informative latents, ordered by their KL from the prior, for three different β-VAE models that ranked high or low according to UDR. The models that ranked high learnt representations that are both interpretable and very similar across models. The models that ranked low learnt representations that are harder to interpret and that do not appear similar to each other across models.

Figure 14: Latent traversals for the six most informative latents, ordered by their KL from the prior, for three different β-VAE models that ranked high or low according to UDR. Despite the fact that none of the β-VAE or VAE models were able to learn to reconstruct this dataset well, the models that ranked high by the UDR still managed to learn representations that are more interpretable and more similar across models, unlike the representations of the models that ranked low.
Figure 15: Latent traversals for all ten latent dimensions, presented in no particular order, for three different models per model class. These models were ranked highly by the UDR. It can be seen that they learnt interpretable and similar representations up to permutation, sign inversion and subsetting. We included all model classes that achieved UDR scores in the specified range (UDR > 0.4).

Figure 16: Latent traversals for all ten latent dimensions, presented in no particular order, for three different models per model class. These models received medium UDR scores. It can be seen that they learnt less interpretable and less similar representations than the models shown in Fig. 15. None of the models in these model classes scored higher than the specified range (0.3 < UDR < 0.4).