DEFT: Distilling Entangled Factors by Preventing Information Diffusion
Jiantao Wu, Lin Wang, Bo Yang, Fanqi Li, Chunxiuzi Liu, Jin Zhou
Shandong Provincial Key Laboratory of Network Based Intelligent Computing, University of Jinan, Jinan, China. Correspondence to: Jiantao Wu <[email protected]>, Lin Wang <[email protected]>.
Abstract
Disentanglement is a highly desirable property of representation due to its similarity with human understanding and reasoning. However, the performance of current disentanglement approaches is still unreliable and largely depends on the hyperparameter selection. Inspired by fractional distillation in chemistry, we propose DEFT, a disentanglement framework, to raise the lower limit of disentanglement approaches based on the variational autoencoder. It applies a multi-stage training strategy, including multi-group encoders with different learning rates and piecewise disentanglement pressure, to distill entangled factors stage by stage. Furthermore, we provide insight into identifying the hyperparameters according to the information thresholds. We evaluate DEFT on three variants of dSprites and on SmallNORB, showing robust and high-level disentanglement scores.
1. Introduction
The basis of artificial intelligence is to understand and reason about the world based on a limited set of observations. Unsupervised disentanglement learning is highly desirable due to its similarity with the way humans think. For instance, we can infer the movement of a running ball based on a single glance; the human brain is capable of disentangling the positions from a set of images without supervision. It has been suggested that a disentangled representation is helpful for a large variety of downstream tasks (Schölkopf et al., 2012; Peters et al., 2017). According to Kim & Mnih (2018), a disentangled representation promotes interpretable semantic information. That brings substantial advancement, including but not limited to reducing the performance gap between humans and AI approaches (Lake et al., 2017; Higgins et al., 2018). Other instances of disentangled representation include semantic image understanding and generation (Lample et al., 2017; Zhu et al., 2018; Elgammal et al., 2017), zero-shot learning (Zhu et al., 2019), and reinforcement learning (Higgins et al., 2017b).

Figure 1. The distributions of MIG scores on dSprites, Color dSprites, and Scream dSprites. Models are abbreviated (0=β-VAE, 1=FactorVAE, 2=DIP-VAE-I, 3=DIP-VAE-II, 4=β-TCVAE, 5=AnnealedVAE). The scores are measured from the mean representation of 50 models with the hyperparameter tuned for the best mean MIG score. See more in Sec. 4.

The core of science is to find the invariant laws of the world. The consequences predominated by such laws should be reproducible and of low variance; yet, the current disentanglement approaches show inconsistency (Locatello et al., 2019). We analyzed 5400 models on three variants of dSprites for six disentanglement approaches and six hyperparameters, using the pre-trained models provided by disentanglement_lib. Fig. 1 shows the MIG distributions of the six disentanglement approaches with their best hyperparameter settings, and one can see that a large number of trials have low MIG scores. In other words, the results depend on luck to a certain degree. Thus, a stable and reliable approach is vital for further study and for working in practice.

Information leakage means that one factor's information leaks into two or more latent variables; as a result, the disentanglement scores fluctuate during training. We investigate this phenomenon on dSprites using β-VAE (β = 6) (see Sec. 4) and trace the MIG scores while training. By tracing the max-2 normalized mutual information (NMI) matrix (see Eq. 6), we find that information leakage happens and causes the fluctuation of the MIG scores.
Inspired by fractional distillation in chemistry, we can divide the training process into several stages and extract components as "pure" as possible in each stage. In this paper, we propose a process that distills entangled factors (DEFT) from observations to address the information leakage problem. DEFT applies different disentanglement pressures to allow partial information to pass through the information bottleneck at each stage. Moreover, DEFT consists of multiple encoders with different learning rates to retain the learned codes, reducing information leakage. We also propose the anneal test to determine the piecewise disentanglement pressures via information thresholds relative to the inductive biases on the data. Our contributions are summarized as follows:

• We propose the DEFT method to distill entangled factors stage by stage, in which each encoder is only responsible for learning partial information within a narrow information bottleneck at each stage.

• We propose an anneal test to determine the hyperparameters for manifesting the potential of DEFT.
2. Background
Disentanglement learning is fascinating and challenging because of its intrinsic similarity to human intelligence. As depicted in the seminal paper by Bengio et al., humans can understand and reason from a complex observation to the explanatory factors. A typical modeling assumption of disentanglement learning is that a set of independent factors generates the observed data. The task of disentangling factors of variation is to learn a representation separating the ground-truth factors into different dimensions of the latent space. Current studies usually evaluate proposed approaches on artificial datasets with label information for measuring the disentanglement score. Popular datasets include dSprites (Matthey et al., 2017), CelebA (Liu et al., 2015), Shapes3D (Burgess & Kim, 2018), Face3D (Heseltine et al., 2008), Cars3D (Kim & Mnih, 2018), and SmallNORB (LeCun et al., 2004). The disentanglement metric is also crucial for this domain, yet there is no widely accepted metric. The available metrics include the BetaVAE metric (Higgins et al., 2017a), the FactorVAE metric (Kim & Mnih, 2018), the Mutual Information Gap (Chen et al., 2018), Modularity (Ridgeway & Mozer, 2018), DCI (Eastwood & Williams, 2018), the SAP score (Kumar et al., 2018), UDR (Duan et al., 2020), and IIS (informativeness, interpretability, and separability) (Do & Tran, 2020).

Higgins et al. (2017a) find that the KL term's pressure in the VAE encourages disentanglement; they propose β-VAE, which introduces an extra hyperparameter for better disentanglement. However, the selection of β is a trade-off between the reconstruction error and the disentanglement. Burgess et al. (2018) propose an annealed version, AnnealedVAE, to overcome this issue and achieve both low reconstruction error and a high disentanglement score. They suggest that a gradually increasing information bottleneck enforces the model to utilize the extra bits for the codes with the best improvement. Kim & Mnih (2018) and Chen et al. (2018) argue that a heavier penalty on the total correlation between the latent variables encourages the model to learn a more disentangled representation. β-TCVAE (Chen et al., 2018) decomposes the KL term into the index-code mutual information (MI), the total correlation (TC), and the dimension-wise KL (DWKL). FactorVAE (Kim & Mnih, 2018) estimates the TC term with a multi-layer perceptron. Other approaches like DIP-VAE (Kumar et al., 2018) can also enforce the aggregate posterior of the latent variables to be statistically independent. These approaches avoid diminishing the mutual information between the observations and the latent variables, achieving both a high disentanglement score and a low reconstruction error.

Though promoting independence obtains state-of-the-art performance on these artificial datasets, the fundamentals are doubted by Locatello et al. (2019). They claimed, "We do not find any evidence that they can be used to reliably learn disentangled representations in an unsupervised manner as hyperparameters seem to matter more than the model and good hyperparameters seemingly cannot be identified without access to ground-truth labels."
3. Preliminaries
VAE The variational autoencoder (Kingma & Welling, 2013) is a popular generative model, assuming that the latent variables obey a specific prior (a normal distribution in practice). The key idea of VAE is to maximize the likelihood objective via the following approximation:

$$\mathcal{L}(\theta, \phi; x, z) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)). \tag{1}$$

β-VAE Higgins et al. (2017a) discovered the relationship between disentanglement and the strength of the KL divergence penalty. They propose β-VAE, which maximizes the following expression:

$$\mathcal{L}(\theta, \phi; x, z; \beta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \beta D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)). \tag{2}$$

β controls the pressure for the posterior q_φ(z|x) to match the factorized unit Gaussian prior p(z). However, there is a trade-off between the quality of the reconstructed images and the disentanglement. A high value of β leads to a lower implicit capacity of the latent information and ambiguous reconstructions, but a high disentanglement score.
AnnealedVAE
Burgess et al. (2018) proposed AnnealedVAE, which progressively increases the information capacity of the latent variables while training:

$$\mathcal{L}(\theta, \phi; x, z; C) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \gamma\, \big| D_{\mathrm{KL}}(q_\phi(z|x) \,\|\, p(z)) - C \big|, \tag{3}$$

where γ is a constant large enough to constrain the latent information, and C is a value gradually increased from zero to a large number to produce high reconstruction quality.

β-TCVAE One popular disentanglement prior assumes that the factors of the observations are independent. A series of methods enforce independence between the latent variables while avoiding diminishing the reconstruction error. FactorVAE (Kim & Mnih, 2018) applies a discriminator to approximate the total correlation (TC) (Watanabe, 1960); β-TCVAE (Chen et al., 2018) promotes the TC penalty by decomposing the KL term; DIP-VAE (Kumar et al., 2018) matches the covariance matrix of q(z) to the identity. Here, we choose β-TCVAE as the representative of these approaches. The objective of β-TCVAE is

$$\mathcal{L}(\theta, \phi; x, z; \beta) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \mathbb{E}_{q(z,n)}\left[\log \frac{q_\phi(z|n)\, p(n)}{q_\phi(z)\, p(n)}\right] - \beta\, \mathbb{E}_{q_\phi(z)}\left[\log \frac{q_\phi(z)}{\prod_j q_\phi(z_j)}\right] - \sum_j \mathbb{E}_{q_\phi(z_j)}\left[\log \frac{q_\phi(z_j)}{p(z_j)}\right]. \tag{4}$$

The notion of disentanglement is still an open topic, and there is no widely accepted definition. Therefore, a single metric can hardly assess disentanglement, and four metrics are applied in this paper, including DCI, MIG, and Modularity. Among them, MIG detects axis-alignment, is unbiased, and is suitable for any latent distribution.

The mutual information measures an information-theoretic quantity between a latent z_j and a factor v_k. We define the m-th largest normalized mutual information (NMI) between the latents and a factor v_k as max-m NMI(v_k):

$$\text{max-}m\ \mathrm{NMI}(v_k) = \max_j^{(m)} \frac{I_n(z_j; v_k)}{H(v_k)}, \tag{5}$$

where max_j^{(m)} returns the m-th largest element over the index j. The MIG is then computed as

$$\mathrm{MIG} = \frac{1}{K} \sum_{k=1}^{K} \left[ \text{max-}1\ \mathrm{NMI}(v_k) - \text{max-}2\ \mathrm{NMI}(v_k) \right]. \tag{6}$$
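For concreteness, the following is a minimal sketch of Eqs. (5) and (6) for discrete factors and discretized (e.g., binned) latent codes; the function name and the binning assumption are ours, not part of the original evaluation code.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def mig(codes, factors):
    """MIG sketch. codes: [N, D] discretized latents; factors: [N, K] factors."""
    N, D = codes.shape
    K = factors.shape[1]
    gaps = []
    for k in range(K):
        v = factors[:, k]
        h = mutual_info_score(v, v)  # I(v; v) = H(v_k), in nats
        # normalized MI of every latent with factor v_k, sorted descending
        nmi = sorted((mutual_info_score(codes[:, j], v) / h for j in range(D)),
                     reverse=True)
        gaps.append(nmi[0] - nmi[1])  # max-1 NMI minus max-2 NMI
    return float(np.mean(gaps))
```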
Figure 2. Failure pattern on dSprites. (a) The curves of the objective and the MIG score. (b) max-1 NMI(v_k) over training. (c) max-2 NMI(v_k) over training. Each row in (b) and (c) corresponds to the NMI vector of the factors at an epoch. Note that, despite the smooth trend in disentangling the primary mutual information (b), the process for the secondary mutual information is not as stable, especially for the scale factor (c).
4. Instability of disentanglement approaches
We observe a high variance of disentanglement in the comprehensive study of disentanglement approaches (Locatello et al., 2019). It implies that these models are unreliable and that there exists a failure pattern that violates disentanglement. A perfectly disentangled representation should project each factor into one latent variable. Therefore, it has high values in the max-1 NMI vector max-1 NMI(v_k) and low values in the max-2 NMI vector max-2 NMI(v_k). In other words, the learned representation re-entangles when the max-1 NMI decreases or the max-2 NMI increases.

We conducted experiments on dSprites with the standard β-VAE (β = 6). We traced the normalized mutual information between the factors and the latent variables at the end of each epoch. The MIG scores fluctuated during training, as depicted in Fig. 2a. Figs. 2b and 2c show the changes in NMI over training. One can see that the fluctuating values of the max-2 NMI vector were the main obstacle to disentangling factors. The increase of the max-2 NMI indicates that two or more latent variables learned information from one factor (scale). We also observed this phenomenon on the other datasets and with the other disentanglement approaches. Thus, preventing information leakage could improve disentanglement significantly.
5. DEFT
Figure 3. Information threshold. The model starts to learn information at iteration 7500 (β = 32), where the KL loss rises and the reconstruction error falls.

5.1. Anneal test

Burgess et al. (2018) suggest a close relationship between the information bottleneck principle and β in β-VAE. Though a high disentanglement pressure encourages disentanglement, an overly high value of β leads the latent variables to share no information with the observations. We can infer that there is a critical point of β at which the model starts to learn information from the observations, and we call this point the information threshold. Therefore, we introduce the anneal test to determine the threshold for a given dataset. The objective is the same as in β-VAE, but it uses a β annealed from a very high value down to one. As the KL term's pressure decays, there exists a critical point at which the model starts to learn information from the dataset, reducing the reconstruction error and raising the KL loss. For example, we train the model with a β annealed from 200 to 1 over 100,000 iterations. As shown in Fig. 3, the threshold is approximately 52 at iteration 7400. Roughly, we regard the model as learning information once the KL loss rises above 0.1.
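The anneal test can be realized with an ordinary β-VAE training loop whose β decays linearly. Below is a minimal sketch under our own assumptions: `model.elbo_terms(x)` is a hypothetical helper returning the reconstruction loss and the KL term, `batches` is any iterator over minibatches, and the 0.1-nat criterion follows the text.

```python
import torch

def anneal_test(model, batches, optimizer, beta_start=200.0,
                total_steps=100_000, kl_eps=0.1):
    """Train with beta annealed from beta_start to 1; return the first beta
    at which the KL loss exceeds kl_eps (the information threshold)."""
    threshold = None
    for step, x in zip(range(total_steps), batches):
        beta = beta_start + (1.0 - beta_start) * step / total_steps  # linear decay to 1
        recon_loss, kl = model.elbo_terms(x)
        loss = recon_loss + beta * kl
        optimizer.zero_grad(); loss.backward(); optimizer.step()
        if threshold is None and kl.item() > kl_eps:
            threshold = beta  # model has started to pick up information
    return threshold
```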
To overcome information leakage, the main difficulty is retaining the disentanglement of previously learned factors. We propose a framework to distill entangled factors (DEFT) from the observations, analogous to fractional distillation in chemistry. This framework provides deft settings to distill factors of variation and guarantees the improvement of disentanglement in each stage. It has multiple encoders with different learning rates and uses piecewise disentanglement pressures. The decoder of DEFT takes the concatenated latent variables from the outputs of these encoders. In each training stage, only one active encoder has the primary learning rate lr, and the other, inactive encoders have a secondary learning rate γ × lr (0 ≤ γ ≤ 1). DEFT applies a disentanglement pressure following a piecewise constant decay schedule to make different factors emerge in each stage. Furthermore, DEFT is compatible with the disentanglement approaches based on the variational autoencoder, and we chose β-VAE as the base method in the main part of this paper.

The architecture of DEFT is shown in Fig. 4. Two independent groups of encoders receive the input images; then the concatenated latent variables from the outputs of these encoders are fed into the decoder. DEFT learns disentangled factors in the corresponding stages.

Figure 4. Architecture of DEFT. The information thresholds on the toy dataset vary from blue to red. In the first stage, the first encoder extracts translation under high pressure, which blocks the information of rotation. In the second stage, the second encoder extracts the residual component as the information bottleneck is extended.

Algorithm 1
The algorithm of DEFT.
Input: pressures {β_j}, number of stages m
Initialize θ, {φ_i} for p_θ(x|z) and the encoders {q^i_{φ_i}(z|x)}.
for j = 0 to m − 1 do
    x = sample()
    z = concat({reparameterize(q^i_{φ_i}(x))})
    g_θ, {g_{φ_i}} = ∇_{θ,{φ_i}} L(θ, {φ_i}; x, z; β_j)
    for i = 0 to m − 1 do
        if i = j then
            φ_i = φ_i − g_{φ_i} × lr
        else
            φ_i = φ_i − g_{φ_i} × lr × γ
        end if
    end for
    θ = θ − g_θ × lr
end for
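A minimal PyTorch sketch of Algorithm 1 is given below. The module names (`encoders`, `decoder`), the helper `beta_vae_loss`, and the default γ = 0.1 are our own illustrative assumptions, not the authors' code; the per-stage learning rates are realized with optimizer parameter groups.

```python
import torch

def train_deft(encoders, decoder, loader, betas, epochs_per_stage,
               lr=5e-4, gamma=0.1):
    """One encoder per stage; the active encoder trains at lr, the rest at gamma*lr."""
    assert len(betas) == len(encoders) == len(epochs_per_stage)
    for stage, beta in enumerate(betas):
        # Active encoder gets the primary lr; inactive encoders get gamma * lr.
        param_groups = [{"params": enc.parameters(),
                         "lr": lr if i == stage else gamma * lr}
                        for i, enc in enumerate(encoders)]
        param_groups.append({"params": decoder.parameters(), "lr": lr})
        opt = torch.optim.Adam(param_groups)
        for _ in range(epochs_per_stage[stage]):
            for x in loader:
                stats = [enc(x) for enc in encoders]          # each: (mu, logvar)
                mu = torch.cat([s[0] for s in stats], dim=1)
                logvar = torch.cat([s[1] for s in stats], dim=1)
                z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
                x_hat = decoder(z)
                loss = beta_vae_loss(x_hat, x, mu, logvar, beta)  # hypothetical helper
                opt.zero_grad(); loss.backward(); opt.step()
```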
6. Experiment
Settings
We follow the experimental settings in Higgins et al. (2017a). There are two types of encoder architecture and one decoder architecture in this paper. For the standard setting, the encoder consists of 4 convolutional layers, each with 32 channels, 4x4 kernels, and a stride of 2. The outputs of the encoder are fed into a fully connected layer of 256 units. After that, one fully connected layer of 20 units parameterizes the mean and log standard deviation of the latent distribution with 10 Gaussian random variables. For the DEFT setting, the number of channels and the latent size are divided by the total number of groups. The decoder is almost symmetric with the encoder but applies a parameterized Bernoulli distribution over the output pixels. All layers are activated by ReLU. The optimizer is Adam with a learning rate of 5e-4.
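The standard encoder described above can be sketched as follows; we assume 64x64 single-channel inputs (as in dSprites), and the class name is ours.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Standard-setting encoder: 4 conv layers (32 ch, 4x4, stride 2),
    FC-256, then FC-20 parameterizing 10 Gaussian latents."""
    def __init__(self, channels=32, latent_dim=10, in_channels=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
        )  # 64x64 -> 4x4 spatial
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(channels * 4 * 4, 256), nn.ReLU(),
            nn.Linear(256, 2 * latent_dim),  # mean and log standard deviation
        )

    def forward(self, x):
        mu, log_std = self.fc(self.conv(x)).chunk(2, dim=1)
        return mu, log_std
```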
Figure 5. The max-1 and max-2 NMI vectors for different γ: (a) max-1 NMI, (b) max-2 NMI. Columns show the mutual information of the five factors. Rows show the results of independent trials with different γ.

Coefficient of secondary lr
We investigated the effects of γ. The settings of DEFT were two encoders, β-VAE as the base model, and piecewise pressures of 70 and 30. We first trained the model through the first stage normally, then compared the normalized mutual information for different settings of γ. One can see that the max-1 normalized mutual information changes only slightly for the larger values of γ, and the max-2 normalized mutual information increases as γ gets bigger, except for γ = 0. The experimental results indicate that a small value of γ can prevent one factor's information from leaking into another dimension. However, it also prevents the model from modifying the learned dimensions, causing lower max-1 mutual information when γ is too small. In practice, a small nonzero γ keeps a good balance between increasing the max-1 mutual information and decreasing the max-2 mutual information.

Piecewise pressures
The ideal situation is to find a set of βs that isolates information into several regions, each containing only "pure" information for one factor. In this paper, we do not intend to uncover the mechanism of the information threshold; we only explore it as a tool for selecting suitable hyperparameters for DEFT and proving the efficiency of DEFT. To estimate the distribution of information thresholds with respect to a factor c_i, we randomly sample one observation and collect all samples sharing the same factors except factor c_i; then we calculate the information threshold of this sub-dataset by the algorithm introduced in Sec. 5.1; last, we repeat the above procedure 50 times to generate enough data points. We measured the information thresholds of the factors on four datasets, as shown in Fig. 6.
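The per-factor procedure described above can be sketched as follows. The ground-truth dataset API (`sample_factors`, `images_with_fixed_factors`) and `run_anneal_test` (a fresh model trained with the anneal test of Sec. 5.1) are hypothetical names for illustration.

```python
def threshold_distribution(dataset, factor_index, repeats=50):
    """Estimate the distribution of information thresholds for factor c_i."""
    thresholds = []
    for _ in range(repeats):
        base = dataset.sample_factors()  # factor values of one random observation
        # sub-dataset: all images sharing `base`, with only factor c_i varying
        sub = dataset.images_with_fixed_factors(base, free=factor_index)
        thresholds.append(run_anneal_test(sub))  # threshold of this sub-dataset
    return thresholds
```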
Figure 6. Information thresholds of the factors on four datasets: (a) SmallNORB, (b) dSprites, (c) Color dSprites, (d) Scream dSprites. The red numbers denote the pressures that separate factors with different information thresholds.
Table 1. Experimental settings for DEFT. γ is fixed to the same value in all runs.

  Dataset           Groups   Epochs          Pressures
  dSprites          4
  Color dSprites    4
  Scream dSprites   2
  SmallNORB         4        20,40,40,400    30,5,3,1

One can see that the normal dSprites and Color dSprites have more separable information thresholds than Scream dSprites and SmallNORB. Though the three variants of dSprites have the same factors, their information thresholds are different. This difference in information thresholds explains why the current disentanglement approaches fail to transfer hyperparameters across different problems. We summarize the piecewise pressures and the other training settings in Tab. 1.
We trained each model 10 times and compared our model with the other six disentanglement approaches on dSprites, Color dSprites, Scream dSprites, and SmallNORB. As shown in Fig. 7, all four metrics show that the representations learned by DEFT have a lower variance than the others while keeping high-level scores. DEFT effectively raises the lower limit of β-VAE; however, DEFT cannot boost the upper limit, which may be decided by the base approach. Interestingly, Scream dSprites is the most challenging problem among the four, and the result in Fig. 6(d) also indicates that its factors are hard to separate. That implies an intrinsic relationship between disentanglement and information thresholds.
Figure 7. Distribution of disentanglement scores for different approaches and datasets: (a) dSprites, (b) Color dSprites, (c) Scream dSprites, (d) SmallNORB. The seven approaches are 1=DEFT, 2=β-VAE, 3=FactorVAE, 4=DIP-VAE-I, 5=DIP-VAE-II, 6=β-TCVAE, and 7=AnnealedVAE.

In this part, we show the changes of the MIG scores across the stages in Fig. 8. The experimental results on all four datasets reveal that DEFT obtains low scores in the first stage and gradually improves in the following stages. However, DEFT failed to raise the lower limit of disentanglement on Scream dSprites: DEFT is unable to distill entangled factors when their information thresholds are difficult to distinguish. Apart from that case, DEFT significantly improves the lower limit of disentanglement with the multi-stage strategy.

Figure 8. MIG distribution of DEFT on the four datasets across stages: (a) dSprites, (b) Color, (c) Scream, (d) SmallNORB.
The reconstruction quality is also a desirable property. We show the distribution of the reconstruction error in Fig. 9. In general, DEFT achieves both high image quality and disentanglement.
Failure rate
We regard a model as failing to learn a disentangled representation if its MIG score is lower than 0.1. Tab. 2 shows the failure rates. One can see that FactorVAE and β-TCVAE have the lowest failure rates among the baselines. Though DIP-VAE encourages independence, penalizing the TC term seems to be a much more reliable method, and FactorVAE and β-TCVAE perform better than DIP-VAE. Besides, DEFT significantly decreases the failure rate compared with the other approaches.
Table 2. Failure rate (%) for each approach (column, numbered as in Fig. 7) and dataset (row).

  Dataset            1      2      3      4      5      6      7
  dSprites           8.0   47.7   11.0   76.0   87.0   18.7   61.7
  Color dSprites     0.0   36.0   16.7   78.0   89.3   23.7   47.7
  Scream dSprites   12.0   34.3   66.7   80.3   89.0   43.7  100.0
  SmallNORB          0.0    0.0    0.0   28.7    0.0    0.0   88.0
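As a small illustration of this criterion (our helper, not the paper's code), the failure rate over a set of trained seeds is simply:

```python
import numpy as np

def failure_rate(mig_scores, threshold=0.1):
    """Percentage of trained models whose MIG score falls below the threshold."""
    return 100.0 * np.mean(np.asarray(mig_scores) < threshold)
```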
Normal encoder We applied the experimental settings in Tab. 1 except for the encoder part and trained DEFT with a normal (single) encoder on dSprites. Each trial was repeated 10 times. From Fig. 10a, the MIG scores reached an average level in the first stage and diffused in the following stages.
Figure 9. Distribution of the reconstruction error on dSprites, Color dSprites, Scream dSprites, and SmallNORB.
We further investigated the reason and traced the normalized max-1 mutual information of a single latent z with the factors, I_n(z; v_k), in Fig. 10b. Though this latent variable learned the disentangled factor (posX) in the first stage, it could not retain the information when the information bottleneck was extended.
Figure 10. DEFT with a normal encoder. (a) The distribution of MIG scores over the four stages. (b) The max-1 NMI, I_n(z; v_k), over training.

Multiple encoders without piecewise pressures
We removed the piecewise pressures and applied a constant pressure during training. The encoder part consisted of two encoders with different learning rates, and the other parts were the same as in β-VAE (β = 6). We repeated this model 40 times in total. Unsurprisingly, the differentiated encoders could not automatically separate factors into different dimensions (Fig. 11); on the contrary, this model had lower MIG scores. This indicates that the differentiated encoders alone prevent the model from finding the optimum. These results match the results in Fig. 7: DEFT brings a significant improvement of the lower limit yet an insignificant improvement of the upper limit.
Figure 11. MIG distributions of β-VAE with a normal (a) and multiple (b) encoder(s).

Although AnnealedVAE follows the same principle as DEFT (both use an extending information bottleneck), they focus on different properties. AnnealedVAE focuses on increasing the bottleneck; instead, DEFT focuses on retaining the learned representation. Besides, AnnealedVAE uses an increasing capacity of latent information, whereas DEFT uses a decaying pressure. The performance of DEFT looks like that of β-VAE with the lower tail discarded (see Fig. 7). Furthermore, AnnealedVAE does not behave as claimed. The KL term can be decomposed into the MI, TC, and DWKL terms. Our experiment on dSprites showed that AnnealedVAE increases the MI term at the very beginning, and the model learns no more information after 15K iterations (Fig. 12). The increased DWKL does not help disentanglement.
Figure 12. AnnealedVAE on dSprites: (a) the KL decomposition curves (MI and dimension-wise KL), (b) max-1 NMI, (c) max-2 NMI.

β-TCVAE The intuitive picture we have developed of blocking partial information, so that fewer factors of variation are represented in each stage, motivated us to directly penalize the MI term in Eq. 4. The modified objective penalizing the TC is

$$\mathcal{L}(\theta, \phi; x, z; \alpha) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - \alpha\, \mathbb{E}_{q(z,n)}\left[\log \frac{q_\phi(z|n)\, p(n)}{q_\phi(z)\, p(n)}\right] - \beta\, \mathbb{E}_{q_\phi(z)}\left[\log \frac{q_\phi(z)}{\prod_j q_\phi(z_j)}\right] - \sum_j \mathbb{E}_{q_\phi(z_j)}\left[\log \frac{q_\phi(z_j)}{p(z_j)}\right]. \tag{7}$$

We compared plain β-VAE, DEFT, and DEFT (TC) on dSprites. For simplicity, we set β = 8 for DEFT (TC). The experimental results in Fig. 13 demonstrate that both DEFT variants achieve robust and high MIG scores. We have not conducted a comprehensive study of DEFT (TC), and DEFT is applicable to other base approaches.
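As a concrete reference for the terms in Eq. 7, below is a hedged sketch of minibatch-weighted estimators of MI, TC, and DWKL in the style of β-TCVAE; the helper names are ours, and this is an illustrative estimator, not the authors' implementation.

```python
import math
import torch

def gaussian_log_density(z, mu, logvar):
    # Elementwise log N(z; mu, exp(logvar)).
    c = math.log(2 * math.pi)
    return -0.5 * (c + logvar + (z - mu) ** 2 / logvar.exp())

def decomposed_kl_terms(z, mu, logvar, dataset_size):
    """Minibatch-weighted estimates (cf. beta-TCVAE) of MI, TC, and DWKL.
    z, mu, logvar: [B, D] samples and posterior parameters."""
    B = z.shape[0]
    log_qzx = gaussian_log_density(z, mu, logvar).sum(1)           # log q(z|x)
    log_pz = gaussian_log_density(z, torch.zeros_like(z),
                                  torch.zeros_like(z)).sum(1)      # log p(z)
    # log q(z_i | x_j) for all sample pairs: [B, B, D]
    mat = gaussian_log_density(z.unsqueeze(1), mu.unsqueeze(0), logvar.unsqueeze(0))
    log_qz = torch.logsumexp(mat.sum(2), dim=1) - math.log(B * dataset_size)
    log_prod_qzj = (torch.logsumexp(mat, dim=1)
                    - math.log(B * dataset_size)).sum(1)
    mi = (log_qzx - log_qz).mean()        # index-code mutual information
    tc = (log_qz - log_prod_qzj).mean()   # total correlation
    dwkl = (log_prod_qzj - log_pz).mean() # dimension-wise KL
    return mi, tc, dwkl

# Eq. 7 as a loss (alpha scales MI, beta scales TC):
# loss = recon_loss + alpha * mi + beta * tc + dwkl
```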
Figure 13. Distribution of MIG scores for β-VAE, DEFT, and DEFT (TC) on dSprites.
Higgins et al. (2017a) introduced the latent traversal to visualize the images generated while traversing a single latent z_i. Fig. 14 shows the latent traversal of the best model with the highest MIG score. We discover some intrinsic relationships between information thresholds and disentanglement. Shape (1) and orientation (3) share the most considerable overlap of information thresholds; correspondingly, shape and orientation entangle in the same latent on dSprites. For SmallNORB, elevation and azimuth, which have lower information thresholds, can hardly be disentangled. For Scream dSprites, all factors have close information thresholds, and they are also entangled in our model. For Color dSprites, posY and orientation have not been disentangled.
7. Conclusion
We demonstrate the information leakage problem, which causes the fluctuation of disentanglement scores. We propose a deft method, DEFT, to distill different factors in each stage by changing the capacity of the information bottleneck. Our approach significantly improves the lower limit of disentanglement and decreases the failure rate of disentangling. We have provided insights into why hyperparameters are difficult to transfer across problems, explained by the information thresholds. Such insights help us select hyperparameters when label information is available.

Although DEFT improves robustness, its upper limit is constrained by the base method. Thus, we may improve DEFT by combining it with other approaches. Exploring an unsupervised algorithm for determining the information thresholds is also vital for DEFT.
Figure 14. Latent traversal of DEFT on the four datasets (MIG score in parentheses): (a) dSprites (0.41), (b) Color (0.37), (c) Scream (0.26), (d) SmallNORB (0.27). Each column shows a latent z_i and its corresponding factor (last row).

References

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Burgess, C. and Kim, H. 3D Shapes dataset. https://github.com/deepmind/3dshapes-dataset/, 2018.

Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.

Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.

Do, K. and Tran, T. Theory and evaluation metrics for learning disentangled representations. In International Conference on Learning Representations (ICLR), 2020.

Duan, S., Matthey, L., Saraiva, A., Watters, N., Burgess, C., Lerchner, A., and Higgins, I. Unsupervised model selection for variational disentangled representation learning. In International Conference on Learning Representations (ICLR), 2020.

Eastwood, C. and Williams, C. K. I. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations (ICLR), 2018.
Elgammal, A., Liu, B., Elhoseiny, M., and Mazzone, M. CAN: Creative adversarial networks, generating art by learning about styles and deviating from style norms. arXiv preprint arXiv:1706.07068, 2017.

Heseltine, T., Pears, N. E., and Austin, J. Three-dimensional face recognition using combinations of surface feature map subspace components. Image and Vision Computing, 26(3):382–396, 2008. doi: 10.1016/j.imavis.2006.12.008.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR), 2017a.

Higgins, I., Pal, A., Rusu, A. A., Matthey, L., Burgess, C., Pritzel, A., Botvinick, M., Blundell, C., and Lerchner, A. DARLA: Improving zero-shot transfer in reinforcement learning. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70, pp. 1480–1490, 2017b.

Higgins, I., Sonnerat, N., Matthey, L., Pal, A., Burgess, C. P., Bosnjak, M., Shanahan, M., Botvinick, M., Hassabis, D., and Lerchner, A. SCAN: Learning hierarchical compositional visual concepts. In International Conference on Learning Representations (ICLR), 2018.

Kim, H. and Mnih, A. Disentangling by factorising. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), pp. 4153–4171, 2018.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. arXiv preprint arXiv:1711.00848, 2018.

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Lample, G., Zeghidour, N., Usunier, N., Bordes, A., Denoyer, L., and Ranzato, M. A. Fader networks: Manipulating images by sliding attributes. In Advances in Neural Information Processing Systems (NIPS 2017), pp. 5967–5976. Curran Associates, Inc., 2017.

LeCun, Y., Huang, F. J., and Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2004), pp. 97–104. IEEE Computer Society, 2004. doi: 10.1109/CVPR.2004.144.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV 2015), December 2015.

Locatello, F., Bauer, S., Lucic, M., Rätsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), pp. 7247–7283, 2019.

Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

Peters, J., Janzing, D., and Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms. The MIT Press, 2017.

Ridgeway, K. and Mozer, M. C. Learning deep disentangled embeddings with the F-statistic loss. In Advances in Neural Information Processing Systems (NIPS 2018), pp. 185–194, 2018.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. On causal and anticausal learning. In Proceedings of the 29th International Conference on Machine Learning (ICML 2012), 2012.

Watanabe, S. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4:66–82, 1960.

Zhu, Y., Elhoseiny, M., Liu, B., Peng, X., and Elgammal, A. A generative adversarial approach for zero-shot learning from noisy texts. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), pp. 1004–1013, 2018.

Zhu, Y., Xie, J., Liu, B., and Elgammal, A. Learning feature-to-feature translator by alternating back-propagation for zero-shot learning. arXiv preprint arXiv:1904.10056, 2019.