Disentangling Action Sequences: Discovering Correlated Samples
Published as a conference paper at ICLR 2021

Disentangling Action Sequences: Finding Correlated Images
Jiantao Wu, Lin Wang*
Shandong Provincial Key Laboratory of Network Based Intelligent Computing
University of Jinan, Jinan, 250022, China
{clouderow,wangplanet}@gmail.com

ABSTRACT
Disentanglement is a highly desirable property of representations because of its similarity to human understanding and reasoning. It improves interpretability, enables downstream tasks, and supports controllable generative models. However, the field is challenged by an abstract notion and by incomplete theories of unsupervised disentanglement learning. We demonstrate that the data itself, such as the orientation of images, plays a crucial role in disentanglement, rather than the ground-truth factors, and that the disentangled representations align the latent variables with action sequences. We further introduce the concept of disentangling action sequences, which facilitates describing the behaviour of existing disentangling approaches. An analogy for this process is discovering the commonality between things and categorizing them. Furthermore, we analyze the inductive biases on the data and find that the latent information thresholds are correlated with the significance of the actions. For the supervised and unsupervised settings, we respectively introduce two methods to measure the thresholds. We further propose a novel framework, the fractional variational autoencoder (FVAE), to disentangle action sequences of differing significance step by step. Experimental results on dSprites and 3D Chairs show that FVAE improves the stability of disentanglement.
1 INTRODUCTION
The basis of artificial intelligence is to understand and reason about the world from a limited set of observations. Unsupervised disentanglement learning is highly desirable because of its similarity to the way humans think. For instance, we can infer the movement of a running ball from a single glance because the human brain is capable of disentangling the ball's position from a set of images. It has been suggested that a disentangled representation is helpful for a large variety of downstream tasks (Schölkopf et al., 2012; Peters et al., 2017). According to Kim & Mnih (2018), a disentangled representation promotes interpretable semantic information. That brings substantial advancement, including but not limited to reducing the performance gap between humans and AI approaches (Lake et al., 2017; Higgins et al., 2018). Other applications of disentangled representations include semantic image understanding and generation (Lample et al., 2017; Zhu et al., 2018; Elgammal et al., 2017), zero-shot learning (Zhu et al., 2019), and reinforcement learning (Higgins et al., 2017b). Despite the advantages of disentangling approaches, two issues remain to be addressed: the abstract notion and the weak explanations.
Notion. The conception of disentangling factors of variation was first proposed in 2013. Bengio et al. (2013) claim that, for observations, the considered factors should be explanatory and independent of each other. The explanatory factors are, however, hard to formalize and measure. An alternative is to disentangle the ground-truth factors (Ridgeway, 2016; Do & Tran, 2020). However, if we consider the uniqueness of the ground-truth factors, the question arises of how to discover them among multiple equivalent representations. As the proverb says, "one cannot make bricks without straw": Locatello et al. (2019) prove the impossibility of disentangling factors without the help of inductive biases in the unsupervised setting.

Explanation.
There are mainly two types of explanations for unsupervised disentanglement: the information bottleneck and the independence assumption. The ground-truth factors affect the data independently; therefore, the disentangled representations must follow the same structure. The approaches holding the independence assumption encourage independence between the latent variables (Schmidhuber, 1992; Chen et al., 2018; Kim & Mnih, 2018; Kumar et al., 2018; Lopez et al., 2018). However, real-world problems impose no strict independence constraint, and the factors may be correlated. The other explanation incorporates information theory into disentanglement. Burgess et al.; Higgins et al.; Insu Jeon et al.; Saxe et al. suggest that a limit on the capacity of the latent information channel promotes disentanglement by forcing the model to acquire the most significant latent representation. They further hypothesize that the information bottleneck enforces the model to find the most significant improvement.

In this paper, we first demonstrate that, instead of the ground-truth factors, the disentangling approaches learn actions of translation based on the orientation of the images. We then propose the concept of disentangling actions, which discovers the commonalities between the images and categorizes them into sequences. We treat disentangling action sequences as a necessary step toward disentangling factors; it can capture the internal relationships within the data and makes it possible to analyze the inductive biases from the data perspective. Furthermore, the results on a toy example show that the significance of actions is positively correlated with the threshold of latent information. We then extend that conclusion to complex problems.
Our contributions are summarized as follows:

• We show that the significance of an action is related to the capacity of the learned latent information, resulting in different thresholds for the factors.
• We propose a novel framework, the fractional variational autoencoder (FVAE), to extract explanatory action sequences step by step; at each step, it learns specific actions by blocking the information of the others.

We organize the rest of this paper as follows. Sec. 2 describes the development of unsupervised disentanglement learning and the proposed methods based on VAEs. In Sec. 3, through an example, we show that the disentangled representations are relative to the data itself, and we further introduce a novel concept, disentangling action sequences. Then, we investigate the inductive biases on the data and find that significant actions have high thresholds of latent information. In Sec. 4, we propose a step-by-step disentangling framework, the fractional VAE (FVAE), to disentangle action sequences. For the labelled and unlabelled tasks, we respectively introduce two methods to measure their thresholds. We then evaluate FVAE on a labelled dataset (dSprites, Matthey et al. (2017)) and an unlabelled dataset (3D Chairs, Aubry et al. (2014)). Finally, we conclude the paper and discuss future work in Sec. 5.
2 UNSUPERVISED DISENTANGLEMENT LEARNING
We first introduce the abstract concepts and basic definitions, followed by the explanations based on information theory and other related work. This article focuses on the information-theoretic explanation and the proposed models based on VAEs.
2.1 THE CONCEPT
Disentanglement learning is fascinating and challenging because of its intrinsic similarity to human intelligence. As depicted in the seminal paper by Bengio et al., humans can understand and reason from a complex observation to its explanatory factors. A common modelling assumption of disentanglement learning is that the observed data are generated by a set of ground-truth factors. Usually, the data have a high number of dimensions and are hence hard to understand, whereas the factors have a low number of dimensions and are thus simpler and easier to understand. The task of disentanglement learning is to uncover the ground-truth factors, which are invisible to the training process in an unsupervised setting. The invisibility of the factors makes disentanglement hard to define and measure (Do & Tran, 2020). Furthermore, it is shown in Locatello et al. (2019) that it is impossible to disentangle the underlying factors of arbitrary generative models without inductive biases in the unsupervised setting. In particular, they suggest that the inductive biases on the models and the data should be exploited. However, they do not provide a formal definition of inductive bias, and such a definition is still unavailable.
2.2 INFORMATION BOTTLENECK
Most of the dominant disentangling approaches are variants of the variational autoencoder (VAE). The VAE is a popular generative model assuming that the latent variables obey a specific prior (a normal distribution in practice). The key idea of the VAE is to maximize the likelihood objective through the following approximation:

L(θ, φ; x, z) = E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z)),   (1)

which is known as the evidence lower bound (ELBO); the conditional probabilities p_θ(x|z) and q_φ(z|x) are parameterized with deep neural networks.

Higgins et al. find that the KL term of the VAE encourages disentanglement and introduce a hyperparameter β in front of the KL term. They propose the β-VAE, maximizing the following expression:

L(θ, φ; x, z) = E_{q_φ(z|x)}[log p_θ(x|z)] − β D_KL(q_φ(z|x) || p(z)).   (2)

β controls the pressure for the posterior q_φ(z|x) to match the factorized unit Gaussian prior p(z). Higher values of β lead to a lower implicit capacity of the latent information and ambiguous reconstructions. Burgess et al. propose the Annealed-VAE, which progressively increases the information capacity of the latent code during training:

L(θ, φ; x, z, C) = E_{q_φ(z|x)}[log p_θ(x|z)] − γ |D_KL(q_φ(z|x) || p(z)) − C|,   (3)

where γ is a constant large enough to constrain the latent information, and C is a value gradually increased from zero to a large number to produce high reconstruction quality. As the total information bottleneck gradually increases, they hypothesize that the model allocates the capacity yielding the most improvement of the reconstruction log-likelihood to the encoding axes of the corresponding factors. However, they did not explain why each factor makes a different contribution to the reconstruction log-likelihood.
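As a concrete reference for Eqs. (1) and (2), the KL term has a closed form when the posterior is a diagonal Gaussian and the prior is the unit Gaussian. The following is a minimal plain-Python sketch (not the paper's code); `recon_log_lik` is assumed to be computed elsewhere by the decoder:

```python
import math

def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def beta_vae_loss(recon_log_lik, mu, log_var, beta=1.0):
    """Negative of Eq. (2): reconstruction term minus the beta-weighted KL."""
    return -(recon_log_lik - beta * kl_diag_gaussian(mu, log_var))
```

With β = 1 this reduces to the negative ELBO of Eq. (1); larger β penalizes latent information more strongly, which is the pressure discussed above.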
2.3 OTHER RELATED WORK

The other dominant direction starts from the prior over the factors. These works assume that the ground-truth factors are independent of each other, and a series of methods enforce the latent variables to have the same structure as the factors. FactorVAE (Kim & Mnih, 2018) applies a discriminator to approximately calculate the total correlation (TC, Watanabe (1960)); β-TCVAE (Chen et al., 2018) promotes the TC penalty by decomposing the KL term; DIP-VAE (Kumar et al., 2018) regularizes the covariance matrix of q(z). However, real-world problems impose no strict constraint on the prior over the factors. For instance, hair length and gender are two independent factors describing a person, but in reality women are more likely to have long hair.
3 DISENTANGLING ACTION SEQUENCES

For machines and humans, disentangling the underlying factors is a challenging task. For instance, there are more than 1,400 breeds of dogs in the world, and it seems impossible for an ordinary person to distinguish all of them just by looking at their pictures. The challenge of disentangling the underlying factors is mainly due to the complexity of establishing relationships without supervision, where the corresponding models must contain some level of prior knowledge or inductive biases. However, it is possible to determine differences without extensive knowledge. For example, one may mistakenly identify the breed of a dog from a picture, but it is almost impossible to misrecognize a dog as a cat. Therefore, in practice, discovering differences or similarities is often a much easier task than uncovering the underlying factors, and it does not need much prior knowledge either. One may conclude that discovering the commonalities between things is an important step toward disentanglement.

3.1 DISENTANGLED REPRESENTATIONS
Although the disentangling approaches can learn informative, independent, and meaningful representations on dSprites (Matthey et al., 2017), these models disentangle the ground-truth factors by accident. In other words, there is no reason for the latent variables to have any particular structure, since there always exists a decoder that can map the latent variables into the desired output. To determine the representations that the current approaches learn, we design a dataset family to reveal the common aspects of these representations.

We create a toy dataset family: each dataset contains 40x40 images generated from an original image of an 11x5 rectangle by translating it on a 64x64 canvas. Each image in a dataset has a unique label (position of X, position of Y) describing the ground-truth factors. In this dataset family, there are two variables: the orientation of the rectangle and the way the two factors are determined. There are infinitely many ways to determine these two factors; the polar coordinate system and the Cartesian coordinate system are the most common. We create a baseline dataset, A1, with a horizontal rectangle in the Cartesian coordinate system, and obtain its variants: A2 differs in that the positions are determined by the polar coordinate system, and A3 differs in the orientation (45 degrees) of the images. For the experimental settings, we choose the well-examined baseline model, β-VAE (β = 50), and the backbone network follows the settings in Locatello et al. (2019).

Figure 1: Learned representations on A1, A2, and A3. (a) The factors are projected into the latent space. (b) Each row denotes the latent traversal on the specific dataset.

As shown in Fig. 1(a), we visualize the learned representations in the latent space. In the left column, each point represents an image with the annotation of the corresponding factors.
We also link the points sharing the same controlling factor with coloured lines to show the ground-truth transformation. We argue that the factors are not the key to disentanglement, since the learned representations change while the factors are unchanged (A1 vs. A3), and the learned representations do not change while the factors are changed (A1 vs. A2). From Fig. 1, one can see that the invariant in these three cases is that the models learn a representation moving along the direction of the long side of the rectangle and the orthogonal direction. The disentangled representations take the rectangle as the reference system.
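The baseline dataset A1 described above can be reproduced in a few lines. This is a sketch based on our reading of the construction (a horizontal 11x5 rectangle translated over a 40x40 grid of positions on a 64x64 canvas), not the authors' released code:

```python
def make_sample(x0, y0, w=11, h=5, canvas=64):
    """One 64x64 binary image: a w-by-h rectangle with top-left corner at (x0, y0)."""
    return [[1 if x0 <= x < x0 + w and y0 <= y < y0 + h else 0
             for x in range(canvas)] for y in range(canvas)]

def make_dataset(n_pos=40):
    """A1: all n_pos x n_pos translations, each labelled with (pos_x, pos_y)."""
    return [((x0, y0), make_sample(x0, y0))
            for x0 in range(n_pos) for y0 in range(n_pos)]
```

A2 and A3 would differ only in the coordinate system used for the labels and in a 45-degree rotation of the rectangle, respectively.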
3.2 ACTION SEQUENCES

We assume that the images of the observed data are produced by a set of actions such as rotation, translation, and scaling. Given a subset of images from the original dataset, we define a meaningful action as a sequence, i.e., an ordered permutation of elements from the subset. To obtain such sequences, one can get an action by traversing the ground-truth factors:

S = {x_i} = T({c_i}),   (4)

where T is a transformation, {c_i} is the set of factors, and x_i is an image from the dataset. Therefore, disentanglement learning can be interpreted as a process of learning meaningful action sequences. Furthermore, these actions have associated parameters that illustrate the changing processes in detail, i.e., a parameterized action sequence represents an action. We define an action sequence as a subset of images from the dataset that reveals the relationships among the images. The formal definition of action sequences can be described as the latent traversal in Higgins et al. (2017a).

For an autoencoder (AE) architecture, the neural networks are flexible enough to approximate any actions. For clarity, we define an action as a sequence of images controlled by the ground-truth factors, and an action sequence as the approximation of this action generated by a neural network. Owing to the expressive power of neural networks, the sequences may be organized in any given order. As shown in Fig. 2, the neural network can form any possible sequence, and among them, the ordered sequences follow human intuition: the circle becomes larger or smaller. Using VAEs, minimizing the objective increases the overlap between the posterior distributions across the dataset (Burgess et al., 2018), which leads to disentanglement and matches human intuition.

Figure 2: An example of action sequences. There are three images of differing sizes in the dataset (left box), and the AE learns a representation with one latent dimension (the double-arrow lines). The total number of possible sequences is six; two of them are meaningful, and the others are somewhat random.

Figure 3: Comparison of entropy and KL divergence on A4. (a) Entropy. (b) KL divergence.
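Figure 2's counting argument can be checked directly: with three images there are 3! = 6 ordered sequences, of which only the two monotonic-in-size ones are meaningful. A small illustrative snippet (the size labels are ours):

```python
from itertools import permutations

sizes = [1, 2, 3]  # three circle images, identified by their size
seqs = list(permutations(sizes))

# "meaningful" sequences grow or shrink monotonically, matching human intuition
meaningful = [s for s in seqs
              if list(s) == sorted(s) or list(s) == sorted(s, reverse=True)]
```

The other four orderings are the "somewhat random" sequences that an unconstrained autoencoder could equally well represent.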
3.3 SIGNIFICANCE OF SEQUENCES
Although Locatello et al. (2019) address the importance of inductive biases, few studies investigate the biases on the data for disentanglement. As shown in Sec. 3.1, the orientation of the rectangle can affect the direction of the disentangled representation. We assume that there are many more inductive biases on the data that have not yet been exploited. It is suggested in Burgess et al. (2018) that the latent components make different contributions to the objective function. Here, we investigate the reasons for these different contributions and show that the capacity of the latent information is correlated with the significance of the sequences. To measure the significance of an action, we use the information contained in the action:

H(S) = −∫_S p(x) log(p(x)) dx,   (5)

where S is the action sequence controlled by the factor.

We design a translation family (A4) with two controllable parameters, θ and L, to determine the significance of the sequence; θ is the orientation of the rectangle, and L is the maximal distance of translation. All datasets have only one factor, the position on the X-axis, and thus contain one sequence. If the value of L is small enough, the images in the sequence are almost identical, and the sequence is referred to as insignificant. Both θ and L affect the significance of the sequence. As shown in Fig. 3, the trend of the KL divergence is consistent with that of the entropy. Note that the maxima of both are reached when θ = 90 and L is at its maximum. This implies that the significance is closely related to the information within the sequence.

Therefore, a higher significance of the sequence results in more information in the latent variables. In contrast, as the pressure on the KL term gradually increases, the latent information decreases until it reaches zero. One can infer that there exists a threshold of latent information.
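To make Eq. (5) concrete for an A4-like family, the entropy of a traversal can be estimated empirically by counting distinct images along the sequence. The sketch below uses our own simplifying assumptions (a 1-D "bar" instead of the rectangle, uniform p(x)), so it illustrates the trend rather than reproducing the paper's measurement:

```python
import math
from collections import Counter

def bar_image(x0, w=3, canvas=8):
    """A 1-D binary 'image': a bar of width w starting at pixel x0."""
    return tuple(1 if x0 <= x < x0 + w else 0 for x in range(canvas))

def translation_sequence(L, steps=5, canvas=8):
    """Traverse the single factor (position) from 0 to L in equal steps;
    a small L yields many duplicate images, i.e. an insignificant sequence."""
    return [bar_image(round(i * L / (steps - 1)), canvas=canvas)
            for i in range(steps)]

def sequence_entropy(images):
    """Discrete proxy for Eq. (5): entropy of the empirical distribution
    over distinct images in the sequence."""
    counts = Counter(images)
    n = len(images)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

With L = 0 every image coincides and the entropy is zero; growing L produces more distinct images and hence higher entropy, mirroring the trend in Fig. 3.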
Experimental results show a positive correlation between the significance of the data and the threshold of latent information.

4 PLAIN IMPLEMENTATION ACCORDING TO THRESHOLDS
Figure 4: (a) The architecture of FVAE. Though the samples are distributed randomly in the dataset, they have intrinsic significance. Under high pressure (β), the significant actions can pass their information along (the red line), while the information of insignificant actions is blocked. (b) The decoder receives the label information except for that of the target action.

We have discussed the situation with only one action; in general, actions of different significance have different thresholds. In particular, if β is large enough, the information of the insignificant actions is blocked, and the disentangling process decays to a single-factor discovery problem. From the modelling perspective, the learning process is then similar to a single-action learning problem. However, the difficulty of disentanglement is that different kinds of ground-truth actions are mixed, and a single fixed parameter β is unable to separate them. Therefore, a plain idea is to set different thresholds for the learning phases, and in each phase to enforce the model to learn specific actions by blocking the information of the secondarily significant actions. We propose a fractional variational autoencoder (FVAE) which disentangles the action sequences step by step. The architecture of FVAE is shown in Fig. 4(a). The encoder consists of several groups of sub-encoders, and the input of the decoder is the concatenated codes of all sub-encoders. Besides, to prevent re-entangling the learned actions, we set different learning rates for the sub-encoders: reducing the learning rate for the remaining N-1 groups prevents the model from allocating the targeted action to the already-learned codes. The training process of FVAE is similar to a common operation in chemistry for separating mixtures: distillation. To separate a mixture of liquids, we repeat the process of heating the liquid; in each step, the heating temperature is chosen so that only one component is collected.

Discussion. Although AnnealedVAE follows the same principles as FVAE, it differs in the interpretation of the effects of β, and it does not explicitly prevent mixing the factors. Moreover, the performance of AnnealedVAE depends on the choice of hyperparameters in practice (Locatello et al., 2019). "A large enough value" is hard to determine, and the disentangled representation is re-entangled for an extremely large C. To address this issue, we introduce two methods to determine the thresholds in each phase, for the labelled and unlabelled tasks respectively.
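The grouping-and-freezing mechanism can be sketched structurally. This is our own illustrative skeleton, not the authors' implementation; `sub_encoders` and `decoder` stand in for arbitrary network callables:

```python
class FVAE:
    """Structural sketch of the FVAE encoder grouping: N sub-encoders whose
    codes are concatenated before a shared decoder; at each training stage
    only one sub-encoder runs at the full learning rate."""

    def __init__(self, sub_encoders, decoder, base_lr=1e-3, frozen_factor=0.01):
        self.sub_encoders = sub_encoders  # callables: x -> list of latent codes
        self.decoder = decoder            # callable on the concatenated code
        self.base_lr = base_lr
        self.frozen_factor = frozen_factor

    def encode(self, x):
        # the decoder input is the concatenation of all sub-encoder codes
        return [z for enc in self.sub_encoders for z in enc(x)]

    def stage_learning_rates(self, active):
        # a reduced rate for the other N-1 groups discourages re-entangling
        # the actions already captured in earlier stages
        return [self.base_lr if i == active else self.base_lr * self.frozen_factor
                for i in range(len(self.sub_encoders))]
```

The `frozen_factor` value is a placeholder of ours; the paper only specifies that the remaining groups use a reduced learning rate.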
Figure 5: β vs. KL divergence on dSprites (left) and 3D Chairs (right). (a) β-VAE. (b) FVAE.

Figure 6: Comparison between β-VAE and FVAE. (a) and (b) show the relationship between z and the factors on dSprites (only active units are shown). (c) The disentanglement scores (MIG, Chen et al. (2018)) of β-VAE and FVAE.
4.1 LABELLED TASK

For the labelled setting, we focus on one type of action and clip the rest at first. However, the samples of a single action are usually insufficient; for example, there are only three types of shapes in dSprites. Besides, the label information may be corrupted, and only parts of the dataset may be labelled. To address these issues, we introduce the architecture shown in Fig. 4(b), in which the label information except for that of the target actions is directly provided to the decoder. We evaluate FVAE on the dSprites dataset, which involves five actions: translating along X, translating along Y, rotating, scaling, and shape changing. We first measure the threshold of each action; the result is shown in Fig. 5(a). One can see that the thresholds of translating and scaling are higher than the others. This suggests that these actions are significant and easy to disentangle, in line with the results in Burgess et al. (2018); Higgins et al. (2017a).

According to these thresholds, we arrange three stages for dSprites. At each stage, we set β to a value larger than the threshold of the secondary action. The pressure on the KL term prevents the insignificant actions from being disentangled and ensures that the model only learns the information of the target action. The training details of each stage can be found in the Appendix. As shown in Fig. 7(a), the translation factors are disentangled first and easily, while it is hard to distinguish shape, orientation, and scale. Gradually, scaling and orientation also emerge in order. Nevertheless, shape remains hard to separate. This can be attributed to the lack of commonalities between the three shapes in dSprites and of motion compensation for a smooth transition. In other words, in terms of shape, the lack of intermediate states between different shapes is an inevitable hurdle for disentanglement. Fig. 6 shows further substantial differences between β-VAE and FVAE.
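Once the per-action thresholds are measured, the staged schedule can be written down mechanically. The helper below is hypothetical (the name, the 1.2 safety margin, and the final halving are ours, not values from the paper): each stage's β sits just above the threshold of the next-most-significant action, so only the current target passes information, and the final stage drops below every threshold:

```python
def stage_betas(thresholds, margin=1.2, final_factor=0.5):
    """Given the measured latent-information thresholds of the actions,
    return one beta per stage. Stage i uses a beta just above the threshold
    of the (i+1)-th most significant action; the last stage drops below all
    thresholds so the least significant action can also pass."""
    ordered = sorted(thresholds, reverse=True)  # most significant first
    betas = [ordered[i + 1] * margin for i in range(len(ordered) - 1)]
    betas.append(ordered[-1] * final_factor)    # final stage: below every threshold
    return betas
```

For instance, thresholds of 60, 20, and 4 would yield a three-stage schedule in which each β blocks exactly the not-yet-targeted actions.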
Compared to FVAE, β-VAE shows unstable performance, and position information is entangled with orientation in some dimensions.

Figure 7: FVAE disentangles action sequences step-by-step. Latent traversals at each stage: (a) results on dSprites (left); (b) results on 3D Chairs (right).

4.2 UNLABELLED TASK
For the unlabelled setting, we introduce the annealing test to detect the potential components. At the beginning, a very large value of β is set to ensure that no action is learned. Then, we gradually decrease β to disentangle the significant actions. There exist critical points at which the latent information starts increasing, and each such point approximates the threshold of the corresponding action. 3D Chairs is an unlabelled dataset containing 1,394 3D chair models collected from the Internet. Fig. 5(b) shows the result of the annealing test on 3D Chairs. One can recognize three points at which the latent information suddenly increases: 60, 20, and 4. Therefore, we arrange a three-stage training process for 3D Chairs (more details in the Appendix). As shown in Fig. 7(b), one can see the change of azimuth in the first stage, the change of size in the second stage, and the change of leg style, backrest, and material in the third stage.
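The annealing test can be automated as a change-point scan over the measured KL values; a hedged sketch, where the 1-nat jump criterion is our own assumption rather than a value from the paper:

```python
def detect_thresholds(betas, kl_values, jump=1.0):
    """Annealing-test sketch: betas is a decreasing schedule and kl_values
    the total latent information (KL) measured at each beta. Report the
    betas at which the information jumps by more than `jump` nats; each
    such point approximates the threshold of one action."""
    return [betas[i] for i in range(1, len(betas))
            if kl_values[i] - kl_values[i - 1] > jump]
```

Applied to a curve like Fig. 5(b), the detected betas become the stage thresholds for the subsequent step-by-step training.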
5 CONCLUSION

We demonstrated an example of the effect of image orientation on the disentangled representations. We further investigated the inductive biases on the data by introducing the concept of disentangling action sequences, and we regarded that as discovering the commonality between things, which is essential for disentanglement. The experimental results revealed that actions with higher significance have larger thresholds of latent information. We further proposed the fractional variational autoencoder (FVAE) to disentangle action sequences of differing significance step by step, and evaluated its performance on dSprites and 3D Chairs. The results suggest robust disentanglement in which re-entangling is prevented.

This paper proposed a novel tool to study inductive biases through action sequences. However, other properties of the inductive biases on the data remain to be exploited. The current work focuses on an alternative explanation for disentanglement from the perspective of information theory. In the future, the influence of independence on disentanglement requires further investigation.

REFERENCES
Mathieu Aubry, Daniel Maturana, Alexei A. Efros, Bryan C. Russell, and Josef Sivic. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In CVPR, pp. 3762–3769, 2014.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. TPAMI, 35(8):1798–1828, 2013.

Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE, 2018. URL http://arxiv.org/abs/1804.03599.

Tian Qi Chen, X. Li, Roger B. Grosse, and D. Duvenaud. Isolating sources of disentanglement in variational autoencoders. ArXiv, abs/1802.04942, 2018.

Kien Do and Truyen Tran. Theory and evaluation metrics for learning disentangled representations. In ICLR. OpenReview.net, 2020.

Ahmed Elgammal, Bingchen Liu, Mohamed Elhoseiny, and Marian Mazzone. CAN: Creative adversarial networks, generating art by learning about styles and deviating from style norms. arXiv preprint arXiv:1706.07068, 2017.

I. Higgins, Nicolas Sonnerat, Loic Matthey, A. Pal, C. Burgess, Matko Bosnjak, M. Shanahan, M. Botvinick, Demis Hassabis, and Alexander Lerchner. SCAN: Learning hierarchical compositional visual concepts. arXiv: Machine Learning, 2018.

Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, pp. 1–13, 2017a.

Irina Higgins, Arka Pal, Andrei A. Rusu, Loïc Matthey, Christopher Burgess, Alexander Pritzel, Matthew Botvinick, Charles Blundell, and Alexander Lerchner. DARLA: Improving zero-shot transfer in reinforcement learning. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp. 1480–1490. PMLR, 2017b.

Insu Jeon, Wonkwang Lee, and Gunhee Kim. IB-GAN: Disentangled representation learning with information bottleneck GAN, 2019.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, pp. 4153–4171, 2018.

Abhishek Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. ArXiv, abs/1711.00848, 2018.

Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, and Marc'Aurelio Ranzato. Fader networks: Manipulating images by sliding attributes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), NIPS 30, pp. 5967–5976. Curran Associates, Inc., 2017.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In ICML, pp. 7247–7283, 2019.

Romain Lopez, Jeffrey Regier, N. Yosef, and Michael I. Jordan. Information constraints on auto-encoding variational Bayes. In NeurIPS, 2018.

Loic Matthey, Irina Higgins, Demis Hassabis, and Alexander Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms, 2017.

K. Ridgeway. A survey of inductive biases for factorial representation-learning. ArXiv, abs/1612.05299, 2016.

Andrew M. Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan D. Tracey, and David D. Cox. On the information bottleneck theory of deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124020, 2019.

Jürgen Schmidhuber. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992. ISSN 08997667. doi: 10.1162/neco.1992.4.6.863.

B. Schölkopf, D. Janzing, J. Peters, Eleni Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In ICML, 2012.

Satosi Watanabe. Information theoretical analysis of multivariate correlation. IBM Journal of Research and Development, 4(1):66–82, 1960.

Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1004–1013, 2018.

Yizhe Zhu, Jianwen Xie, Bingchen Liu, and Ahmed Elgammal. Learning feature-to-feature translator by alternating back-propagation for zero-shot learning. arXiv preprint arXiv:1904.10056, 2019.
A APPENDIX
A.1 DATASETS
A.2 DSPRITES AND 3D CHAIRS
Figure 8: Real samples from the training datasets. (a) dSprites. (b) 3D Chairs.

A.2.1 DESIGNED DATASETS
A.3 TRAINING DETAILS
The basic architecture for all experiments follows the settings in Burgess et al. (2018). The hyperparameters of our proposed methods are listed in Tab. 1. Tab. 2 and Tab. 3 show the measured thresholds of the intrinsic action sequences.
Figure 9: Samples from A1, A2, and A3. We visualize the samples as an afterimage: the image starts with a dark colour and becomes brighter as the factor is traversed.

Table 1: Training phases.
Phase   1    2    3
β      100   40    4