NeurIPS 2019 Disentanglement Challenge: Improved Disentanglement through Aggregated Convolutional Feature Maps
Journal of Machine Learning Research 1:1–7, 2019. NeurIPS 2019 Disentanglement Challenge.
Maximilian Seitzer [email protected]
Fraunhofer Institute for Integrated Circuits IIS, Intelligent Systems Group, Erlangen, Germany
Abstract
This report on our stage 1 submission to the NeurIPS 2019 disentanglement challenge presents a simple image preprocessing method for training VAEs that leads to improved disentanglement compared to directly using the images. In particular, we propose to use regionally aggregated feature maps extracted from CNNs pretrained on ImageNet. Our method achieved the 2nd place in stage 1 of the challenge (AIcrowd, 2019). Code is available at https://github.com/mseitzer/neurips2019-disentanglement-challenge.
1. Introduction
The representational power and utility of feature representations obtained from deep CNNs trained on large image datasets such as ImageNet (Russakovsky et al., 2014) is well-known. Amongst others, they are routinely used by practitioners to improve performance in transfer learning scenarios, and they form the basis for perceptual loss functions (Johnson et al., 2016). A common view explaining the success of deep convolutional representations is that they describe an image in an abstract, concise way, simplifying downstream tasks such as classification (Bengio et al., 2012). Thus, a natural hypothesis is that it is easier for a Variational Autoencoder (VAE) (Kingma and Welling, 2014) to disentangle the latent factors of variation from this abstract description than from the image itself. Therefore, in our challenge submission, we employ pretrained CNNs to extract convolutional feature maps as a preprocessing step before training the VAE. To reduce the high-dimensional feature maps and fit the challenge's resource restrictions, we propose to aggregate the feature maps using a regional pooling technique from the context of image retrieval.
2. Method
Our method consists of the following three steps: (1) from each image in the dataset, extract a convolutional feature map using a CNN pretrained on ImageNet (section 2.1); (2) aggregate each feature map into a feature vector and store it in memory (section 2.2); (3) train a VAE to reconstruct the feature vectors and disentangle the latent factors of variation (section 2.3). Appendix A contains further comments on the hyperparameter choices and lists some other approaches we tested for the challenge.

2.1. Feature Map Extraction

To extract convolutional feature maps from the images, we use the VGG19-BN architecture (Simonyan and Zisserman, 2014) from the torchvision package. In particular, we use the pretrained ImageNet weights (https://download.pytorch.org/models/vgg19_bn-c79401a0.pth) without further finetuning them in any way. Input images are transformed to the format the pretrained network expects, i.e. we bilinearly resize them to 224 × 224 pixels and standardize them using the per-channel mean and variance computed on the ImageNet dataset. We use the outputs of the last layer before the final average pooling, resulting in a spatial feature map of size 512 × 7 × 7.

2.2. Feature Aggregation

As the memory limitations of the challenge prohibit us from storing the full feature maps in memory, we choose to aggregate them into feature vectors. This also appears sensible, as the dimensionality of the full feature maps is actually larger than that of the input images (3 × 64 × 64 = 12,288 versus 512 × 7 × 7 = 25,088). To aggregate, we use regional maximum activations of convolutions (RMAC) (Tolias et al., 2015). In object retrieval, the goal is to find, from a collection of images, the image a target object appears in. Tolias et al. (2015) achieve this by matching a feature vector carrying the object's "signature" against an RMAC feature vector for each image. To allow matching against all the different objects that appear in an image, RMAC aggregates the signatures of objects at different scales and locations into the image feature vector. We assume that this property of RMAC is also useful in our case, as we need to consider different objects (e.g. on the MPI3D dataset (Gondal et al., 2019), the robotic arm and the object) to find the latent factors of variation from feature maps, but we do not know the scale and location of these objects a priori.

We compute RMAC by applying max-pooling operations with different kernel sizes and strides to the feature maps (without any padding), resulting in a set of 512-dimensional feature vectors. Concretely, we use kernel sizes 1 × 1, 3 × 3, and 5 × 5. The resulting regional feature vectors are ℓ2-normalized, summed, and the sum is ℓ2-normalized again to obtain the final aggregated feature vector.

2.3. VAE Training

Finally, we train a standard β-VAE (Higgins et al., 2017) on the set of aggregated feature vectors resulting from the previous step. The encoder network consists of three fully-connected layers with 256, 128, and 64 neurons, followed by two fully-connected layers parametrizing the $C = 18$ means and log variances of a normal distribution $\mathcal{N}(\mu(x), \sigma(x))$ used as the approximate posterior $q(z \mid x)$. The number of latent factors was experimentally determined. The decoder network consists of three fully-connected layers with 64, 128, and 256 neurons, followed by a fully-connected layer parametrizing the means of a normal distribution $\mathcal{N}(\hat{\mu}(z), I)$ used as the conditional likelihood $p(x \mid z)$. All fully-connected layers but the final ones use batch normalization and are followed by ReLU activation functions. We use the standard PyTorch initialization for all layers and assume a factorized standard normal distribution $\mathcal{N}(0, I)$ as the prior $p(z)$ on the latent variables.

For optimization, we use the Adam optimizer (Kingma and Ba, 2014) to minimize the loss

$$\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} (\hat{\mu}_i - x_i)^2 - \frac{\beta}{2C} \sum_{j=1}^{C} \left(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\right),$$

where β is a hyperparameter to balance the MSE reconstruction term and the KLD penalty term. As the scale of the KLD term depends on the number of latent factors C, we normalize it by C such that β can be varied independently of C. It can be harmful to start training with too much weight on the KLD term (Bowman et al., 2015). Therefore, we use the following cosine schedule to smoothly anneal β from a small initial value $\beta_{\text{start}}$ to $\beta_{\text{end}} = 0.12$ over the course of training:

$$\beta(t) = \begin{cases} \beta_{\text{start}} & \text{for } t < t_{\text{start}} \\ \beta_{\text{end}} - (\beta_{\text{end}} - \beta_{\text{start}}) \cdot \frac{1}{2}\left(1 + \cos\left(\pi \, \frac{t - t_{\text{start}}}{t_{\text{end}} - t_{\text{start}}}\right)\right) & \text{for } t_{\text{start}} \le t \le t_{\text{end}} \\ \beta_{\text{end}} & \text{for } t > t_{\text{end}} \end{cases}$$

where $\beta(t)$ is the value of β in training epoch $t \in \{0, \ldots, N - 1\}$, and annealing runs from epoch $t_{\text{start}}$ to $t_{\text{end}} = 19$. This schedule lets the model initially learn to reconstruct the data and only then puts pressure on the latent variables to be factorized, which we found to considerably improve performance.
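The aggregation step of section 2.2 can be sketched as follows. This is an illustrative NumPy version; the window strides (here: non-overlapping regions, stride equal to kernel size) and the normalize-sum-normalize order follow the general RMAC recipe of Tolias et al. (2015) and are our assumptions where the text above does not pin them down.

```python
import numpy as np

def rmac(fmap, kernel_sizes=(1, 3, 5), eps=1e-8):
    """Aggregate a (C, H, W) feature map into one C-dimensional RMAC vector.

    For each kernel size, max-pool the map over non-overlapping regions
    (stride = kernel size, no padding), l2-normalize every regional vector,
    sum the regional vectors, and l2-normalize the sum.
    """
    C, H, W = fmap.shape
    agg = np.zeros(C)
    for k in kernel_sizes:
        for i in range(0, H - k + 1, k):
            for j in range(0, W - k + 1, k):
                # regional descriptor: channel-wise max over the k x k window
                region = fmap[:, i:i + k, j:j + k].max(axis=(1, 2))
                agg += region / (np.linalg.norm(region) + eps)
    return agg / (np.linalg.norm(agg) + eps)
```

Applied to each 512 × 7 × 7 map from the extraction step, this yields one 512-dimensional vector per image, which is what the VAE is then trained on.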
3. Conclusion
Our approach was able to obtain the second place in stage 1 of the competition. On the public leaderboard (i.e. on MPI3D-realistic), our best submission achieves the first rank on the FactorVAE (Kim and Mnih, 2018), SAP (Kumar et al., 2017) and DCI (Eastwood and Williams, 2018) metrics. See appendix B for a discussion of the results.

As Locatello et al. (2018) point out, some kind of inductive bias is required for successful unsupervised disentanglement. We suggest that pretrained feature extractors can play the role of a strong inductive bias for natural image data. Our method could also be a straightforward avenue to scale disentanglement techniques to larger image sizes. This report only provides exploratory results, but we think that the initial results are promising enough to warrant further investigation.

References
AIcrowd. NeurIPS 2019: Disentanglement Challenge, 2019.

Yoshua Bengio, Aaron C. Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:1798–1828, 2012.

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating Sentences from a Continuous Space. In CoNLL, 2015.

Tian Qi Chen, Xuechen Li, Roger Baker Grosse, and David Kristjanson Duvenaud. Isolating Sources of Disentanglement in Variational Autoencoders. In ICLR, 2018.

Cian Eastwood and Christopher K. I. Williams. A Framework for the Quantitative Evaluation of Disentangled Representations. In ICLR, 2018.

Muhammad Waleed Gondal, Manuel Wüthrich, Đorđe Miladinović, Francesco Locatello, Martin Breidt, Valentin Volchkov, Joel Akpo, Olivier Bachem, Bernhard Schölkopf, and Stefan Bauer. On the transfer of inductive bias from simulation to the real world: a new disentanglement dataset. In NeurIPS, 2019.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In ICLR, 2017.

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.

Hyunjik Kim and Andriy Mnih. Disentangling by Factorising. In ICML, 2018.

Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In ICLR, 2014.

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In ICLR, 2014.

Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational Inference of Disentangled Latent Concepts from Unlabeled Observations. ArXiv, abs/1711.00848, 2017.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. In RML@ICLR, 2018.

Niki Parmar, Ashish Vaswani, Jakob Uszkoreit, Lukasz Kaiser, Noam Shazeer, Alexander Ku, and Dustin Tran. Image Transformer. In ICML, 2018.

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115:211–252, 2014.

Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2014.

Raphael Suter, Đorđe Miladinović, Bernhard Schölkopf, and Stefan Bauer. Robustly Disentangled Causal Mechanisms: Validating Deep Representations for Interventional Robustness. In ICML, 2019.

Giorgos Tolias, Ronan Sicre, and Hervé Jégou. Particular object retrieval with integral max-pooling of CNN activations. In ICLR, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
Appendix A. Further Notes
A.1. Notes on Feature Map Extraction
We experimented with features from pretrained ResNet, ResNeXt, DenseNet and VGG-19 architectures. On MPI3D-simple, ResNeXt-101 and VGG-19 outperformed ResNet and DenseNet in terms of the metrics used in the challenge. Between the two of them, we could not clearly determine which architecture works better. On MPI3D-realistic (i.e. on the evaluation server), VGG-19 showed better performance based on our limited number of trials, and thus we chose it as our feature extraction network. However, we expect that ResNeXt-101 can also be used given the right hyperparameter settings.
A.2. Notes on Feature Aggregation
Besides RMAC, we also experimented with simple spatial average- and max-pooling to aggregate the feature maps. This did not result in better performance than RMAC (given the set of other hyperparameters we tested). We conjecture this is because global pooling loses the information about the spatial locations of objects in the image that identifies some of the factors of variation. For example, the degrees of freedom of the robotic arm can easily be derived from the relative positions of object and manipulator. Compared to global pooling, RMAC enhances the ability of the VAE to infer the factors of variation by better representing the properties of different objects in the image in the aggregated representation. For example, the degrees of freedom of the robotic arm can also be derived from the specific orientation of the manipulator. However, like global pooling, RMAC does not directly encode the spatial location of objects. An approach to do so could be to overlay a positional encoding onto the feature maps before aggregation, similar to the positional encodings used in transformer models (Vaswani et al., 2017; Parmar et al., 2018).

Table 1: Summary of scores and ranks of our best submission on the private and public leaderboard at the end of stage 1.
                Dataset           FactorVAE   DCI     SAP     IRS     MIG
Private score   MPI3D-real        0.792       0.527   0.166   0.623   0.292
Public score    MPI3D-realistic   –           –       –       –       –

A.3. Notes on VAE Training
The number of latent factors C plays an important role for performance: if C is chosen too low, the reconstruction error cannot be reduced sufficiently, whereas if C is chosen too high, there is not enough pressure on the latent bottleneck to disentangle the latent factors; in both cases, performance suffers. Our best model uses C = 18. This is considerably higher than the number of latent factors of the MPI3D dataset (i.e. 7), which presumably is because feature vectors encode more information (e.g. about textures) than raw images, and thus a larger latent bottleneck is required to reconstruct the data.
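The training objective and the β annealing schedule from section 2.3 can be sketched numerically as follows. The concrete β_start and t_start values are tuned hyperparameters and are therefore passed in explicitly here rather than hard-coded.

```python
import math
import numpy as np

def beta_vae_loss(x, x_hat, mu, log_var, beta):
    """MSE reconstruction plus beta-weighted KLD, normalized by the number
    of latent factors C so that beta can be varied independently of C."""
    n, c = x.size, mu.size
    recon = np.sum((x_hat - x) ** 2) / n
    # KLD between N(mu, diag(sigma^2)) and the standard normal prior
    kld = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon + beta * kld / c

def beta_schedule(t, t_start, t_end, beta_start, beta_end):
    """Cosine annealing of beta between epochs t_start and t_end."""
    if t < t_start:
        return beta_start
    if t > t_end:
        return beta_end
    frac = (t - t_start) / (t_end - t_start)
    return beta_end - (beta_end - beta_start) * 0.5 * (1.0 + math.cos(math.pi * frac))
```

At t = t_start the cosine term equals 1 and the schedule returns β_start; at t = t_end it equals −1 and the schedule returns β_end, so the transition is smooth at both ends.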
Appendix B. Discussion of Results on the Public Leaderboard
We summarize the results of our best submission on the public and private leaderboards in table 1. On the private leaderboard (i.e. on MPI3D-real), our approach achieves the first rank on the FactorVAE (Kim and Mnih, 2018) metric, with a particularly large difference of 0.11 to the second ranked entry. Our submission is also second ranked on DCI (Eastwood and Williams, 2018) and SAP (Kumar et al., 2017), with small differences of respectively 0.017 and 0.012 to the first ranked entries. Compared to the simulation dataset MPI3D-realistic, there is a slight drop across all metrics besides IRS (Suter et al., 2019), reflecting the increased difficulty of disentangling natural images compared to simulation data.

On the public leaderboard (i.e. on MPI3D-realistic), our method achieves the first rank on FactorVAE, SAP and DCI. On FactorVAE, there is a particularly large margin of 0.044 absolute difference to the next best method on this metric. Our method only falls behind on IRS, where it is ranked 26th, with 0.145 absolute distance to the best method. In our experiments, there seemed to be a correlation between IRS and the amount of pressure on factorizing the latent factors (i.e. the β value in the loss function). As a consequence, if training collapses and the KLD loss term approaches zero, the IRS can still reach high values. This explains the number of submissions with higher IRS values (but considerably lower scores on the other metrics) than our method. In particular, the default submission has an IRS value of 0.6199, but fails to provide good disentanglement otherwise. Overall, we think that the results show the potential of our approach.