Variance Loss in Variational Autoencoders
Andrea Asperti
University of Bologna, Department of Informatics: Science and Engineering (DISI), [email protected]
Abstract.
In this article, we highlight what appears to be a major issue of Variational Autoencoders (VAEs), evinced from an extensive experimentation with different network architectures and datasets: the variance of generated data is significantly lower than that of training data. Since generative models are usually evaluated with metrics such as the Fréchet Inception Distance (FID), which compare the distributions of (features of) real versus generated images, the variance loss typically results in degraded scores. The problem is particularly relevant in a two-stage setting [8], where a second VAE is used to sample in the latent space of the first VAE. The reduced variance creates a mismatch between the actual distribution of latent variables and those generated by the second VAE, which hinders the beneficial effects of the second stage. By renormalizing the output of the second VAE towards the expected normal spherical distribution, we obtain a sudden burst in the quality of generated samples, as also confirmed in terms of FID.
1 Introduction

Since their introduction [19,21], Variational Autoencoders (VAEs) have rapidly become one of the most popular frameworks for generative modeling. Their appeal mostly derives from their strong probabilistic foundation; moreover, they are traditionally reputed to grant more stable training than Generative Adversarial Networks (GANs) [12].

However, the behaviour of Variational Autoencoders is still far from satisfactory, and many well-known theoretical and practical challenges still hinder this generative paradigm. We may roughly identify four main (interrelated) topics that have been addressed so far:

- balancing issue [5,18,15,7,8,3]: a major problem of VAEs is the difficulty of finding a good compromise between sampling quality and reconstruction quality. The VAE loss function is a combination of two terms with somewhat contrasting effects: the log-likelihood, aimed at reducing the reconstruction error, and the Kullback-Leibler divergence, acting as a regularizer of the latent space with the final purpose of improving generative sampling (see Section 2 for details). Finding a good balance between these components during training is a complex and delicate issue;
- variable collapse phenomenon [6,27,2,25,8]: the KL-divergence component of the VAE loss function typically induces a parsimonious use of latent variables, some of which may be altogether neglected by the decoder, possibly resulting in an under-exploitation of the network capacity; whether this is a beneficial side effect of regularization (sparsity) or an issue to be solved (overpruning) is still debated;
- training issues: VAEs approximate expectations through sampling during training, which may increase the variance of gradients [6,26]; this and other issues require some attention in the initialization, validation, and annealing of hyperparameters [5,15,4];
- aggregate posterior vs. expected prior mismatch [18,8,1,11]: even after a satisfactory convergence of training, there is no guarantee that the learned aggregate posterior distribution will match the latent prior. This may be due to the choice of an overly simplistic prior distribution; alternatively, the issue can be addressed by learning the actual distribution, either via a second VAE or by ex-post estimation by means of different techniques.

The main contribution of this article is to highlight an additional issue that, to the best of our knowledge, has never been pointed out so far: the variance of generated data is significantly lower than that of training data.

This observation resulted from a long series of experiments with a large variety of different architectures and datasets. The variance loss is systematic, although its extent may vary, and looks roughly proportional to the reconstruction loss.

The problem is relevant because generative models are traditionally evaluated with metrics such as the popular Fréchet Inception Distance (FID) that compare the distributions of (features of) real versus generated images: any bias in generated data usually results in a severe penalty in terms of FID score.

The variance loss is particularly serious in a two-stage setting [8], where we use a second VAE to sample in the latent space of the first VAE. The reduced variance induces a mismatch between the actual distribution of latent variables and those generated by the second VAE, substantially hindering the beneficial effects of the second stage.

We address the issue by a simple renormalization of the generated data to match the expected variance (which should be 1 in the case of a two-stage VAE). This simple expedient, in combination with a new balancing technique for the VAE loss function discussed in a different article [3], are the basic ingredients that permitted us to obtain the best FID scores ever achieved with variational techniques over traditional datasets such as CIFAR-10 and CelebA.

The cause of the reduced variance is not easy to identify. A plausible explanation is the following. It is well known that, in the presence of multimodal output, the mean squared error objective typically results in blurriness, due to averaging (see [14]). Variational Autoencoders are intrinsically multimodal, due to the sampling process during training, which entails averaging around the input data X in the data manifold, finally resulting in the blurriness so typical of Variational Autoencoders [10]. The reduced variance is just a different facet of the same phenomenon: averaging on the data manifold eventually reduces the variance of data, due to Jensen's inequality.

The structure of the article is the following. Section 2 contains a short introduction to Variational Autoencoders from an operational perspective, focusing on the regularization effect of the Kullback-Leibler component of the loss function. In Section 3, we discuss the variance loss issue, relating it to a similar problem of Principal Component Analysis, and providing experimental evidence of the phenomenon. Section 4 is devoted to our approach to the variance loss, with experimental results on CIFAR-10 and CelebA, two of the most common datasets in the field of generative modeling.
A summary of the content of the article and concluding remarks are given in Section 5.

2 Variational Autoencoders

A Variational Autoencoder is composed of an encoder, computing an inference distribution Q(z|X), and a decoder, computing the posterior probability P(X|z). Supposing that Q(z|X) has a Gaussian distribution N(μ_z(X), σ_z(X)) (different for each data point X), computing it amounts to computing its first two moments: so we expect the encoder to return the standard deviation σ_z(X) in addition to the mean value μ_z(X).

During decoding, instead of starting the reconstruction from μ_z(X), we sample around this point with the computed standard deviation:

ẑ = μ_z(X) + σ_z(X) · δ

where δ is random normal noise (see Figure 1). This may be naively understood as a way to inject noise into the latent representation, with the aim of improving the robustness of the autoencoder; in fact, it has a much stronger theoretical foundation, well addressed in the literature (see e.g. [9]). Observe that sampling is outside the backpropagation flow; by backpropagating the reconstruction error (typically, mean squared error), we correct the current estimation of σ_z(X), along with the estimation of μ_z(X).

Fig. 1: VAE architecture: the encoder computes Q(z|X), i.e. μ_z(X) and σ_z(X); the decoder computes P(X|z); the loss combines the reconstruction error ||X − X̂||² with KL[Q(z|X) || N(0,1)].

Without further constraints, σ_z(X) would naturally collapse to 0: as a matter of fact, μ_z(X) is the expected encoding, and the autoencoder would have no reason to sample away from this value. The variational autoencoder adds an additional component to the loss function, preventing Q(z|X) from collapsing to a Dirac distribution: specifically, we try to bring each Q(z|X) close to the prior distribution P(z) by minimizing their Kullback-Leibler divergence KL(Q(z|X) || P(z)).

If we average this quantity over all input data, and expand the KL-divergence in terms of entropy, we get:

$$
\begin{aligned}
\mathbb{E}_X\, KL(Q(z|X)\,\|\,P(z)) &= -\mathbb{E}_X H(Q(z|X)) + \mathbb{E}_X H(Q(z|X), P(z))\\
&= -\mathbb{E}_X H(Q(z|X)) - \mathbb{E}_X \mathbb{E}_{z\sim Q(z|X)} \log P(z)\\
&= -\mathbb{E}_X H(Q(z|X)) - \mathbb{E}_{z\sim Q(z)} \log P(z)\\
&= \underbrace{-\mathbb{E}_X H(Q(z|X))}_{\text{avg. entropy of } Q(z|X)} + \underbrace{H(Q(z), P(z))}_{\text{cross-entropy of } Q(z) \text{ vs } P(z)}
\end{aligned}
\tag{1}
$$

By minimizing the cross-entropy between the distributions, we are pushing Q(z) towards P(z). Simultaneously, we aim to augment the entropy of each Q(z|X); assuming Q(z|X) is Gaussian, this amounts to enlarging the variance, with the effect of improving the coverage of the latent space, essential for good generative sampling. The price we have to pay is more overlapping, and hence more confusion, between the encodings of different data points, likely resulting in a worse reconstruction quality.

We already supposed that Q(z|X) has a Gaussian distribution N(μ_z(X), σ_z(X)). Moreover, provided the decoder is sufficiently expressive, the shape of the prior distribution P(z) can be arbitrary, and for simplicity it is usually assumed to be a normal distribution P(z) = N(0, 1). KL(Q(z|X) || P(z)) is hence the KL-divergence between the two Gaussian distributions N(μ_z(X), σ_z(X)) and N(0, 1),
which can be computed in closed form:

$$
KL\big(N(\mu_z(X), \sigma_z(X))\,\|\,N(0, 1)\big) = \frac{1}{2}\Big(\mu_z(X)^2 + \sigma_z(X)^2 - \log\big(\sigma_z(X)^2\big) - 1\Big)
\tag{2}
$$
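To make these two ingredients concrete, the following is a minimal TensorFlow sketch of the sampling step and of the closed-form KL term of Equation 2. It is our own illustration (parameterizing the encoder output as log σ², a common numerical choice), not the implementation used in the paper.

```python
import tensorflow as tf

def sample_z(mu, log_sigma2):
    # reparameterization: z = mu + sigma * delta, with delta ~ N(0, 1);
    # the random noise itself stays outside the backpropagation flow
    delta = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_sigma2) * delta

def kl_loss(mu, log_sigma2):
    # closed-form KL divergence of Equation 2, KL(N(mu, sigma) || N(0, 1)),
    # summed over latent dimensions and averaged over the batch
    kl = 0.5 * (tf.square(mu) + tf.exp(log_sigma2) - log_sigma2 - 1.0)
    return tf.reduce_mean(tf.reduce_sum(kl, axis=-1))
```

During training, `kl_loss` is added to the reconstruction error, and `sample_z` replaces the deterministic encoding μ_z(X) fed to the decoder.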
The closed form helps to get some intuition about the way the regularizing effect of the KL-divergence is supposed to work. The quadratic penalty μ_z(X)² centers the latent space around the origin; moreover, under the assumption of a fixed ratio between μ_z(X) and σ_z(X) (rescaling is an easy operation for a neural network), it is easy to prove [1] that expression 2 has a minimum when μ_z(X)² + σ_z(X)² = 1. So we expect

$$\mathbb{E}_X\, \mu_z(X) = 0 \tag{3}$$

and also, assuming 3 and some further approximation (see [1] for details),

$$\mathbb{E}_X\, \mu_z(X)^2 + \mathbb{E}_X\, \sigma_z(X)^2 = 1 \tag{4}$$

If we look at Q(z) = E_X Q(z|X) as a Gaussian Mixture Model (GMM) composed of a different Gaussian Q(z|X) for each X, the two previous equations express the first two moments of the GMM, confirming that they coincide with those of a normal prior. Equation 4, which we call the variance law, provides a simple sanity check to ensure that the regularization effect of the KL-divergence is working as expected.

Of course, even if the first two moments of the aggregate inference distribution Q(z) are 0 and 1, it could still be very far from a Normal distribution. The possible mismatch between Q(z) and the expected prior P(z) is likely the most problematic aspect of VAEs since, as observed by several authors [16,22,1], it could compromise the whole generative framework. Possible approaches consist in revising the VAE objective by encouraging the aggregate inference distribution to match P(z) [23], or in exploiting more complex priors [17,24,4].

An interesting alternative, addressed in [8], is that of training a second VAE to learn an accurate approximation of Q(z); samples from a Normal distribution are first used to generate samples of Q(z), which are then fed to the actual generator of data points. Similarly, in [11], the authors give an ex-post estimation of Q(z), e.g. imposing a distribution with sufficient complexity (they consider a combination of 10 Gaussians, reflecting the ten categories of MNIST and CIFAR-10).

These two works provide the current state of the art in generative frameworks based on variational techniques (hence, excluding models based on adversarial training), so we shall mostly compare with them.

3 The Variance Loss

Autoencoders, and especially variational ones, seem to suffer from a systematic loss of variance of reconstructed/generated data with respect to the source data. Suppose we have a training set X of n data points, each one with m features, and let X̂ be the corresponding set of reconstructed data. We measure the (mean) variance loss as the mean, over the features (the second axis), of the difference between the variances of the features of X and of X̂ (variances computed over the data, i.e. over the first axis):

mean(var(X) - var(X̂))

Not only is this quantity always positive, but it is also approximately equal to the mean squared error (mse) between X and X̂:

mse(X, X̂) = mean((X - X̂)²)

where the mean is here computed over all axes.

We observed the variance loss issue over a large variety of neural architectures and datasets. In particular cases, we can also give a theoretical explanation of the phenomenon, which looks strictly related to averaging. This is for instance the case of Principal Component Analysis (PCA), where the variance loss is precisely equal to the reconstruction error (it is well known that a shallow autoencoder implements PCA, see e.g. [13]).
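Both checks introduced so far are straightforward to compute. The sketch below is our own numpy formulation (function names are ours, not from the paper's code): the variance law of Equation 4 from the latent statistics μ and σ returned by the encoder, and the variance loss of reconstructions compared with the mse.

```python
import numpy as np

def variance_law(mu, sigma):
    # Equation 4: E_X[mu_z(X)^2] + E_X[sigma_z(X)^2] should be close to 1
    # mu, sigma: arrays of shape (n, latent_dim) returned by the encoder
    return np.mean(mu ** 2) + np.mean(sigma ** 2)

def variance_loss(X, X_hat):
    # per-feature variance computed over the data (first) axis,
    # then averaged over the features (second axis)
    return np.mean(np.var(X, axis=0) - np.var(X_hat, axis=0))

def mse(X, X_hat):
    # mean squared error, averaged over all axes
    return np.mean((X - X_hat) ** 2)

# toy example: a reconstruction that shrinks data towards its mean loses variance
X = np.random.randn(1000, 64)
X_hat = 0.9 * X
print(variance_loss(X, X_hat), mse(X, X_hat))
```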
Let us discuss this simple case first, since it helps to clarify the issue.

Principal Component Analysis (PCA) is a well-known statistical procedure for dimensionality reduction. The idea is to project data into a lower-dimensional space via an orthogonal linear transformation, choosing the system of coordinates that maximizes the variance of the data (the principal components). These are easily computed as the eigenvectors with the largest eigenvalues of the covariance matrix of the given dataset (centered around its mean point).

Fig. 2: The principal component is the green line. Projecting the red points on it, we maximize their variance or, equivalently, we minimize their quadratic distance.

Fig. 3: The green line is a smoother version of the blue line, obtained by averaging values in a suitable neighborhood of each point. The two lines have the same mean; the mean squared error between them is 0.546, the variance loss is 2.648.

Since the distance of each point from the origin is fixed, by the Pythagorean theorem, maximizing its variance is equivalent to minimizing its quadratic distance from the hyperplane defined by the principal components. For the same reason, the quadratic error of the reconstruction is equal to the sum of the variances of the components which have been neglected.

This is a typical example of variance loss due to averaging. Since we renounce some components, the best we can do along them is to take the mean value. We entirely lose the variance along these directions, and this loss is paid in terms of reconstruction error.
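This equality is easy to verify numerically; the following is a small self-contained check of ours (synthetic correlated data, PCA computed via SVD), not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 2000, 20, 5                                            # n points, m features, k retained components
X = rng.standard_normal((n, m)) @ rng.standard_normal((m, m))    # correlated synthetic data

mean = X.mean(axis=0)
Xc = X - mean                                                    # center the data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
X_hat = Xc @ Vt[:k].T @ Vt[:k] + mean                            # project on the first k principal components

var_loss = np.mean(np.var(X, axis=0) - np.var(X_hat, axis=0))
rec_err = np.mean((X - X_hat) ** 2)
print(var_loss, rec_err)                                         # the two values coincide (up to numerical error)
```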
We expect a similar phenomenon even with more expressive networks. The idea is expressed in Figure 3. Think of the blue line as the real data manifold; due to averaging, the network reconstructs a smoother version of the input data, resulting in a significant loss in terms of variance.
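The effect sketched in Figure 3 can be reproduced in a few lines; this is our own toy illustration (a random walk smoothed by a moving average), and the actual numbers depend on the signal and on the window size.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(500).cumsum()                       # a wiggly one-dimensional signal (the "blue line")
w = 15
x_smooth = np.convolve(x, np.ones(w) / w, mode='valid')     # local averaging (the "green line")
x_crop = x[w // 2 : w // 2 + len(x_smooth)]                  # align the original with the smoothed signal

print("mse          :", np.mean((x_crop - x_smooth) ** 2))
print("variance loss:", np.var(x_crop) - np.var(x_smooth))   # typically positive: smoothing loses variance
```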
The need for averaging may have several motivations: it could be caused by a dimensionality reduction, as in the case of PCA, but also, in the case of variational autoencoders, it could derive from the Gaussian sampling performed before reconstruction. Since the noise injected during sampling is completely unpredictable, the best the network can do is to reconstruct an "average image" corresponding to a portion of the latent space around the mean value μ_z(X), spanning an area proportional to the variance σ_z(X)².

In Figure 4, we plot the relation between mean squared error (mse) and variance loss for reconstructed images, computed over a large variety of different neural architectures and datasets: the distribution is close to the diagonal. Typically, the variance loss for generated images is even greater. We must also account for a few pathological cases, not reported in the figure, occurring with dense networks with very high capacity, easily prone to overfitting. In these cases, the mse is usually relatively high, while the variance loss may drop to 0.

Fig. 4: Relation between mean squared error and variance loss. The different colors refer to different neural architectures: • (blue) Dense Networks; • (red) ResNet-like; • (green) Convolutional Networks; • Iterative Networks (DRAW-GQN-like).

In the general, deep case, however, it is not easy to relate the variance loss to the mean squared error. We just discuss a few cases.

If, for each data point X, the reconstructed value X̂ lies between X and its mean value μ, it is easy to prove that the mean squared error is a lower bound to the variance loss (the worst case is when X̂ = μ, where the variance loss is exactly equal to the mean squared error, as in the PCA case).

Similarly, let X_p be an arbitrary permutation of the elements of X and let X̂ = (X + X_p)/2. Then, the mean squared distance between X and X̂ is equal to the variance loss.
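A quick numerical check of this last equality (our own sketch, on synthetic data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((1000, 16)) * rng.uniform(0.5, 2.0, size=16)   # n data points, m features
X_p = X[rng.permutation(len(X))]        # an arbitrary permutation of the data points
X_hat = (X + X_p) / 2                   # average each point with a permuted one

mse = np.mean((X - X_hat) ** 2)
var_loss = np.mean(np.var(X, axis=0) - np.var(X_hat, axis=0))
print(mse, var_loss)                    # the two quantities coincide
```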
However, the previous property does not generalize when we average over an arbitrary number of permutations; usually the mean squared error is lower than the variance loss, but we can also find examples of the contrary. We are still looking for a comfortable theoretical formulation of the property we are interested in.

4 Addressing the Variance Loss

As we explained in the introduction, the variance loss issue has great practical relevance. Generative models are traditionally evaluated with metrics such as the popular Fréchet Inception Distance (FID), aimed at comparing the distributions of real versus generated images through a comparison of extracted features. In the case of FID, the considered features are Inception features; Inception is usually preferred over other models due to the limited amount of preprocessing performed on input images. As a consequence, a bias in generated data may easily result in a severe penalty in terms of FID score (see [20] for an extensive analysis of FID in relation to the training set).

The variance loss is particularly dangerous in a two-stage setting [8], where a second VAE is used to sample in the latent space of the first VAE, in order to fix the possible mismatch between the aggregate inference distribution Q(z) and the expected prior P(z). The reduced variance induces a mismatch between the actual distribution of latent variables and those generated by the second VAE, hindering the beneficial effects of the second stage.

A simple way to address the variance loss issue consists in renormalizing generated data to match the actual variance of real data by applying a multiplicative scaling factor (a minimal code sketch is given at the end of this section). We implemented this simple approach in our own variant of the two-stage model of Dai and Wipf, based on a new balancing strategy between the reconstruction loss and the Kullback-Leibler divergence described in [3]. We refer to this latter work for details about the structure of the network, hyperparameter configuration, and training settings, which are clearly outside the scope of this article. The code is available at https://github.com/asperti/BalancingVAE. In Figure 5 we provide examples of randomly generated faces. Note the particularly sharp quality of the images, so unusual for variational approaches.

Both for CIFAR-10 and CelebA, the renormalization operation results in an improvement in terms of FID scores, particularly significant in the case of CelebA, as reported in Tables 1 and 2. To the best of our knowledge, these are the best generative results ever obtained for these datasets without relying on adversarial training. In the tables, we compare our generative model with the original two-stage model in [8] and with the recent deterministic model in [11]; as mentioned above, these approaches represent the state of the art for variational (non-adversarial) generative modeling.

Fig. 5: Examples of generated faces. The resulting images do not show the blurred appearance so typical of variational approaches, significantly improving their perceptive quality.

Table 1: CIFAR-10: summary of results. FID scores (columns REC, GEN-1, GEN-2) for RAE-l2 [11] (128 vars), 2S-VAE [8], and our model.

Table 2: CelebA: summary of results. FID scores (columns REC, GEN-1, GEN-2) for RAE-SN [11], 2S-VAE [8], and our model.
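For concreteness, the renormalization step can be sketched as follows. This is our own reading of the multiplicative correction (applied per latent component, towards the unit variance expected by the second stage), and the decoder names in the commented pipeline are hypothetical, not taken from the repository above.

```python
import numpy as np

def renormalize(z, expected_std=1.0):
    # multiplicative rescaling of generated latent codes so that their
    # standard deviation matches the expected one (1 for the spherical
    # normal prior targeted by the second stage)
    return z * (expected_std / z.std(axis=0))

# hypothetical two-stage sampling pipeline:
#   u = np.random.randn(n_samples, latent_dim_2)
#   z = decoder_2(u)          # approximate samples of Q(z), with reduced variance
#   z = renormalize(z)        # push the second moment back towards 1
#   X_gen = decoder_1(z)      # generated images from the first-stage decoder
```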
Fig. 6: Faces generated from the same random seed with and without latent space re-normalization (right and left respectively). Images on the right have better contrast and more definite contours.

In Figure 6 we show the difference between faces generated from the same random seed with and without latent space re-normalization. We hope that the quality of the images allows the reader to appreciate the improvement: renormalized images (on the right) have more precise contours, sharper contrasts, and more definite details.

5 Conclusions

In this article, we stressed an interesting and important problem typical of autoencoders, and especially of variational ones: the variance of generated data can be significantly lower than that of training data. We addressed the issue with a simple renormalization of generated data towards the expected moments of the data distribution, permitting us to obtain significant improvements in the quality of generated data, both in terms of perceptual assessment and FID score. On typical datasets such as CIFAR-10 and CelebA, this technique, in conjunction with a new balancing strategy between the reconstruction error and the Kullback-Leibler divergence, allowed us to obtain what seem to be the best generative results ever achieved without the use of adversarial training.
References
1. Andrea Asperti. About generative aspects of variational autoencoders. In Machine Learning, Optimization, and Data Science - 5th International Conference, LOD 2019, Siena, Italy, September 10-13, 2019, Proceedings, pages 71-82, 2019.
2. Andrea Asperti. Sparsity in variational autoencoders. In Proceedings of the First International Conference on Advances in Signal Processing and Artificial Intelligence, ASPAI, Barcelona, Spain, 20-22 March 2019, 2019.
3. Andrea Asperti and Matteo Trentin. Balancing reconstruction error and Kullback-Leibler divergence in Variational Autoencoders. CoRR, abs/2002.07514, Feb 2020.
4. Matthias Bauer and Andriy Mnih. Resampled priors for variational autoencoders. CoRR, abs/1810.11428, 2018.
5. Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015.
6. Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015.
7. Christopher P. Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in beta-VAE. 2018.
8. Bin Dai and David P. Wipf. Diagnosing and enhancing VAE models. In Seventh International Conference on Learning Representations (ICLR 2019), May 6-9, New Orleans, 2019.
9. Carl Doersch. Tutorial on variational autoencoders. CoRR, abs/1606.05908, 2016.
10. Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. In Daniel D. Lee, Masashi Sugiyama, Ulrike von Luxburg, Isabelle Guyon, and Roman Garnett, editors, Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 658-666, 2016.
11. Partha Ghosh, Mehdi S. M. Sajjadi, Antonio Vergari, Michael J. Black, and Bernhard Schölkopf. From variational to deterministic autoencoders. CoRR, abs/1903.12436, 2019.
12. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative Adversarial Networks. ArXiv e-prints, June 2014.
13. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016.
14. Ian J. Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. CoRR, abs/1701.00160, 2017.
15. Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2017.
16. Matthew D. Hoffman and Matthew J. Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. In Workshop in Advances in Approximate Bayesian Inference, NIPS, volume 1, 2016.
17. Diederik P. Kingma, Tim Salimans, Rafal Józefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improving variational autoencoders with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4736-4744, 2016.
18. Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016.
19. Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, 2014.
20. Daniele Ravaglia. Performance dei variational autoencoders in relazione al training set. Master's thesis, University of Bologna, School of Science, Session II 2020.
21. Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, volume 32 of JMLR Workshop and Conference Proceedings, pages 1278-1286. JMLR.org, 2014.
22. Mihaela Rosca, Balaji Lakshminarayanan, and Shakir Mohamed. Distribution matching in variational inference, 2018.
23. Ilya O. Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schölkopf. Wasserstein auto-encoders. CoRR, abs/1711.01558, 2017.
24. Jakub M. Tomczak and Max Welling. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, AISTATS 2018, 9-11 April 2018, Playa Blanca, Lanzarote, Canary Islands, Spain, pages 1214-1223, 2018.
25. Brian Trippe and Richard Turner. Overpruning in variational Bayesian neural networks. In Advances in Approximate Bayesian Inference workshop at NIPS 2017, 2018.
26. George Tucker, Andriy Mnih, Chris J. Maddison, John Lawson, and Jascha Sohl-Dickstein. REBAR: low-variance, unbiased gradient estimates for discrete latent variable models. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 2627-2636, 2017.
27. Serena Yeung, Anitha Kannan, Yann Dauphin, and Li Fei-Fei. Tackling over-pruning in variational autoencoders.