WiSE-ALE: Wide Sample Estimator for Approximate Latent Embedding

Abstract

Variational Auto-encoders (VAEs) have been very successful as methods for forming compressed latent representations of complex, often high-dimensional, data. In this paper, we derive an alternative variational lower bound from the one common in VAEs, which aims to minimize aggregate information loss. Using our lower bound as the objective function for an auto-encoder enables us to place a prior on the bulk statistics, corresponding to an aggregate posterior for the entire dataset, as opposed to a single sample posterior as in the original VAE. This alternative form of prior constraint allows individual posteriors more flexibility to preserve necessary information for good reconstruction quality. We further derive an analytic approximation to our lower bound, leading to an efficient learning algorithm - WiSE-ALE. Through various examples, we demonstrate that WiSE-ALE can reach excellent reconstruction quality in comparison to other state-of-the-art VAE models, while still retaining the ability to learn a smooth, compact representation.


Shuyu Lin, Ronald Clark, Robert Birke, Niki Trigoni, Stephen Roberts


1. Introduction

Unsupervised learning is a central task in machine learning. Its objective can be informally described as learning a representation of some observed forms of information in a way that the representation summarizes the overall statistical regularities of the data (Barlow, 1989). Deep generative models are a popular choice for unsupervised learning, as they marry deep learning with probabilistic models to estimate a joint probability between high-dimensional input variables $x$ and unobserved latent variables $z$. Early successes of deep generative models came from Restricted Boltzmann Machines (Hinton & Salakhutdinov, 2006) and Deep Boltzmann Machines (Salakhutdinov & Hinton, 2009), which aim to learn a compact representation of data. However, the fully stochastic nature of the network requires layer-by-layer pre-training using MCMC-based sampling algorithms, resulting in heavy computation cost.

Kingma & Welling (2013) consider the objective of optimizing the parameters in an auto-encoder network by deriving an analytic solution to a variational lower bound of the log likelihood of the data, leading to the Auto-Encoding Variational Bayes (AEVB) algorithm. They apply a reparameterization trick to maximally utilize deterministic mappings in the network, significantly simplifying the training procedure and reducing instability. Furthermore, a regularization term naturally occurs in their model, allowing a prior $p(z)$ to be placed over every sample embedding $q(z \mid x)$. As a result, the learned representation becomes compact and smooth; see e.g. Fig. 1, where we learn a 2D embedding of MNIST digits using 4 different methods and visualize the aggregate posterior distribution of 64 random samples in the learnt 2D embedding space.

However, because the choice of the prior is often uninformative, the smoothness constraint imposed by this regularization term can cause information loss between the input samples and the latent embeddings, as shown by the merging of individual embedding distributions in Fig. 1(d) (especially in the outer areas away from the zero code). Extreme effects of such behaviour can be seen in β-VAE (Higgins et al., 2016), a derivative algorithm of AEVB which further increases the weighting on the regularizing term with the aim of learning an even smoother, disentangled representation of the data. As shown in Fig. 1(e), the individual embedding distributions are almost indistinguishable, leading to an overly severe information bottleneck which can cause high rates of distortion (Tishby et al., 1999). The other end of the spectrum is indicated by Fig. 1(b), where perfect reconstruction can be achieved but the learnt embedding distributions appear severely sharp, indicating a latent representation which is heavily non-smooth and likely to be unstable under a small amount of noise.

Affiliations: University of Oxford, UK; Imperial College London, UK; ABB Corporate Research, Switzerland. Correspondence to: Shuyu Lin, Ronald Clark, Stephen Roberts. arXiv [cs.LG].

[Figure 1 panels: (a) auto-encoding framework, (b) WAE, (c) WiSE-ALE, (d) AEVB, (e) β-VAE; latent axes labelled z; annotation running from better reconstruction quality to smoother embedding space.]

Figure 1. (a) Learning a 2D embedding of MNIST handwritten digits through an auto-encoding framework. Embedding distributions (aggregate posteriors) of 64 randomly drawn MNIST digits when WAE (b), our proposed WiSE-ALE (c), AEVB (d) or β-VAE (e) is used for the learning. Different learning algorithms find a different level of tradeoff between the reconstruction quality (information preservation) and the smoothness of the posterior distribution.

In this paper, we propose WiSE-ALE (a wide sample estimator), which imposes a prior on the bulk statistics of a mini-batch of latent embeddings. Learning under our WiSE-ALE objective does not penalize individual embeddings lying away from the zero code, so long as the aggregate distribution (the average of all individual embedding distributions) does not violate the prior significantly. Hence, our approach mitigates the distortion caused by the current form of the prior constraint in the AEVB objective. Furthermore, the objective of our WiSE-ALE algorithm is derived by applying variational inference in a simple latent variable model (Section 2) and, with further approximation, we derive an analytic form of the learning objective, resulting in an efficient learning algorithm.

In general, the latent representation learned using our algorithm enjoys the following properties: 1) smoothness, as indicated in Fig. 1(c), where the probability density for each individual embedding distribution decays smoothly from the peak value; 2) compactness, as individual embeddings tend to occupy a maximal local area in the latent space with minimal gaps in between; and 3) separation, indicated by the narrow but clear borders between neighbouring embedding distributions, as opposed to the merging seen in AEVB.

In summary, our contributions are:

• An alternative variational lower bound to the data log likelihood is derived, allowing us to impose a prior constraint on the bulk statistics of a mini-batch of embedding distributions.
• Analytic approximations to the lower bound are derived, allowing efficient optimization without sampling-based methods and leading to our WiSE-ALE algorithm.
• Extensive analysis of our algorithm's performance in comparison with three related VAE algorithms, namely AEVB, β-VAE and WAE (Tolstikhin et al., 2017).

In the rest of the paper, we first review directed graphical models in Section 2. We then derive our variational lower bound and its analytic approximations in Section 3. Related work is discussed in Section 4. Experiment results are analyzed in Section 5, leading to conclusions in Section 6.

2. Background: Latent Variable Models

Here we briefly review the latent variable model that allows variational inference through an auto-encoding task and highlight the difference between the latent variable model for our WiSE-ALE algorithm and that for the AEVB algorithm (Kingma & Welling, 2013).

Given $N$ observations of input samples $x \in \mathbb{R}^{d_x}$, denoted $\mathcal{D}_N = \left(x^{(1)}, x^{(2)}, \cdots, x^{(N)}\right)$, we assume $x$ is generated from a latent variable $z \in \mathbb{R}^{d_z}$ of a much lower dimension. Here we denote $x$ and $z$ as random variables, $x^{(i)}$ or $z^{(i)}$ as the $i$-th input or latent code sample (i.e. a vector), and $x_i$ and $z_i$ as the random variables for $x^{(i)}$ and $z^{(i)}$. As shown in Fig. 2(a), this generative process can be modelled by a simple directed graphical model (Jordan et al., 1999), which models the joint probability distribution $p_\theta(x, z \mid \mathcal{D}_N) = p_\theta(x \mid z)\, p(z \mid \mathcal{D}_N) = p_\theta(z \mid x)\, p(x \mid \mathcal{D}_N)$ between $x$ and $z$, given the current observations $\mathcal{D}_N$. Here $p(z \mid \mathcal{D}_N)$ is the latent distribution given $\mathcal{D}_N$, $p(x \mid \mathcal{D}_N)$ is the data distribution for $\mathcal{D}_N$, and $p_\theta(x \mid z)$ and $p_\theta(z \mid x)$ denote the complex transformation from the latent to the input space and the reverse, where the transformation mapping is parameterised by $\theta$. The learning task is to estimate the optimal set of $\theta$ so that this latent variable model can explain the data $\mathcal{D}_N$ well.

Figure 2. (a) Directed graphical models for the generative model between observation $x$ and latent variable $z$. (b) Latent variable model for the AEVB algorithm, where $z$ indicates $N$ random variables for $N$ latent codes. (c) Latent variable model for our WiSE-ALE algorithm, where $z$ is a single random variable for the aggregate posterior of the entire dataset. (d) A generic neural network implementation suitable for both (b) and (c).

As the inference of the latent variable $z$ given $x$ (i.e. $p_\theta(z \mid x)$) cannot be directly estimated because $p(x \mid \mathcal{D}_N)$ is unknown, both AEVB (Fig. 2(b)) and our WiSE-ALE (Fig. 2(c)) resort to a variational method to approximate the target distribution $p_\theta(z \mid x)$ by a proposal distribution $q_\phi(z \mid x)$, with the modified learning objective that both $\theta$ and $\phi$ are optimised so that the model can explain the data well and $q_\phi(z \mid x)$ approaches $p_\theta(z \mid x)$. The primary difference between the AEVB model and our WiSE-ALE model lies in how the joint probability $p_\theta(x, z \mid \mathcal{D}_N)$ is modelled and specifically whether we assume an individual random variable for each latent code $z^{(i)}$. The AEVB model assumes a pair of random variables $(x_i, z_i)$ for each $x^{(i)}$ and estimates the joint probability as

$$
\begin{aligned}
p_\theta(x, z \mid \mathcal{D}_N) &= p_\theta(x_1, x_2, \cdots, x_N, z_1, z_2, \cdots, z_N \mid \mathcal{D}_N) && (1) \\
&= p_\theta(x_1, x_2, \cdots, x_N \mid z_1, z_2, \cdots, z_N)\; p_\theta(z_1, z_2, \cdots, z_N \mid \mathcal{D}_N) && (2) \\
&= \prod_{i=1}^{N} p_\theta(x_i \mid z_1, z_2, \cdots, z_N) \prod_{i=1}^{N} p_\theta(z_i \mid \mathcal{D}_N) && (3) \\
&= \prod_{i=1}^{N} p_\theta(x_i \mid z_i) \prod_{i=1}^{N} p_\theta(z_i \mid \mathcal{D}_N) && (4) \\
&= \prod_{i=1}^{N} \Big( p_\theta(x_i \mid z_i)\, p_\theta(z_i \mid \mathcal{D}_N) \Big). && (5)
\end{aligned}
$$

The equality between Eq. 2 and Eq. 3 can only be made with the assumption that the generation process for each $x_i$ is independent (first product in Eq. 3) and each $z_i$ is also independent (second product in Eq. 3). Such an interpretation of the joint probability leads to the latent variable model in Fig. 2(b), and the prior constraint (often taken as $\mathcal{N}(0, I)$ to encourage shrinkage when no data is observed) is imposed on every $z_i$.

In contrast, our WiSE-ALE model takes a single random variable to estimate the latent distribution for the entire dataset $\mathcal{D}_N$. Hence, the joint probability in our model can be broken down as

$$
\begin{aligned}
p_\theta(x, z \mid \mathcal{D}_N) &= p_\theta(x_1, x_2, \cdots, x_N, z \mid \mathcal{D}_N) && (6) \\
&= p_\theta(x_1, x_2, \cdots, x_N \mid z)\; p_\theta(z \mid \mathcal{D}_N) && (7) \\
&= p_\theta(z \mid \mathcal{D}_N) \prod_{i=1}^{N} p_\theta(x_i \mid z), && (8)
\end{aligned}
$$

leading to the latent variable model illustrated in Fig. 2(c). The only assumption we make in our model is that the generative processes of different input samples are independent given the latent distribution of the current dataset, which we consider sensible. More significantly, we do not require independence between different $z_i$, as opposed to the AEVB model, leading to a more flexible model. Furthermore, the prior constraint in our model is naturally imposed on the aggregate posterior $p(z \mid \mathcal{D}_N)$ for the entire dataset, leaving each individual sample latent code more flexibility to shape an embedding distribution that better preserves information about the corresponding input sample.

Neural networks can be used to parameterize $p_\theta(x_i \mid z_i)$ in the generative model and $q_\phi(z_i \mid x_i)$ in the inference model from the AEVB latent variable model, or $p_\theta(x_i \mid z)$ and $q_\phi(z \mid x_i)$ correspondingly from our WiSE-ALE latent variable model. Both networks can be implemented through an auto-encoder network, as illustrated in Fig. 2(d).

3. Our Method

In this section, we first define the aggregate posterior distribution $p(z \mid \mathcal{D}_N)$, which serves as a core concept in our WiSE-ALE proposal. We then derive a variational lower bound to the marginal log likelihood of the data, $\log p(\mathcal{D}_N)$, with the focus on the aggregate posterior distribution. Further, an analytic approximation to the lower bound is derived, allowing efficient optimization of the model parameters and leading to our WiSE-ALE learning algorithm. The intuition behind our proposal is also discussed.

Here we formally define the aggregate posterior distribution $p(z \mid \mathcal{D}_N)$, i.e. the latent distribution given the entire dataset $\mathcal{D}_N$. Considering

$$
p(z \mid \mathcal{D}_N) = \int p_\theta(z \mid x_j)\, p(x_j \mid \mathcal{D}_N)\, dx_j = \sum_{i=1}^{N} p_\theta(z \mid x_j = x^{(i)})\, P(x_j = x^{(i)} \mid \mathcal{D}_N) = \frac{1}{N} \sum_{i=1}^{N} p_\theta(z \mid x^{(i)}), \tag{9}
$$

we have the aggregate posterior distribution for the entire dataset as the average of all the individual sample posteriors. The second equality in Eq. 9 is made by approximating the integral through summation. The third equality is obtained following the conventional assumption in the VAE literature that each input sample $x^{(i)}$ is drawn from the dataset $\mathcal{D}_N$ with equal probability, i.e. $P(x^{(i)} \mid \mathcal{D}_N) = \frac{1}{N}$. Similarly, for the estimated aggregate posterior distribution $q_\phi(z \mid \mathcal{D}_N)$, we have

$$
q_\phi(z \mid \mathcal{D}_N) = \frac{1}{N} \sum_{i=1}^{N} q_\phi(z \mid x^{(i)}). \tag{10}
$$

To carry out variational inference, we minimize the KL divergence between the estimated and the true aggregate posterior distributions $q_\phi(z \mid \mathcal{D}_N)$ and $p_\theta(z \mid \mathcal{D}_N)$, i.e.

$$
D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p_\theta(z \mid \mathcal{D}_N) \big] = \mathbb{E}_{q_\phi(z \mid \mathcal{D}_N)} \left[ \log \frac{q_\phi(z \mid \mathcal{D}_N)}{p_\theta(z \mid \mathcal{D}_N)} \right]. \tag{11}
$$

Substituting $p_\theta(z \mid \mathcal{D}_N) = \frac{p_\theta(\mathcal{D}_N \mid z)\, p(z)}{p_\theta(\mathcal{D}_N)}$ in Eq. 11 and breaking down the products and fractions inside the log, we have

$$
D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p_\theta(z \mid \mathcal{D}_N) \big] = \mathbb{E}_{q_\phi(z \mid \mathcal{D}_N)} \big[ \log q_\phi(z \mid \mathcal{D}_N) - \log p_\theta(\mathcal{D}_N \mid z) - \log p(z) \big] + \log p(\mathcal{D}_N).
$$

Re-arranging the above equation, we have

$$
\log p(\mathcal{D}_N) - D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p_\theta(z \mid \mathcal{D}_N) \big] = \mathbb{E}_{q_\phi(z \mid \mathcal{D}_N)} \big[ \log p_\theta(\mathcal{D}_N \mid z) \big] - D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p(z) \big].
$$

As $D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p_\theta(z \mid \mathcal{D}_N) \big]$ is non-negative, we have obtained a variational lower bound $\mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{D}_N)$ to the marginal log likelihood of the data $\log p(\mathcal{D}_N)$ as

$$
\log p(\mathcal{D}_N) \geq \mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{D}_N) = \underbrace{\mathbb{E}_{q_\phi(z \mid \mathcal{D}_N)} \big[ \log p_\theta(\mathcal{D}_N \mid z) \big]}_{\text{reconstruction likelihood}} - \underbrace{D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p(z) \big]}_{\text{prior constraint}}. \tag{12}
$$

There are two terms in the derived lower bound: first, a reconstruction likelihood term that indicates how likely the current dataset $\mathcal{D}_N$ is to be generated by the aggregate latent posterior distribution $q_\phi(z \mid \mathcal{D}_N)$; and second, a prior constraint that penalizes severe deviation of the aggregate latent posterior distribution $q_\phi(z \mid \mathcal{D}_N)$ from the preferred prior $p(z)$, acting naturally as a regularizer. By maximizing the lower bound $\mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{D}_N)$ defined in Eq. 12, we approach $\log p(\mathcal{D}_N)$ and hence obtain a set of parameters $\theta$ and $\phi$ that find a natural balance between a good reconstruction likelihood (good explanation of the observed data) and a reasonable level of compliance with the prior assumption (achieving some preferable properties of the posterior distribution, such as smoothness and compactness).

To allow fast and efficient optimization of the model parameters $\theta$ and $\phi$, we derive analytic approximations for the two terms in our proposed lower bound (Eq. 12).

3.3.1. Approximation to Reconstruction Likelihood Term

To approximate the reconstruction likelihood term (the first term in Eq. 12), we first substitute the definition of the approximate aggregate posterior given in Eq. 10 into the expectation $\mathbb{E}_{q_\phi(z \mid \mathcal{D}_N)} \big[ \log p_\theta(\mathcal{D}_N \mid z) \big]$, i.e.

$$
\mathbb{E}_{q_\phi(z \mid \mathcal{D}_N)} \big[ \log p_\theta(\mathcal{D}_N \mid z) \big] = \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z \mid x^{(i)})} \big[ \log p_\theta(\mathcal{D}_N \mid z) \big]. \tag{13}
$$

Now we can decompose $p_\theta(\mathcal{D}_N \mid z)$ as a product of individual sample likelihoods, due to the conditional independence, i.e.

$$
\log p_\theta(\mathcal{D}_N \mid z) = \log \prod_{j=1}^{N} p_\theta(x^{(j)} \mid z) = \sum_{j=1}^{N} \log p_\theta(x^{(j)} \mid z). \tag{14}
$$

Substituting this into Eq. 13, we have

$$
\mathbb{E}_{q_\phi(z \mid \mathcal{D}_N)} \big[ \log p_\theta(\mathcal{D}_N \mid z) \big] = \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z \mid x^{(i)})} \left[ \frac{1}{N} \sum_{j=1}^{N} \log p_\theta(x^{(j)} \mid z) \right]. \tag{15}
$$

Eq. 15 can be used to evaluate the reconstruction likelihood for $\mathcal{D}_N$. However, learning directly with this reconstruction estimate does not lead to convergence in our experiments and the computation is quite costly, as for every evaluation of the reconstruction likelihood we need to evaluate $N$ expectation operations. We choose to simplify the reconstruction likelihood further to be able to reach convergence during learning, at the cost of losing the lower bound property of the objective function $\mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{D}_N)$. Firstly, we apply Jensen's inequality to the term inside the expectation in Eq. 15, leading to an upper bound of the reconstruction likelihood term as

$$
\mathbb{E}_{q_\phi(z \mid \mathcal{D}_N)} \big[ \log p_\theta(\mathcal{D}_N \mid z) \big] \leq \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z \mid x^{(i)})} \left[ \log \left( \frac{1}{N} \sum_{j=1}^{N} p_\theta(x^{(j)} \mid z) \right) \right]. \tag{16}
$$

Now $(N - 1)$ sample-wise likelihood distributions in the summation inside the log can be dropped with the assumption that $p_\theta(x^{(j)} \mid z)$ will only be non-zero if $z$ is sampled from the posterior distribution of the same sample $x^{(j)}$ at the encoder, i.e. $i = j$. Each log term then reduces to $\log p_\theta(x^{(i)} \mid z) - \log N$, and the approximation becomes

$$
\mathbb{E}_{q_\phi(z \mid \mathcal{D}_N)} \big[ \log p_\theta(\mathcal{D}_N \mid z) \big] \leq \sum_{i=1}^{N} \mathbb{E}_{q_\phi(z \mid x^{(i)})} \Big[ \log p_\theta(x^{(i)} \mid z) \Big] - N \log N. \tag{17}
$$

Using the approximation of the reconstruction likelihood term given by Eq. 17 rather than Eq. 15, we are able to reach convergence efficiently during learning, at the cost of the estimated objective no longer remaining a lower bound to $\log p(\mathcal{D}_N)$. Details of deriving the above approximation are given in Appendix A.

3.3.2. Approximation to Prior Constraint Term

The prior constraint term $D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p(z) \big]$, the second term in our objective function (Eq. 12), evaluates the KL divergence between the approximate aggregate posterior distribution $q_\phi(z \mid \mathcal{D}_N)$ and a zero-mean, unit-variance Gaussian distribution $p(z)$. Here we assume that each sample-wise posterior distribution can be modelled by a factorial Gaussian distribution, i.e. $q_\phi(z \mid x^{(i)}) = \prod_{k=1}^{d_z} \mathcal{N}\big( z_k \mid \mu_k(x^{(i)}), \sigma_k^2(x^{(i)}) \big)$, where $k$ indicates the $k$-th dimension of the latent variable $z$, and $\mu_k(x^{(i)})$ and $\sigma_k^2(x^{(i)})$ are the mean and variance of the $k$-th dimension embedding distribution for the input $x^{(i)}$. Therefore, $D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p(z) \big]$ computes the KL divergence between a mixture of Gaussians (as in Eq. 10) and $\mathcal{N}(0, I)$. There is no analytical solution for such KL divergences. Hence, we derive an analytic upper bound allowing for efficient evaluation.

Firstly, we substitute $q_\phi(z \mid \mathcal{D}_N) = \frac{1}{N} \sum_{i=1}^{N} q_\phi(z \mid x^{(i)})$ (Eq. 10) into $D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p(z) \big]$, giving

$$
D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p(z) \big] = \frac{1}{N} \sum_{i=1}^{N} \Big( \mathbb{E}_{q_\phi(z \mid x^{(i)})} \big[ \log q_\phi(z \mid \mathcal{D}_N) \big] - \mathbb{E}_{q_\phi(z \mid x^{(i)})} \big[ \log p(z) \big] \Big). \tag{18}
$$

Applying Jensen's inequality, i.e. $\mathbb{E}_x \big[ \log f(x) \big] \leq \log \mathbb{E}_x \big[ f(x) \big]$, to the first term inside the summation in Eq. 18, we have

$$
\begin{aligned}
D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p(z) \big] &\leq \mathrm{KL}^{\mathrm{UB}}_{\mathrm{approx}} && (19) \\
&= \frac{1}{N} \sum_{i=1}^{N} \Big( \log \mathbb{E}_{q_\phi(z \mid x^{(i)})} \big[ q_\phi(z \mid \mathcal{D}_N) \big] \Big) - \frac{1}{N} \sum_{i=1}^{N} \Big( \mathbb{E}_{q_\phi(z \mid x^{(i)})} \big[ \log p(z) \big] \Big). && (20)
\end{aligned}
$$

Taking advantage of the Gaussian assumption for $q_\phi(z \mid x^{(i)})$ and $p(z)$, we can compute the expectations in Eq. 20 analytically, with the result quoted below and the full derivation given in Appendix B.1:

$$
\mathrm{KL}^{\mathrm{UB}}_{\mathrm{approx}} = \frac{1}{N} \sum_{i=1}^{N} \log \left( \frac{1}{N} \sum_{j=1}^{N} \prod_{k=1}^{d_z} A^{-1/2} B \right) + \frac{1}{2N} \sum_{i=1}^{N} \sum_{k=1}^{d_z} C, \tag{21}
$$

where

$$
A = 2\pi \Big( (\sigma_k^{(i)})^2 + (\sigma_k^{(j)})^2 \Big), \tag{22}
$$

$$
B = \exp \left( - \frac{1}{2} \, \frac{\big( \mu_k^{(i)} - \mu_k^{(j)} \big)^2}{(\sigma_k^{(i)})^2 + (\sigma_k^{(j)})^2} \right), \tag{23}
$$

$$
C = (\sigma_k^{(i)})^2 + (\mu_k^{(i)})^2 + \log 2\pi, \tag{24}
$$

and $\mu_k^{(i)}$ and $(\sigma_k^{(i)})^2$ abbreviate $\mu_k(x^{(i)})$ and $\sigma_k^2(x^{(i)})$.

When the overall objective function $\mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{D}_N)$ in Eq. 12 is maximised, this upper bound approximation will approach the true KL divergence $D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p(z) \big]$, which ensures that the prior constraint on the overall aggregate posterior distribution takes effect.

3.3.3. Overall Objective Functions

Combining results from Sections 3.3.1 and 3.3.2, we obtain an analytic approximation $\mathcal{L}^{\text{WiSE-ALE}}_{\text{approx}}(\phi, \theta; \mathcal{D}_N)$ for the variational lower bound $\mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{D}_N)$ defined in Eq. 12, as shown below:

$$
\mathcal{L}^{\text{WiSE-ALE}}_{\text{approx}}(\phi, \theta; \mathcal{D}_N) = \sum_{i=1}^{N} \mathcal{L}(\phi, \theta \mid x^{(i)}) - \mathrm{KL}\big[ q_\phi(z \mid \mathcal{D}_N) \,\|\, p(z) \big], \tag{25}
$$

where we use $\mathcal{L}(\phi, \theta \mid x^{(i)})$ to denote the sample-wise reconstruction likelihood $\mathbb{E}_{q_\phi(z \mid x^{(i)})} \big[ \log p_\theta(x^{(i)} \mid z) \big]$ given by Eq. 17, and the KL divergence term is estimated through $\mathrm{KL}^{\mathrm{UB}}_{\mathrm{approx}}$ defined in Eq. 21. Optimizing $\mathcal{L}^{\text{WiSE-ALE}}_{\text{approx}}(\phi, \theta; \mathcal{D}_N)$ w.r.t. the model parameters $\phi$ and $\theta$, we are able to learn a model that naturally balances between a good embedding of the observed data and some preferred properties of the latent embedding distributions, such as smoothness and compactness.

[Figure 3: input samples $x_i$ ($i = 1, \ldots, N$) and their latent embeddings $z$, showing what the prior constrains under AEVB and under ours.]

Figure 3. Comparison between our WiSE-ALE learning scheme and the AEVB estimator. AEVB imposes the prior constraint on every sample embedding distribution, whereas our WiSE-ALE imposes the constraint on the overall aggregate embedding distribution over the entire dataset (over a mini-batch as an approximation for efficient learning).
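The closed-form bound of Eqs. 21 to 24 can be sketched in a few lines. The following is an illustrative sketch rather than the authors' implementation: it assumes the per-sample means and variances are available as plain Python lists of length-$d_z$ lists, it takes $B$ with the factor $\frac{1}{2}$ that the Gaussian product integral yields, and the function name `kl_upper_bound` is hypothetical.

```python
import math


def kl_upper_bound(mus, variances):
    """Analytic upper bound on KL[q(z|D_N) || N(0, I)] following Eq. 21.

    mus, variances: lists of N vectors (each a list of d_z floats) holding
    the mean and variance of every sample-wise Gaussian posterior.
    """
    n = len(mus)
    d = len(mus[0])

    # First term: (1/N) sum_i log( (1/N) sum_j prod_k A^{-1/2} B ).
    first = 0.0
    for i in range(n):
        inner = 0.0
        for j in range(n):
            log_prod = 0.0
            for k in range(d):
                s = variances[i][k] + variances[j][k]
                a = 2.0 * math.pi * s                              # A, Eq. 22
                log_b = -0.5 * (mus[i][k] - mus[j][k]) ** 2 / s    # log B, Eq. 23
                log_prod += -0.5 * math.log(a) + log_b
            inner += math.exp(log_prod)
        first += math.log(inner / n)
    first /= n

    # Second term: (1/(2N)) sum_i sum_k C, with C from Eq. 24.
    second = 0.0
    for i in range(n):
        for k in range(d):
            second += variances[i][k] + mus[i][k] ** 2 + math.log(2.0 * math.pi)
    second /= 2.0 * n

    return first + second


# Hypothetical mini-batch of three 2D posteriors.
example_mus = [[-1.0, 0.2], [0.0, -0.3], [1.2, 0.1]]
example_vars = [[0.3, 0.5], [0.4, 0.6], [0.2, 0.5]]
print(kl_upper_bound(example_mus, example_vars))
```

For a single posterior equal to the prior ($\mu = 0$, $\sigma^2 = 1$, $d_z = 1$) the expression evaluates to $\frac{1}{2}(1 - \log 2)$ rather than zero, which makes the gap introduced by Jensen's inequality in Eq. 19 visible.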

Comparing the objective function in our WiSE-ALE algorithm with that proposed in the AEVB algorithm (Kingma & Welling, 2013), stated below,

$$
\mathcal{L}^{\text{AEVB}}(\phi, \theta; \mathcal{D}_N) = \sum_{i=1}^{N} \mathcal{L}(\phi, \theta \mid x^{(i)}) - \sum_{i=1}^{N} D_{\mathrm{KL}}\big[ q_\phi(z \mid x^{(i)}) \,\|\, p(z) \big], \tag{26}
$$

we notice that the difference lies in the form of the prior constraint, as illustrated in Fig. 3. The AEVB learning algorithm imposes the prior constraint on every sample embedding, and any deviation away from the zero code or the unit variance (e.g. the variance of a sample posterior becoming less than 1, as the model becomes more certain about a specific input sample embedding) will incur a penalty. In contrast, our WiSE-ALE learning objective imposes the prior constraint on the aggregate posterior distribution, i.e. the average of all the sample embeddings. Such a prior constraint allows more flexibility for each sample posterior to settle at a mean and variance value in favour of good reconstruction quality, while preventing mean values that are too large (acting as a regulariser) or variance values that are too small (ensuring smoothness of the learnt latent representation).

To investigate the different behaviours of the two prior constraints more concretely, we consider only two embedding distributions $q(z \mid x^{(1)})$ and $q(z \mid x^{(2)})$ (red dashed lines) in a 1D latent space, as shown in Fig. 4. The mean values of the two embedding distributions are fixed to make the analysis simple and their variances are allowed to change. When the variances of the two embedding distributions are large, such as in Fig. 4(a), $q(z \mid x^{(1)})$ and $q(z \mid x^{(2)})$ have a large area of overlap and it is difficult to distinguish the input samples $x^{(1)}$ and $x^{(2)}$ in the latent space. On the other hand, when the two embedding distributions have small variances, such as in Fig. 4(c), there is clear separation between $x^{(1)}$ and $x^{(2)}$ in the latent space, indicating that the embedding only introduces a small level of information loss.
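The two-posterior setting described above can be explored numerically. The following is a minimal sketch, not the authors' code: it assumes two 1D Gaussian posteriors with hypothetical means of -1 and +1 and a shared variance, evaluates the AEVB-style constraint as the mean of the closed-form per-sample KL terms (the averaging is an assumption made here for comparability; Eq. 26 uses the sum), and evaluates the aggregate constraint by trapezoidal integration rather than by the analytic bound of Eq. 21. All function names and the integration grid are illustrative choices.

```python
import math


def gauss_pdf(z, mu, var):
    """Density of a 1D Gaussian N(mu, var) at z."""
    return math.exp(-0.5 * (z - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def kl_gauss_to_unit(mu, var):
    """Closed-form KL[N(mu, var) || N(0, 1)] in 1D."""
    return 0.5 * (var + mu ** 2 - 1.0 - math.log(var))


def kl_aggregate_to_unit(mus, variances, lo=-12.0, hi=12.0, steps=24000):
    """KL[(1/N) sum_i N(mu_i, var_i) || N(0, 1)] by the trapezoid rule."""
    h = (hi - lo) / steps
    total = 0.0
    for s in range(steps + 1):
        z = lo + s * h
        q = sum(gauss_pdf(z, m, v) for m, v in zip(mus, variances)) / len(mus)
        if q > 0.0:  # q * log q tends to 0 as q tends to 0, so skip underflowed points
            p = gauss_pdf(z, 0.0, 1.0)
            weight = 0.5 if s in (0, steps) else 1.0
            total += weight * q * (math.log(q) - math.log(p)) * h
    return total


# Sweep the shared variance of two posteriors centred at -1 and +1.
for var in (2.0, 0.5, 0.1):
    per_sample = 0.5 * (kl_gauss_to_unit(-1.0, var) + kl_gauss_to_unit(1.0, var))
    aggregate = kl_aggregate_to_unit([-1.0, 1.0], [var, var])
    print(f"variance={var}: per-sample KL={per_sample:.3f}, aggregate KL={aggregate:.3f}")
```

Because the KL divergence is convex in its first argument, the aggregate value never exceeds the mean of the per-sample values, which is the extra room the aggregate constraint leaves for sharper individual posteriors.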
Overall, the prior constraint in the AEVB objective favours embedding distributions much closer to the uninformative $\mathcal{N}(0, I)$ prior, leading to a large area of overlap between the individual posteriors, whereas our WiSE-ALE objective allows a wide range of acceptable embedding means and variances, which then offers more flexibility in the learnt posteriors to maintain a good reconstruction quality.

[Figure 4: panels (a) to (c) plot PDF against z for the unit Gaussian prior, the two individual posteriors and the aggregate posterior; the panel titles report (KL AEVB, KL WiSE) pairs of (1.60, 0.35), (0.30, 0.14) and (0.25, 0.16). Panel (d) plots KL w.r.t. the unit Gaussian against the individual posterior variance for KL AEVB and KL WiSE.]

Figure 4. Comparison of the prior constraint in our objective function and that in the AEVB objective function. In (a) to (c), red dashed lines are two sample-wise posterior distributions $q(z \mid x^{(1)})$ and $q(z \mid x^{(2)})$ which embed the inputs $x^{(1)}$ and $x^{(2)}$ in the latent space (the more separable $q(z \mid x^{(1)})$ and $q(z \mid x^{(2)})$, the easier it is to distinguish $x^{(1)}$ and $x^{(2)}$ in the latent space), the dark blue line is the $\mathcal{N}(0, I)$ prior distribution, and the light blue line is the aggregate posterior (average of the two individual posteriors). The posteriors are given by (a) the minimal KL value in the AEVB objective, (b) the minimal KL value in our WiSE-ALE objective and (c) an acceptable KL value in our WiSE-ALE objective. (d) Comparison of KL values in the AEVB and our WiSE-ALE objectives across different posterior variances.

So far our derivation has been for the entire dataset $\mathcal{D}_N$. Given a small subset $\mathcal{B}_M$ with $M$ samples randomly drawn from $\mathcal{D}_N$, we can obtain a variational lower bound for a mini-batch as:

$$
\mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{B}_M) = \sum_{i=1}^{M} \Big( \mathbb{E}_{q_\phi(z \mid x^{(i)})} \big[ \log p_\theta(x^{(i)} \mid z) \big] \Big) - D_{\mathrm{KL}}\big[ q_\phi(z \mid \mathcal{B}_M) \,\|\, p(z) \big]. \tag{27}
$$

When $\mathcal{B}_M$ is reasonably large, $\mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{B}_M)$ becomes a good approximation of $\mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{D}_N)$ through

$$
\mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{D}_N) \approx \frac{N}{M} \, \mathcal{L}^{\text{WiSE-ALE}}(\phi, \theta; \mathcal{B}_M). \tag{28}
$$

Given the expressions for the objective functions derived in Section 3.3, we can compute the gradient for an approximation to the lower bound of a mini-batch $\mathcal{B}_M$ and apply a stochastic gradient ascent algorithm to iteratively optimize the parameters $\phi$ and $\theta$. We can thus apply our WiSE-ALE algorithm efficiently to a mini-batch and learn a meaningful internal representation of the entire dataset. Algorithmically, WiSE-ALE is similar to AEVB, save for an alternate objective function as per Section 3.3.3. The procedural details of the algorithm are presented in Appendix C.
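The mini-batch estimate of Eqs. 27 and 28 amounts to one subtraction and one rescaling. The sketch below is only an illustration under stated assumptions: the per-sample reconstruction log-likelihoods and the KL estimate are taken as already computed numbers, and the function name `minibatch_objective` and its arguments are hypothetical rather than part of the paper.

```python
def minibatch_objective(recon_log_liks, kl_term, dataset_size):
    """Mini-batch estimate of the full-dataset objective.

    recon_log_liks: per-sample reconstruction log-likelihood estimates for
                    the M samples in the batch (first term of Eq. 27).
    kl_term:        estimate of KL[q(z|B_M) || p(z)] for the batch.
    dataset_size:   N, the number of samples in the full dataset.
    """
    m = len(recon_log_liks)
    if m == 0:
        raise ValueError("mini-batch must contain at least one sample")
    batch_bound = sum(recon_log_liks) - kl_term  # Eq. 27
    return dataset_size / m * batch_bound        # Eq. 28


# A batch of M = 2 samples drawn from a dataset of N = 10.
print(minibatch_objective([-1.0, -2.0], 0.5, 10))
```

In a training loop this value would be maximised, or its negation minimised, by a gradient-based optimiser; the constant factor $N/M$ changes only the scale of the gradient.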

4. Related Work

Bengio et al. (2013) propose that a learned representation of data should exhibit some general features, such as smoothness, sparsity and simplicity. These attributes are general, however, and are not tailored to any specific downstream tasks. Requirements from Bayesian decision making (see e.g. Lacoste-Julien et al., 2011; Cobb et al., 2018) add consideration of a target task and propose latent distribution approximations which optimise the performance over a particular task, as well as conforming to more general properties. The AEVB algorithm (Kingma & Welling, 2013) learns the latent posterior distribution under a reconstruction task, while simultaneously satisfying the prior, thus ensuring that the representation is smooth and compact. However, the prior form of the AEVB algorithm imposes significant influence on the solution space (as discussed in Section 3.4), and leads to a sacrifice of reconstruction quality. Our WiSE-ALE algorithm, however, prioritises the reconstruction task yet still enables globally desirable properties.

WiSE-ALE is, however, not the only algorithm that considers an alternate prior form to mitigate its impact on reconstruction quality. The Gaussian Mixture VAE (Dilokthanakul et al., 2016) uses a Gaussian mixture model to parameterise $p(z)$, encouraging more flexible sample posteriors. The Adversarial Auto-Encoder (AAE) (Makhzani et al., 2016) matches the aggregate posterior over the latent variables with a prior distribution through adversarial training. The WAE (Tolstikhin et al., 2017) minimises a penalised form of the Wasserstein distance between the aggregate posterior distribution and the prior, claiming a generalisation of the AAE algorithm under the theory of optimal transport (Villani, 2008). More recently, the Sinkhorn Auto-Encoder (Patrini et al., 2018) builds a formal analysis of auto-encoders using an optimal-transport-based prior and uses the Sinkhorn algorithm as an alternative to estimate the Wasserstein distance in WAE.

Our work differs from these in two main aspects. Firstly, our objective function can be evaluated analytically, leading to an efficient optimization process. In many of the above works, the optimization involves adversarial training and some hyper-parameter tuning, which leads to less efficient learning and slow or even no convergence. Secondly, our WiSE-ALE algorithm naturally finds a balance between good reconstruction quality and preferred latent representation properties, such as smoothness and compactness, as shown in Fig. 1(c). In contrast, some other works severely sacrifice the properties of smoothness and compactness for improved reconstruction quality, as shown in Fig. 1(b). Many works (Bloesch et al., 2018; Clark et al., 2018) have indicated that those properties of the learnt latent representation are essential for tasks that require optimisation over the latent space.

5. Experiments

We evaluate our WiSE-ALE algorithm in comparison with AEVB, β-VAE and WAE on the following 3 datasets. The implementation details for all experiments are given in Appendix E.

1. Sine Wave. We generated 200,000 sine waves with small random noise: x(t) = A sin(2πft + ϕ) + ε, each containing 256 samples, with independently sampled frequency f ∼ Unif(0, Hz), phase angle ϕ ∼ Unif(0, π) and amplitude A ∼ Unif(0, .
2. MNIST (LeCun, 1998). 70,000 × binary images that contain hand-written digits.
3. CelebA (Liu et al., 2015). 202,599 RGB images of aligned celebrity faces of × are cropped to square images of × and resized to × .

Throughout all experiments, our method has shown consistently superior reconstruction quality compared to AEVB, β-VAE and WAE. Fig. 5 offers a graphical comparison across the reconstructed samples given by different methods for the sine wave and CelebA datasets. For the sine wave dataset, our WiSE-ALE algorithm achieves almost perfect reconstruction, whereas AEVB and β-VAE often struggle with low-frequency signals and have difficulty predicting the amplitude correctly. For the CelebA dataset, our WiSE-ALE manages to predict much sharper human faces, whereas the AEVB predictions are often blurry and personal characteristics are often ignored. WAE reaches a similar level of reconstruction quality to ours in some images, but it sometimes struggles with discovering the right elevation and azimuth angles, as shown in the second to the right column in Fig. 5b.

Figure 5. Qualitative comparison of the reconstruction quality between our WiSE-ALE and other methods. (a) Reconstructed sine waves given by AEVB, β-VAE and our WiSE-ALE. (b) Reconstructed celebrity faces given by AEVB, WAE and our WiSE-ALE, shown with the ground truth (GT) (CelebA dataset).
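The sine-wave dataset from the list above could be generated along the following lines. This is only a minimal sketch: the upper limits of the frequency and amplitude ranges are not recoverable from the text here, so `f_max` and `a_max` are placeholder parameters, and the Gaussian noise model with `noise_std` is an assumption (the text only says "small random noise"). The 256 Hz sampling rate is taken from the appendix on experiment details.

```python
import math
import random


def make_sine_wave(f_max, a_max, n_samples=256, sample_rate=256.0,
                   noise_std=0.01, rng=random):
    """Draw one noisy sine wave x(t) = A sin(2*pi*f*t + phi) + eps.

    f_max, a_max and noise_std are hypothetical values, not the paper's.
    """
    f = rng.uniform(0.0, f_max)       # frequency in Hz
    phi = rng.uniform(0.0, math.pi)   # phase angle
    a = rng.uniform(0.0, a_max)       # amplitude
    return [
        a * math.sin(2.0 * math.pi * f * (n / sample_rate) + phi)
        + rng.gauss(0.0, noise_std)   # assumed Gaussian noise
        for n in range(n_samples)
    ]


# Small illustrative dataset (the paper generates 200,000 waves).
rng = random.Random(0)
dataset = [make_sine_wave(f_max=10.0, a_max=1.0, rng=rng) for _ in range(1000)]
```

Passing an explicit `random.Random` instance keeps the draw reproducible without touching the global random state.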

We understand that a good latent representation should not only reconstruct well, but also preserve some preferable qualities, such as smoothness, compactness and possibly meaningful interpretation of the original data. Fig. 1 indicates that our WiSE-ALE automatically learns a latent representation that finds a good tradeoff between minimizing the information loss and maintaining a smooth and compact aggregate posterior distribution. Furthermore, as shown in Fig. 6, we compare the ELBO values given by AEVB, β-VAE and our WiSE-ALE over training for the Sine dataset. Our WiSE-ALE manages to report the highest ELBO with a significantly lower reconstruction error and a fairly good performance in the KL divergence loss. This indicates that our WiSE-ALE is able to learn an overall good quality representation that is closest to the true latent distribution which gives rise to the data observation.

Figure 6. Comparison of training reconstruction error, AEVB KL divergence loss and ELBO given by AEVB, β-VAE (β = 2) and our WiSE-ALE methods on the sine wave dataset over 50 epochs (batch size = 64). The horizontal axis is the training iteration (in thousands).
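To make the three quantities plotted in Fig. 6 concrete, the sketch below shows how a per-sample AEVB KL term and ELBO relate, assuming a diagonal Gaussian posterior and a unit Gaussian prior as in the appendix. The function names are illustrative, and the reconstruction log-likelihood is passed in as a plain number because its exact form depends on the decoder, which is not specified at this point.

```python
import math


def aevb_kl(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for one sample, in closed form."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s)
                     for m, s in zip(mu, sigma))


def elbo(log_likelihood, mu, sigma, beta=1.0):
    """Per-sample ELBO: reconstruction log-likelihood minus beta * KL.

    beta = 1 gives the AEVB objective; beta = 2 matches the beta-VAE
    setting named in the Fig. 6 legend.
    """
    return log_likelihood - beta * aevb_kl(mu, sigma)


# A posterior equal to the prior carries no KL penalty.
kl_at_prior = aevb_kl([0.0, 0.0], [1.0, 1.0])
```

A lower reconstruction error (higher log-likelihood) and a lower KL both raise the ELBO, which is why the three panels are read together.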

6. Conclusion and Future Work

In this paper, we derive a variational lower bound to the data log likelihood, which allows us to impose a prior constraint on the bulk statistics of the aggregate posterior distribution for the entire dataset. Using an analytic approximation to this lower bound as our learning objective, we propose the WiSE-ALE algorithm. We have demonstrated its ability to achieve excellent reconstruction quality, as well as forming a smooth, compact and meaningful latent representation.

In the future, we are planning to analyse the error introduced in our approximation to our proposed lower bound. Further, we would like to investigate the potential to use the evaluation of the reconstruction likelihood term given by Eq. 15 as our learning objective, which will keep the lower bound property of our proposal and guarantee that our proposed posterior approaches the true posterior through optimisation.

References

Barlow, H. B. Unsupervised learning. Neural computation, 1(3):295–311, 1989.

Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8):1798–1828, August 2013. ISSN 0162-8828. doi: 10.1109/TPAMI.2013.50. URL http://dx.doi.org/10.1109/TPAMI.2013.50.

Bloesch, M., Czarnowski, J., Clark, R., Leutenegger, S., and Davison, A. J. Codeslam learning a compact, optimisable representation for dense visual slam. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

Clark, R., Bloesch, M., Czarnowski, J., Leutenegger, S., and Davison, A. J. Learning to solve nonlinear least squares for monocular stereo. In The European Conference on Computer Vision (ECCV), September 2018.

Cobb, A. D., Roberts, S. J., and Gal, Y. Loss-calibrated approximate inference in bayesian neural networks. CoRR, abs/1805.03901, 2018.

Dilokthanakul, N., Mediano, P. A., Garnelo, M., Lee, M. C., Salimbeni, H., Arulkumaran, K., and Shanahan, M. Deep unsupervised clustering with gaussian mixture variational autoencoders. arXiv preprint arXiv:1611.02648, 2016.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-vae: Learning basic visual concepts with a constrained variational framework. 2016.

Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. science, 313(5786):504–507, 2006.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013. URL http://arxiv.org/abs/1312.6114.

Lacoste-Julien, S., Huszar, F., and Ghahramani, Z. Approximate inference for the loss-calibrated bayesian. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11-13, 2011, pp. 416–424, 2011.

LeCun, Y. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998. URL https://ci.nii.ac.jp/naid/10027939599/en/.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV '15, pp. 3730–3738, Washington, DC, USA, 2015. IEEE Computer Society. ISBN 978-1-4673-8391-2. doi: 10.1109/ICCV.2015.425. URL http://dx.doi.org/10.1109/ICCV.2015.425.

Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. Adversarial autoencoders. In International Conference on Learning Representations, 2016. URL http://arxiv.org/abs/1511.05644.

Patrini, G., Carioni, M., Forré, P., Bhargav, S., Welling, M., van den Berg, R., Genewein, T., and Nielsen, F. Sinkhorn autoencoders. CoRR, abs/1810.01118, 2018.

Salakhutdinov, R. and Hinton, G. E. Deep boltzmann machines. In AISTATS, volume 5 of JMLR Proceedings, pp. 448–455. JMLR.org, 2009.

Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. pp. 368–377, 1999.

Tolstikhin, I. O., Bousquet, O., Gelly, S., and Schölkopf, B. Wasserstein auto-encoders. CoRR, abs/1711.01558, 2017.

Villani, C. Optimal transport – Old and new, volume 338, pp. xxii+973. 01 2008. doi: 10.1007/978-3-540-71050-9.

Appendix for WiSE-ALE

In this appendix, we omit the trainable parameters $\phi$ and $\theta$ in the expressions of distributions for simplicity. For example, $q(z|x)$ is equivalent to $q_\phi(z|x)$ and $p(x|z)$ represents $p_\theta(x|z)$.

A. Approximation of the Reconstruction Term

Here we demonstrate that the reconstruction term $\mathbb{E}_{q(z|\mathcal{D}_N)}\big[\log p(\mathcal{D}_N|z)\big]$ in our lower bound can be estimated with the individual sample likelihood $\log p(x^{(i)}|z)$ and how our reconstruction error term becomes the same as the reconstruction term in the AEVB objective.

Firstly, we can substitute

$$q(z|\mathcal{D}_N) = \frac{1}{N}\sum_{i=1}^{N} q(z|x^{(i)}) \tag{1}$$

into the reconstruction term $\mathbb{E}_{q(z|\mathcal{D}_N)}\big[\log p(\mathcal{D}_N|z)\big]$, i.e.

$$\begin{aligned}
\mathbb{E}_{q(z|\mathcal{D}_N)}\big[\log p(\mathcal{D}_N|z)\big]
&= \int q(z|\mathcal{D}_N)\,\log p(\mathcal{D}_N|z)\,\mathrm{d}z \\
&= \int \frac{1}{N}\sum_{i=1}^{N} q(z|x^{(i)})\,\log p(\mathcal{D}_N|z)\,\mathrm{d}z \\
&= \frac{1}{N}\sum_{i=1}^{N}\int q(z|x^{(i)})\,\log p(\mathcal{D}_N|z)\,\mathrm{d}z \\
&= \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{q(z|x^{(i)})}\big[\log p(\mathcal{D}_N|z)\big].
\end{aligned}$$

Now we can decompose the marginal likelihood of the entire dataset as a product of individual samples, due to the conditional independence, i.e.

$$\log p(\mathcal{D}_N|z) = \log \prod_{j=1}^{N} p(x^{(j)}|z) = \sum_{j=1}^{N}\log p(x^{(j)}|z).$$

Substituting this into the reconstruction term, we have:

$$\mathbb{E}_{q(z|\mathcal{D}_N)}\big[\log p(\mathcal{D}_N|z)\big] = \frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{q(z|x^{(i)})}\Bigg[\sum_{j=1}^{N}\log p(x^{(j)}|z)\Bigg].$$

To evaluate the reconstruction term in our lower bound, we need to do the following: 1) draw a sample $x^{(i)}$ from the dataset $\mathcal{D}_N$; 2) evaluate the latent code distribution $q(z|x^{(i)})$ through the encoder function $q(\cdot|x^{(i)})$; 3) draw samples of $z$ according to $q(z|x^{(i)})$; 4) reconstruct input samples using the sampled latent codes $z^{(l)}$; 5) compute the reconstruction error w.r.t. every single input sample and sum this error.

We can simplify the above evaluation. Firstly, the sampling process in Step 3 can be replaced by a sampling process at the input using the reparameterisation trick. Besides, the sum of reconstruction errors w.r.t. all the input samples can be further simplified. To do this, we need to re-arrange the above expression as

$$\mathbb{E}_{q(z|\mathcal{D}_N)}\big[\log p(\mathcal{D}_N|z)\big] = \sum_{i=1}^{N}\mathbb{E}_{q(z|x^{(i)})}\Bigg[\frac{1}{N}\sum_{j=1}^{N}\log p(x^{(j)}|z)\Bigg]$$

and apply Jensen inequality for the special case of log, i.e.

$$\log\Big(\frac{1}{N}\sum_{i=1}^{N} a_i\Big) \ge \frac{1}{N}\sum_{i=1}^{N}\log\big(a_i\big)$$

to the terms inside the expectation. As a result, we obtain an upper bound of the reconstruction error term as

$$\mathbb{E}_{q(z|\mathcal{D}_N)}\big[\log p(\mathcal{D}_N|z)\big] \le \sum_{i=1}^{N}\mathbb{E}_{q(z|x^{(i)})}\Bigg[\log\bigg(\frac{1}{N}\sum_{j=1}^{N} p(x^{(j)}|z)\bigg)\Bigg].$$

This upper bound can be evaluated more efficiently with the assumption that the likelihood $p(x^{(j)}|z)$, representing the probability of a reconstructed sample from a latent code $z$ imitating the sample $x^{(j)}$, will only be non-zero if $z$ is sampled from the embedding prediction distribution with the same sample $x^{(j)}$ at the encoder input. With this assumption, $N-1$ posterior distributions in the inner summation will be dropped as zeros and the only non-zero term is $p(x^{(i)}|z)$. Therefore, the upper bound becomes

$$\begin{aligned}
\mathbb{E}_{q(z|\mathcal{D}_N)}\big[\log p(\mathcal{D}_N|z)\big]
&\le \sum_{i=1}^{N}\mathbb{E}_{q(z|x^{(i)})}\Big[\log\Big(\frac{1}{N}\,p(x^{(i)}|z)\Big)\Big] \\
&= \sum_{i=1}^{N}\mathbb{E}_{q(z|x^{(i)})}\Big[\log p(x^{(i)}|z)\Big] - N\log N \\
&\approx \sum_{i=1}^{N}\mathbb{E}_{q(z|x^{(i)})}\Big[\log p(x^{(i)}|z)\Big].
\end{aligned}$$

The constant can be omitted, because it will not affect the gradient updates of the parameters.
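The two facts this appendix leans on can be checked numerically: the log form of Jensen's inequality, and the observation that keeping a single non-zero likelihood inside the average only shifts the log-likelihood by the constant log N. The snippet is a small illustration with made-up likelihood values, not part of the original derivation.

```python
import math


def log_of_mean(values):
    """log( (1/N) * sum_i a_i )"""
    return math.log(sum(values) / len(values))


def mean_of_log(values):
    """(1/N) * sum_i log(a_i)"""
    return sum(math.log(v) for v in values) / len(values)


# Hypothetical likelihood values p(x_j | z) for N = 4 samples.
likelihoods = [0.2, 0.05, 0.6, 0.01]
jensen_gap = log_of_mean(likelihoods) - mean_of_log(likelihoods)  # >= 0

# With only the matching term p(x_i | z) kept, the inner average becomes
# (1/N) * p(x_i | z), i.e. log p(x_i | z) minus the constant log N.
n = len(likelihoods)
p_i = 0.6
single_term = math.log(p_i / n)
shifted = math.log(p_i) - math.log(n)
```

Because the shift `log N` does not depend on the parameters, dropping it leaves the gradients unchanged, which is the point made in the last sentence above.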

B. An Upper Bound Approximation of the KL Term

$$\begin{aligned}
D_{KL}\big[q(z|\mathcal{D}_N)\,\|\,p(z)\big]
&= \int q(z|\mathcal{D}_N)\Big(\log q(z|\mathcal{D}_N) - \log p(z)\Big)\mathrm{d}z \\
&= \int \frac{1}{N}\sum_{i=1}^{N} q(z|x^{(i)})\Big(\log q(z|\mathcal{D}_N) - \log p(z)\Big)\mathrm{d}z \\
&= \frac{1}{N}\sum_{i=1}^{N}\int q(z|x^{(i)})\Big(\log q(z|\mathcal{D}_N) - \log p(z)\Big)\mathrm{d}z \\
&= \frac{1}{N}\sum_{i=1}^{N}\Big(\mathbb{E}_{q(z|x^{(i)})}\big[\log q(z|\mathcal{D}_N)\big] - \mathbb{E}_{q(z|x^{(i)})}\big[\log p(z)\big]\Big).
\end{aligned}$$

Applying Jensen inequality, i.e.

$$\mathbb{E}_{x}\big[\log f(x)\big] \le \log \mathbb{E}_{x}\big[f(x)\big], \tag{2}$$

to the first term of the above equation, we have

$$D_{KL}\big[q(z|\mathcal{D}_N)\,\|\,p(z)\big] \le \frac{1}{N}\sum_{i=1}^{N}\Big(\log \mathbb{E}_{q(z|x^{(i)})}\big[q(z|\mathcal{D}_N)\big]\Big) - \frac{1}{N}\sum_{i=1}^{N}\Big(\mathbb{E}_{q(z|x^{(i)})}\big[\log p(z)\big]\Big).$$

We will look at the two summations individually. The expectation w.r.t. the aggregate posterior can be expanded as

$$\mathbb{E}_{q(z|x^{(i)})}\big[q(z|\mathcal{D}_N)\big] = \int q(z|x^{(i)})\,\frac{1}{N}\sum_{j=1}^{N} q(z|x^{(j)})\,\mathrm{d}z = \frac{1}{N}\sum_{j=1}^{N}\int q(z|x^{(i)})\,q(z|x^{(j)})\,\mathrm{d}z.$$

We assume the posterior distribution of the latent code $z$ given a specific input sample $x^{(i)}$ is a diagonal Gaussian, i.e.

$$q(z|x^{(i)}) = \mathcal{N}\Big(z \,\Big|\, \mu^{(i)}, (\sigma^{(i)})^2\Big) = \prod_{k=1}^{d_z}\mathcal{N}\Big(z_k \,\Big|\, \mu^{(i)}_k, (\sigma^{(i)}_k)^2\Big). \tag{3}$$

Similarly,

$$q(z|x^{(j)}) = \mathcal{N}\Big(z \,\Big|\, \mu^{(j)}, (\sigma^{(j)})^2\Big) = \prod_{k=1}^{d_z}\mathcal{N}\Big(z_k \,\Big|\, \mu^{(j)}_k, (\sigma^{(j)}_k)^2\Big).$$

Therefore,

$$\begin{aligned}
\mathbb{E}_{q(z|x^{(i)})}\big[q(z|\mathcal{D}_N)\big]
&= \frac{1}{N}\sum_{j=1}^{N}\int \prod_{k=1}^{d_z}\mathcal{N}\Big(z_k \,\Big|\, \mu^{(i)}_k, (\sigma^{(i)}_k)^2\Big)\prod_{k=1}^{d_z}\mathcal{N}\Big(z_k \,\Big|\, \mu^{(j)}_k, (\sigma^{(j)}_k)^2\Big)\prod_{k=1}^{d_z}\mathrm{d}z_k \\
&= \frac{1}{N}\sum_{j=1}^{N}\prod_{k=1}^{d_z}\int \mathcal{N}\Big(z_k \,\Big|\, \mu^{(i)}_k, (\sigma^{(i)}_k)^2\Big)\,\mathcal{N}\Big(z_k \,\Big|\, \mu^{(j)}_k, (\sigma^{(j)}_k)^2\Big)\mathrm{d}z_k.
\end{aligned}$$

Substituting the exponential form for the Gaussian distribution, i.e.

$$\mathcal{N}\Big(z_k \,\Big|\, \mu^{(i)}_k, (\sigma^{(i)}_k)^2\Big) = \frac{1}{\sqrt{2\pi(\sigma^{(i)}_k)^2}}\exp\Bigg(-\frac{(z_k-\mu^{(i)}_k)^2}{2(\sigma^{(i)}_k)^2}\Bigg), \tag{4}$$

into the above equation, we have

$$\begin{aligned}
\mathbb{E}_{q(z|x^{(i)})}\big[q(z|\mathcal{D}_N)\big]
&= \frac{1}{N}\sum_{j=1}^{N}\prod_{k=1}^{d_z}\int \frac{1}{2\pi\sigma^{(i)}_k\sigma^{(j)}_k}\exp\Bigg(-\frac{(z_k-\mu^{(i)}_k)^2}{2(\sigma^{(i)}_k)^2}-\frac{(z_k-\mu^{(j)}_k)^2}{2(\sigma^{(j)}_k)^2}\Bigg)\mathrm{d}z_k \\
&= \frac{1}{N}\sum_{j=1}^{N}\prod_{k=1}^{d_z}\frac{1}{2\pi\sigma^{(i)}_k\sigma^{(j)}_k}\int \exp\Bigg(-\frac{(z_k-\mu^{(i)}_k)^2}{2(\sigma^{(i)}_k)^2}-\frac{(z_k-\mu^{(j)}_k)^2}{2(\sigma^{(j)}_k)^2}\Bigg)\mathrm{d}z_k.
\end{aligned}$$

The exponent of the above equation can be simplified to

$$-\frac{(z_k-\mu^{(i)}_k)^2}{2(\sigma^{(i)}_k)^2}-\frac{(z_k-\mu^{(j)}_k)^2}{2(\sigma^{(j)}_k)^2} = -\Bigg(\frac{1}{2(\sigma^{(i)}_k)^2}+\frac{1}{2(\sigma^{(j)}_k)^2}\Bigg)z_k^2 + \Bigg(\frac{\mu^{(i)}_k}{(\sigma^{(i)}_k)^2}+\frac{\mu^{(j)}_k}{(\sigma^{(j)}_k)^2}\Bigg)z_k - \Bigg(\frac{(\mu^{(i)}_k)^2}{2(\sigma^{(i)}_k)^2}+\frac{(\mu^{(j)}_k)^2}{2(\sigma^{(j)}_k)^2}\Bigg).$$

Using the following property, i.e.

$$\int_{-\infty}^{\infty}\exp\big(-ax^2+bx\big)\,\mathrm{d}x = \sqrt{\frac{\pi}{a}}\exp\Big(\frac{b^2}{4a}\Big) \quad (a > 0), \tag{5}$$

we can evaluate the integral needed for $\mathbb{E}_{q(z|x^{(i)})}\big[q(z|\mathcal{D}_N)\big]$ as

$$\frac{1}{2\pi\sigma^{(i)}_k\sigma^{(j)}_k}\int \exp\Bigg(-\frac{(z_k-\mu^{(i)}_k)^2}{2(\sigma^{(i)}_k)^2}-\frac{(z_k-\mu^{(j)}_k)^2}{2(\sigma^{(j)}_k)^2}\Bigg)\mathrm{d}z_k = \frac{1}{\sqrt{2\pi\big((\sigma^{(i)}_k)^2+(\sigma^{(j)}_k)^2\big)}}\exp\Bigg(-\frac{\big(\mu^{(i)}_k-\mu^{(j)}_k\big)^2}{2\big((\sigma^{(i)}_k)^2+(\sigma^{(j)}_k)^2\big)}\Bigg).$$

Therefore, we have obtained the expression for the first term in our upper bound, i.e.

$$\frac{1}{N}\sum_{i=1}^{N}\Big(\log \mathbb{E}_{q(z|x^{(i)})}\big[q(z|\mathcal{D}_N)\big]\Big) \tag{6}$$

$$= \frac{1}{N}\sum_{i=1}^{N}\log\Bigg(\frac{1}{N}\sum_{j=1}^{N}\prod_{k=1}^{d_z}\frac{1}{\sqrt{2\pi\big((\sigma^{(i)}_k)^2+(\sigma^{(j)}_k)^2\big)}}\exp\Bigg(-\frac{\big(\mu^{(i)}_k-\mu^{(j)}_k\big)^2}{2\big((\sigma^{(i)}_k)^2+(\sigma^{(j)}_k)^2\big)}\Bigg)\Bigg). \tag{7}$$

To find the expression for the second term $\frac{1}{N}\sum_{i=1}^{N}\Big(\mathbb{E}_{q(z|x^{(i)})}\big[\log p(z)\big]\Big)$, we first examine the prior distribution $p(z)$, which is chosen to be a zero-mean unit-variance Gaussian across all latent code dimensions, i.e.

$$p(z) = \mathcal{N}\big(z \,\big|\, 0, I\big) = \prod_{k=1}^{d_z}\mathcal{N}\big(z_k \,\big|\, 0, 1\big). \tag{8}$$

Therefore,

$$\log p(z) = \sum_{k=1}^{d_z}\log \mathcal{N}\big(z_k \,\big|\, 0, 1\big) = -\frac{1}{2}\sum_{k=1}^{d_z}\Big(\log\big(2\pi\big)+z_k^2\Big). \tag{9}$$

Substituting this expression for $\log p(z)$ into $\frac{1}{N}\sum_{i=1}^{N}\Big(\mathbb{E}_{q(z|x^{(i)})}\big[\log p(z)\big]\Big)$ and examining the expectation term for now, we have

$$\begin{aligned}
\mathbb{E}_{q(z|x^{(i)})}\big[\log p(z)\big]
&= \sum_{k=1}^{d_z}\mathbb{E}_{q(z|x^{(i)})}\big[\log p(z_k)\big] \\
&= \sum_{k=1}^{d_z}\mathbb{E}_{q(z_k|x^{(i)})\,q(z_{\setminus k}|x^{(i)})}\big[\log p(z_k)\big] \\
&= \sum_{k=1}^{d_z}\mathbb{E}_{q(z_k|x^{(i)})}\big[\log p(z_k)\big] \\
&= \sum_{k=1}^{d_z}\int q(z_k|x^{(i)})\,\log p(z_k)\,\mathrm{d}z_k \\
&= -\frac{1}{2}\sum_{k=1}^{d_z}\int q(z_k|x^{(i)})\Big(\log\big(2\pi\big)+z_k^2\Big)\mathrm{d}z_k \\
&= -\frac{1}{2}\sum_{k=1}^{d_z}\bigg(\log\big(2\pi\big)\int q(z_k|x^{(i)})\,\mathrm{d}z_k + \int q(z_k|x^{(i)})\,z_k^2\,\mathrm{d}z_k\bigg).
\end{aligned}$$

The first integral is $\int q(z_k|x^{(i)})\,\mathrm{d}z_k = 1$. To evaluate the second integral, we substitute Equation (4) and use the following properties, i.e.

$$\int_{-\infty}^{\infty}\exp\big(-ax^2\big)\,\mathrm{d}x = \sqrt{\frac{\pi}{a}}, \quad a > 0 \tag{10}$$

$$\int_{-\infty}^{\infty}x\exp\big(-a(x-b)^2\big)\,\mathrm{d}x = b\sqrt{\frac{\pi}{a}}, \quad \mathrm{Re}(a) > 0 \tag{11}$$

$$\int_{-\infty}^{\infty}x^2\exp\big(-ax^2\big)\,\mathrm{d}x = \frac{1}{2}\sqrt{\frac{\pi}{a^3}}, \quad a > 0. \tag{12}$$

As a result, we have

$$\int q(z_k|x^{(i)})\,z_k^2\,\mathrm{d}z_k = (\sigma^{(i)}_k)^2 + (\mu^{(i)}_k)^2.$$

Therefore,

$$\mathbb{E}_{q(z|x^{(i)})}\big[\log p(z)\big] = -\frac{1}{2}\sum_{k=1}^{d_z}\bigg((\sigma^{(i)}_k)^2 + (\mu^{(i)}_k)^2 + \log 2\pi\bigg) \tag{13}$$

$$\frac{1}{N}\sum_{i=1}^{N}\mathbb{E}_{q(z|x^{(i)})}\big[\log p(z)\big] = -\frac{1}{2N}\sum_{i=1}^{N}\sum_{k=1}^{d_z}\bigg((\sigma^{(i)}_k)^2 + (\mu^{(i)}_k)^2 + \log 2\pi\bigg). \tag{14}$$

Combining the first term defined in Equation (6) and the second term defined in Equation (13), we have obtained the expression for the overall upper bound as

$$\begin{aligned}
D_{KL}\big[q(z|\mathcal{D}_N)\,\|\,p(z)\big] \le{}& \frac{1}{N}\sum_{i=1}^{N}\log\Bigg(\frac{1}{N}\sum_{j=1}^{N}\prod_{k=1}^{d_z}\frac{1}{\sqrt{2\pi\big((\sigma^{(i)}_k)^2+(\sigma^{(j)}_k)^2\big)}}\exp\Bigg(-\frac{\big(\mu^{(i)}_k-\mu^{(j)}_k\big)^2}{2\big((\sigma^{(i)}_k)^2+(\sigma^{(j)}_k)^2\big)}\Bigg)\Bigg) \\
&+ \frac{1}{2N}\sum_{i=1}^{N}\sum_{k=1}^{d_z}\bigg((\sigma^{(i)}_k)^2 + (\mu^{(i)}_k)^2 + \log 2\pi\bigg).
\end{aligned}$$

C. WiSE Algorithm
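As a companion to the algorithm in this section, the KL upper bound from Appendix B could be evaluated on a mini-batch roughly as follows. This is a plain-Python sketch for illustration only: the function names are hypothetical, inputs are lists of per-sample mean and standard-deviation vectors, and the direct product over latent dimensions is kept for readability even though a log-sum-exp formulation would likely be more stable numerically for large latent dimensions.

```python
import math


def gaussian_overlap(mu_i, sigma_i, mu_j, sigma_j):
    """Closed form of the integral over z of N(z|mu_i, sigma_i^2) * N(z|mu_j, sigma_j^2)."""
    var = sigma_i ** 2 + sigma_j ** 2
    return math.exp(-(mu_i - mu_j) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def kl_upper_bound(mus, sigmas):
    """Upper bound on KL(aggregate posterior || N(0, I)) for a batch.

    mus, sigmas: lists of N vectors (each of length d_z).
    """
    n = len(mus)
    first = 0.0   # sum_i log( (1/N) sum_j prod_k overlap )
    second = 0.0  # sum_i sum_k (sigma^2 + mu^2 + log 2*pi)
    for mu_i, sig_i in zip(mus, sigmas):
        inner = 0.0
        for mu_j, sig_j in zip(mus, sigmas):
            prod = 1.0
            for m_i, s_i, m_j, s_j in zip(mu_i, sig_i, mu_j, sig_j):
                prod *= gaussian_overlap(m_i, s_i, m_j, s_j)
            inner += prod
        first += math.log(inner / n)
        second += sum(s * s + m * m + math.log(2.0 * math.pi)
                      for m, s in zip(mu_i, sig_i))
    return first / n + second / (2.0 * n)


# Two 2-D posteriors placed symmetrically around the origin.
bound = kl_upper_bound([[-0.5, 0.0], [0.5, 0.0]], [[0.8, 1.0], [0.8, 1.0]])
```

For a single sample whose posterior equals the prior, the bound evaluates to (1 − log 2)/2 rather than 0, which illustrates the slack introduced by the Jensen step.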

Algorithm 1 WiSE algorithm. Either $\mathcal{L}^{\mathrm{WiSE}}_{\mathrm{approx}}(\phi, \theta; B_M)$ defined in Eq. 19 in Section 3.5 can be used as the learning objective function.

  $\phi, \theta \leftarrow$ Initialize parameters
  repeat
    $B_M \leftarrow$ Draw $M$ random samples from $\mathcal{D}_N$
    $\epsilon \leftarrow \mathcal{N}(0, I)$
    Apply reparameterisation trick so that $z \sim q_\phi(z|x^{(i)})$ becomes $z = \mu^{(i)} + \epsilon \odot \sigma^{(i)}$
    $g \leftarrow \nabla_{\phi,\theta}\,\mathcal{L}^{\mathrm{WiSE}}_{\mathrm{approx}}(\phi, \theta; B_M)$ (compute the gradient)
    $\phi, \theta \leftarrow$ Update parameters using $g$ according to AdamOptimizer
  until convergence of the objective function or end of iterations
  return $\phi, \theta$

D. Experiment Details

We carry out experiments on four datasets (Sine wave, MNIST, Teapot and CelebA) to examine different properties of the latent representation learnt from the proposed WiSE algorithm. Specifically, we compare with β-VAE on the smoothness and disentanglement of the learnt representation and compare with WAE and AEVB on the reconstruction quality. In addition, by learning a 2D embedding of the MNIST dataset, we are able to visualise the latent embedding distributions learnt from AEVB, β-VAE, WAE and our WiSE and compare the compactness and smoothness of the learnt latent space across these methods. Here we give the implementation details for each dataset.

D.1. Sine Wave

We aim to learn a latent representation in R for a one second long sine wave with sampling rate of 256Hz. The network architecture for the Sine wave dataset is shown below. x is an input sample, µ and σ are the latent code mean and latent code standard deviation that define the embedding distribution q(z|x), and x̂ is the reconstructed input sample. ε is an auxiliary variable drawn from a unit Gaussian at the input of the encoder network so that an estimate of a sample from the embedding distribution q(z|x) can be computed. Conv m × n k denotes a convolution operation with k filters each of size m × n. TransposedConv m × n k (stride = (a,b)) denotes a transposed convolution with k filters each of size m × n and strides of a and b for the sliding window. FC k denotes a fully connected layer with output in R k. Reshape b a denotes reshaping a variable from dimension a to dimension b. ReLU denotes rectified linear units.

Encoder network:
x ∈ R × → Conv × (stride = 2) → ReLU → Conv × (stride = 2) → ReLU → Conv × (stride = 2) → ReLU → Conv × (stride = 2) → ReLU → Conv × (stride = 2) → ReLU → FC → FC ⇒ µ ∈ R & FC → ReLU ⇒ σ ∈ R

Decoder network:
z = µ + ε ⊙ σ → FC → ReLU → Reshape × × → Conv × → ReLU → TransposedConv × (stride = (4,1)) → ReLU → TransposedConv × (stride = (4,1)) → ReLU → TransposedConv × (stride = (4,1)) → ReLU → TransposedConv × (stride = (4,1)) → ReLU → Reshape × × × ⇒ x̂ ∈ R ×

We use the following hyper-parameters to train the network:

| Batch size | Number of epochs | Optimizer | Learning rate | Padding |
|---|---|---|---|---|
| 64 | 50 | Adam | × − | SAME |

D.2. MNIST

We aim to learn a 2D embedding of the MNIST dataset. The network architecture is shown below.

Encoder network:
x ∈ R × × → Conv × (stride = 2) → ReLU → Conv × (stride = 2) → ReLU → Conv × (stride = 2) → ReLU → FC → FC ⇒ µ ∈ R & FC → ReLU ⇒ σ ∈ R

Decoder network:
z = µ + ε ⊙ σ → FC → ReLU → FC → ReLU → FC → Sigmoid → Reshape × × ⇒ x̂ ∈ R × ×

We use the following hyper-parameters to train the network:

| Batch size | Number of epochs | Optimizer | Learning rate | Padding |
|---|---|---|---|---|
| 64 | 30 | Adam | × − | SAME |

D.3. CelebA

We implement our WiSE and AEVB on the same encoder and decoder network used in WAE in order to compare the reconstruction quality of our method with AEVB and WAE. The network architecture and training parameters are stated below.

Encoder network:
x ∈ R × × → Conv × (stride = 2) → ReLU → Conv × (stride = 2) → ReLU → Conv × (stride = 2) → ReLU → Conv × (stride = 2) → ReLU → FC ⇒ µ ∈ R & FC ⇒ σ ∈ R

Decoder network:
z = µ + ε ⊙ σ → FC × → ReLU → Reshape × × × → TransposedConv × (stride = (2,2)) → BN → ReLU → TransposedConv × (stride = (2,2)) → BN → ReLU → TransposedConv × (stride = (2,2)) → BN → ReLU → TransposedConv × (stride = (1,1)) ⇒ x̂ ∈ R × ×

We use the following hyper-parameters to train the network:

Batch size Number of epochs Optimizer Learning rate Padding

256 50 Adam × − at the start, . × −4