On Latent Distributions Without Finite Mean in Generative Models
Damian Leśniak∗   Igor Sieradzki∗   Igor Podolak
Jagiellonian University
Abstract
We investigate the properties of multidimensional probability distributions in the context of latent space prior distributions of implicit generative models. Our work revolves around the phenomena arising while decoding linear interpolations between two random latent vectors – regions of latent space in close proximity to the origin of the space are sampled, causing distribution mismatch. We show that due to the Central Limit Theorem, this region is almost never sampled during the training process. As a result, linear interpolations may generate unrealistic data and their usage as a tool to check the quality of the trained model is questionable. We propose to use the multidimensional Cauchy distribution as the latent prior. The Cauchy distribution does not satisfy the assumptions of the CLT and has a number of properties that allow it to work well in conjunction with linear interpolations. We also provide two general methods of creating non-linear interpolations that are easily applicable to a large family of common latent distributions. Finally, we empirically analyze the quality of data generated from low-probability-mass regions for the DCGAN model on the CelebA dataset.
∗ These two authors contributed equally. Preprint. Work in progress.

Generative latent variable models have grown to be a very popular research topic, with Variational Auto-Encoders (VAEs) Kingma and Welling [2013] and Generative Adversarial Networks (GANs) Goodfellow et al. [2014] gaining a lot of research interest in the last few years. VAEs use a stochastic encoder network to embed input data in a typically lower dimensional space, using a conditional probability distribution p(z | x) over possible latent space codes z ∈ R^D. A stochastic decoder network is then used to reconstruct the original sample. GANs, on the other hand, use a generator network that creates data samples from noise samples z ∼ p(z), where p(z) is a fixed prior distribution, and train a discriminator network jointly to distinguish between real and "fake" (i.e. generated) data.
Both of those model families use a specific prior distribution on the latent space. In those models the latent codes aim to "explain" the underlying features of the real distribution p(x) without explicit access to it. One would expect a well-trained probabilistic model to encode the properties of the data. Typical priors for those latent codes are the multidimensional standard normal distribution N(0, I) or the uniform distribution on a hypercube [−1, 1]^D.
A linear interpolation between two latent vectors x_1, x_2 is formally defined as a function
f^L_{x_1,x_2} : [0, 1] ∋ λ ↦ (1 − λ)x_1 + λx_2,
which may be understood as a traversal along the shortest path between these two endpoints. We are interested in decoding data for several values of λ and inspecting how smooth the transition between the decoded data points is. Linear interpolations were utilized in previous work on generative models, mainly to show that the learned models do not overfit Kingma and Welling [2013], Goodfellow et al. [2014], Dumoulin et al.
[2016], and that the latent space is able to capture the semantic content of the data Radford et al. [2015], Donahue et al. [2016]. Linear interpolations can also be thought of as a special case of vector algebra in the code space, similarly to the work done in word embeddings Mikolov et al. [2013].
While considered useful, linear interpolations used in conjunction with the most popular latent distributions are prone to traverse low probability mass regions. In high dimensions, norms of vectors drawn from the latent distribution are concentrated around a certain value. Thus latent vectors are found near the surface of a sphere, which results in the latent space distribution resembling a soap bubble [Fer]. This is explained using the Central Limit Theorem (CLT), which we show in Section 2. Linear interpolations pass through the inside of the sphere with high enough probability to drastically change the distribution of interpolated points in comparison to the prior distribution. This was reported Kilcher et al. [2017] to result in flawed data generation.
Some approaches to counteract this phenomenon were proposed: White [2016] recommended using spherical interpolations to avoid traversing unlikely regions; Agustsson et al. [2017] suggest normalizing the norms of the points along the interpolation to match the prior distribution; Kilcher et al. [2017] propose using a modified prior distribution, which saturates the origin of the latent space. Arvanitidis et al. [2017] give an interesting discussion on latent space traversal using the theory of Riemannian spaces.
Firstly, we propose to use the Cauchy distribution as the prior in generative models. This results in points along linear interpolations being distributed identically to those sampled from the prior. This is possible because Cauchy distributed noise does not satisfy the assumptions of the CLT. Furthermore, we present two general ways of defining non-linear interpolations for a given latent distribution.
Similarly, we are able to force points along interpolations to be distributed according to the prior.
Lastly, we show that the DCGAN Radford et al. [2015] model trained on the CelebA Liu et al. [2015] dataset is able to generate sensible images from the region near the supposedly "empty" origin of the latent space. This is contrary to what has been reported so far, and we further empirically investigate this result by evaluating the model trained with specific pathological distributions.
The normal distribution with mean µ and variance σ² is denoted by N(µ, σ²), the uniform distribution on the interval [a, b] is denoted by U(a, b), and the Cauchy distribution with location µ and scale γ is denoted by C(µ, γ). If not stated otherwise, the normal distribution has mean zero and variance one, the uniform distribution is defined on the interval [−1, 1], and the Cauchy distribution has location zero and scale one.
The dimension of the latent space is denoted by D. Multidimensional random variables are written in bold, e.g. Z. Lower indices denote coordinates of multidimensional random variables, e.g. Z = (Z_1, . . . , Z_D). Upper indices denote independent samples from the same distribution, e.g. Z^(1), Z^(2), . . . , Z^(n). If not stated otherwise, D-dimensional distributions are defined as products of D one-dimensional independent equal distributions. The norm used in this work is always the Euclidean norm.
Let us assume that we want to train a generative model which has a D-dimensional latent space and a fixed latent probability distribution defined by a random variable Z = (Z_1, Z_2, . . . , Z_D). Z_1, Z_2, . . . , Z_D are the independent marginal distributions, and let Z denote a one-dimensional random variable distributed identically to every Z_j, where j = 1, . . . , D.
For example, if Z ∼ U(−1, 1), then Z is distributed uniformly on the hypercube [−1, 1]^D; if Z ∼ N(0, 1), then Z is distributed according to the D-dimensional normal distribution with mean 0 and identity covariance matrix.
In the aforementioned cases we observe the so-called soap bubble phenomenon – the values sampled from Z are concentrated close to a (D − 1)-dimensional sphere, contrary to the low-dimensional intuition.
Observation 2.1.
Let us assume that Z² has finite mean µ and finite variance σ². Then ‖Z‖ approximates the normal distribution with mean √(Dµ) and variance σ²/(4µ).
Sketch of proof. Recall that ‖Z‖² = Z_1² + . . . + Z_D². If Z_1, . . . , Z_D are independent and distributed identically to Z, then Z_1², . . . , Z_D² are independent and distributed identically to Z². Using the central limit theorem we know that for large D
√D · ( (Z_1² + . . . + Z_D²)/D − µ ) ≈ N(0, σ²),
from which it follows that
(Z_1² + . . . + Z_D²)/D ≈ N(µ, σ²/D),
and thus we can approximate the squared norm of Z as
‖Z‖² = Z_1² + . . . + Z_D² ≈ N(Dµ, Dσ²).
Due to the nature of the convergence in distribution, dividing or multiplying both sides by factors √D or D that tend to infinity does not break the approximation. The final step is to take the square root of both random variables. In proximity of Dµ, the square root behaves approximately like scaling with the constant (2√(Dµ))⁻¹. Additionally, N(Dµ, Dσ²) has width proportional to √D, so we may apply an affine transformation to the normal distribution to approximate the square root for large D, which in the end gives us
‖Z‖ ≈ N(√(Dµ), σ²/(4µ)).
An application of this observation to the two most common latent space distributions:
◦ if Z ∼ N(0, 1), then Z² has moments µ = 1, σ² = 2, thus ‖Z‖ ≈ N(√D, 1/2),
◦ if Z ∼ U(−1, 1), then Z² has moments µ = 1/3, σ² = 4/45, thus ‖Z‖ ≈ N(√(D/3), 1/15).
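Observation 2.1 is straightforward to check numerically. The following NumPy sketch (our own illustration, not code from the paper, with an assumed sample size) draws from both priors in D = 100 and compares the empirical norm statistics against the predicted values:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 100, 10_000

# Normal prior: Observation 2.1 predicts ||Z|| ≈ N(sqrt(D), 1/2).
z = rng.standard_normal((n, D))
norms = np.linalg.norm(z, axis=1)
print(norms.mean())  # close to sqrt(100) = 10
print(norms.var())   # close to 1/2

# Uniform prior on [-1, 1]^D: the prediction is N(sqrt(D/3), 1/15).
u = rng.uniform(-1.0, 1.0, size=(n, D))
u_norms = np.linalg.norm(u, axis=1)
print(u_norms.mean())  # close to sqrt(100/3) ≈ 5.77
print(u_norms.var())   # close to 1/15 ≈ 0.067
```

Note that the standard deviation of the norm stays of order one while the mean grows like √D, which is the soap bubble effect.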
It is worth noting that the variance of the norm does not depend on D, which means that the distribution does not converge to the uniform distribution on the sphere of radius √(Dµ). Another fact worth noting is that the D-dimensional normal distribution with identity covariance matrix is isotropic, hence this distribution resembles the uniform distribution on a sphere. On the other hand, the uniform distribution on the hypercube [−1, 1]^D is concentrated in close proximity to the surface of the sphere, but has regions of high density corresponding to directions defined by the hypercube's vertices.
Now let us assume that we want to randomly draw two latent samples and interpolate linearly between them. We denote the two independent draws by Z^(1) and Z^(2). Let us examine the distribution of the random variable Z̄ := (Z^(1) + Z^(2))/2. Z̄ is the distribution of the middle points of a linear interpolation between two vectors drawn independently from Z. If the generative model was trained on noise sampled from Z and if the distribution of Z̄ differs from Z, then data decoded from samples drawn from Z̄ might be unrealistic, as such samples were never seen during training. One way to prevent this issue is to find Z such that Z̄ is distributed identically to Z.
Observation 2.2. If Z has a finite mean and Z, Z̄ are identically distributed, then Z must be concentrated at a single point.
Sketch of proof. Using induction on n we can show that for all n ∈ N the average of 2ⁿ independent samples from Z is distributed equally to Z. On the other hand, if n → ∞, then the average distribution tends to E[Z]. Thus Z must be concentrated at E[Z].
There have been attempts to find Z with finite mean such that Z̄ is at least similar to Z Kilcher et al. [2017], where similarity was measured with the Kullback-Leibler divergence between the distributions.
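The midpoint mismatch can also be illustrated directly. In this sketch (ours, with an assumed sample size), midpoints of pairs of N(0, I) draws concentrate near √(D/2), well inside the prior's "bubble" at radius √D:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 100, 10_000

# Two independent draws from the N(0, I) prior and their interpolation midpoints.
z1 = rng.standard_normal((n, D))
z2 = rng.standard_normal((n, D))
mid = 0.5 * (z1 + z2)          # Z-bar = (Z1 + Z2) / 2 ~ N(0, I/2)

prior_norms = np.linalg.norm(z1, axis=1)
mid_norms = np.linalg.norm(mid, axis=1)

print(prior_norms.mean())  # ≈ sqrt(D) = 10
print(mid_norms.mean())    # ≈ sqrt(D/2) ≈ 7.07: midpoints fall inside the bubble
```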
We extend this idea by using a specific distribution that has no finite mean, namely the multidimensional Cauchy distribution.
Let us start with a short review of useful properties of the Cauchy distribution in the one-dimensional case. Let C ∼ C(0, 1). Then:
1. The probability density function of C is equal to 1/(π(1 + x²)).
2. For C all moments of order greater than or equal to one are undefined. The location parameter should not be confused with the mean.
3. If C^(1) and C^(2) are independent and distributed identically to C, then (C^(1) + C^(2))/2 is distributed identically to C. Furthermore, if λ ∈ [0, 1], then λC^(1) + (1 − λ)C^(2) is also distributed identically to C.
4. If C^(1), . . . , C^(n) are independent and distributed identically to C, and λ_1, . . . , λ_n ∈ [0, 1] with λ_1 + . . . + λ_n = 1, then λ_1 C^(1) + . . . + λ_n C^(n) is distributed identically to C.
Those are well-known facts about the Cauchy distribution, and proving them is a common exercise in statistics textbooks. However, to the best of our knowledge, the Cauchy distribution has never been used in the context of generative models. With this in mind, the most important take-away is the following observation:
Observation 2.3. If Z is distributed according to the D-dimensional Cauchy distribution, then a linear interpolation between any number of latent points does not change the distribution.
Sketch of proof. Let C ∼ C(0, 1) and Z = (Z_1, . . . , Z_D). The variables Z_1, . . . , Z_D are independent and distributed equally to C. If Z^(1), . . . , Z^(n) are independent and distributed identically to Z, and λ_1, . . . , λ_n ∈ [0, 1] with λ_1 + . . . + λ_n = 1 are fixed, then λ_1 Z^(1)_j + . . . + λ_n Z^(n)_j is distributed equally to Z_j for j = 1, . . . , D, thus λ_1 Z^(1) + . . . + λ_n Z^(n) is distributed equally to Z.
We observed that the normal and uniform distributions are concentrated around a sphere with radius proportional to √D.
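Property 3 above (and hence Observation 2.3 coordinate-wise) can be sanity-checked empirically. Since the moments of the Cauchy distribution are undefined, this small sketch of ours compares robust statistics instead: the standard Cauchy has median 0 and quartiles ±1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
lam = 0.3

c1 = rng.standard_cauchy(n)
c2 = rng.standard_cauchy(n)
mix = lam * c1 + (1 - lam) * c2   # property 3: again standard Cauchy

# Moments are undefined, so compare quantiles rather than mean/variance.
q25, q50, q75 = np.quantile(mix, [0.25, 0.5, 0.75])
print(q25, q50, q75)  # ≈ -1, 0, 1
```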
On the other hand, the multidimensional Cauchy distribution fills the latent space. It should be noted that for the D-dimensional Cauchy distribution the region near the origin of the latent space is empty – similarly to the normal and uniform distributions.
Figure 1 shows a comparison between approximations of density functions of ‖Z‖ for multidimensional normal, uniform and Cauchy distributions, and a distribution proposed by Kilcher et al. [2017].
The one-dimensional Cauchy distribution has heavy tails, hence we can expect that one of the coordinates of Z^(i) will usually be sufficiently larger (by absolute value) than the others. This could potentially have a negative impact on training of the GAN model, but we did not observe such difficulties. However, there is an obvious trade-off with using a distribution with heavy tails, as there will always be a number of samples with high enough norm. For those samples the generator will not be able to create sensible data points. A particular result of choosing a Cauchy distributed prior in GANs is the fact that during inference there will always be a number of "failed" generated data points due to latent vectors being sampled from the tails. Some of those faulty examples are presented in appendix B. Figure 2 shows a set of samples from the DCGAN model trained on the CelebA dataset using the Cauchy distribution and the distribution from Kilcher et al. [2017], and Figure 3 shows linear interpolations on those two models.
Figure 1: Comparison of approximate distributions of Euclidean norms for samples with increasing dimensionality from different probability distributions.
Figure 2: Comparison of samples from DCGAN trained on the Cauchy distribution and one trained on the distribution proposed by Kilcher et al. [2017].
In this section we list current work on interpolations in high dimensional latent spaces in generative models. We present two methods that perform well with noise priors with finite first moments, i.e. the mean.
Again, we define a linear interpolation between two points x_1 and x_2 as a function
f^L_{x_1,x_2} : [0, 1] ∋ λ ↦ (1 − λ)x_1 + λx_2.
In some cases we will use the term interpolation for the image of the function, as opposed to the function itself. We will list four properties an interpolation can have that we believe are important in the context of generative models:
Property 1.
The interpolation should be continuous with respect to λ, x_1 and x_2.
Property 2.
For every x_1, x_2 the interpolation f_{x_1,x_2} should represent the shortest path between the two endpoints.
Figure 3: Comparison of linear interpolations from DCGAN trained on the Cauchy distribution and one trained on the distribution proposed by Kilcher et al. [2017].
Property 3.
If two points x′_1, x′_2 are in the interpolation between x_1 and x_2, then the whole interpolation from x′_1 to x′_2 should be included in the interpolation between x_1 and x_2.
Property 4. If Z defines a distribution on the D-dimensional latent space and Z^(1), Z^(2) are independent and distributed identically to Z, then for every λ ∈ [0, 1] the random variable f_{Z^(1),Z^(2)}(λ) should be distributed identically to Z.
The first property enforces that an interpolation should not make any jumps and that interpolations between pairs of similar endpoints should also be similar to each other. The second one is purposefully ambiguous. In the absence of any additional information about the latent space it feels natural to use the Euclidean metric and assume that only the linear interpolation has this property. There has been some work on equipping the latent space with a stochastic Riemannian metric Arvanitidis et al. [2017] that additionally depends on the generator function. With such a metric the shortest path can be defined using geodesics. The third property is closely associated with the second one and codifies common-sense intuition about shortest paths. The fourth property is in our minds the most important desideratum of the linear interpolation, similarly to what Kilcher et al. [2017] stated. To understand these properties better, we will now analyze the following interpolations.
The linear interpolation is defined as
f^L_{x_1,x_2}(λ) = (1 − λ)x_1 + λx_2.
It obviously has properties 1-3. Satisfying property 4 is impossible for the most commonly used probability distributions, as they have finite mean, which was shown in observation 2.2.
As in Shoemake [1985], White [2016], the spherical linear interpolation is defined as
f^SL_{x_1,x_2}(λ) = (sin[(1 − λ)Ω] / sin Ω) · x_1 + (sin[λΩ] / sin Ω) · x_2,
where Ω is the angle between vectors x_1 and x_2.
This interpolation is continuous nearly everywhere (with the exception of antiparallel endpoint vectors) and satisfies property 3. It satisfies property 2 in the following sense: if vectors x_1 and x_2 have the same length R, then the interpolation corresponds to a geodesic on the sphere of radius R. Furthermore:
Observation 3.1.
Property 4 is satisfied if Z has the uniform distribution on the zero-centered sphere of radius R > 0.
Sketch of proof. Let λ ∈ [0, 1], and note that f_{Z^(1),Z^(2)}(λ) is concentrated on the zero-centered sphere. The distribution of all pairs sampled from Z^(1) × Z^(2) is identical to the product of two uniform distributions on the sphere, thus invariant to all isometries of the sphere. Then f_{Z^(1),Z^(2)}(λ) also must be invariant to all isometries, and the only probability distribution having this property is the uniform distribution.
Introduced in Agustsson et al. [2017], the normalized interpolation is defined as
f^N_{x_1,x_2}(λ) = ((1 − λ)x_1 + λx_2) / √((1 − λ)² + λ²).
It satisfies property 1, but neither property 2 nor 3, which can be easily shown in the extreme case of x_1 = x_2. As for property 4:
Observation 3.2.
The normalized interpolation satisfies property 4 if Z ∼ N(0, I).
Sketch of proof. Let λ ∈ [0, 1]. The random variables Z^(1) and Z^(2) are both distributed according to N(0, I). Then, using elementary properties of the normal distribution:
((1 − λ)Z^(1) + λZ^(2)) / √((1 − λ)² + λ²) ∼ N(0, I).
If vectors x_1 and x_2 are orthogonal and have equal length, then this interpolation is equal to the spherical linear interpolation from the previous section.
Here we present a general way of designing interpolations that satisfy properties 1, 3, and 4. Let:
◦ L be the D-dimensional latent space,
◦ Z define the probability distribution on the latent space,
◦ C be distributed according to the D-dimensional Cauchy distribution on L,
◦ K be a subset of L such that all mass of Z is concentrated on this set,
◦ g : L → K be a bijection such that g(C) is distributed identically to Z on K.
Then for x_1, x_2 ∈ K we define the Cauchy-linear interpolation as
f^CL_{x_1,x_2}(λ) = g((1 − λ)g⁻¹(x_1) + λg⁻¹(x_2)).
In other words, for endpoints x_1, x_2 ∼ Z:
1. Transform x_1 and x_2 using g⁻¹.
2. Linearly interpolate between the transformations to get x_λ = (1 − λ)g⁻¹(x_1) + λg⁻¹(x_2) for all λ ∈ [0, 1].
3. Transform x_λ back to the original space using g.
With some additional assumptions we can define g as CDF⁻¹_C ∘ CDF_Z, where CDF⁻¹_C is the inverse of the cumulative distribution function (CDF) of the Cauchy distribution, and CDF_Z is the CDF of the original distribution Z. If additionally Z is distributed identically to the product of D independent one-dimensional distributions, then we can use this formula coordinate-wise.
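As a concrete instance, this construction can be written in closed form for the U(−1, 1)^D prior, where both CDFs are elementary: the standard Cauchy has CDF 1/2 + arctan(x)/π, and U(−1, 1) has CDF (x + 1)/2. A coordinate-wise sketch (the helper names are ours, not the paper's):

```python
import numpy as np

def cauchy_cdf(x):                       # CDF of C(0, 1)
    return 0.5 + np.arctan(x) / np.pi

def cauchy_ppf(p):                       # inverse CDF of C(0, 1)
    return np.tan(np.pi * (p - 0.5))

def g_inv(x):                            # U(-1, 1) latent -> Cauchy space
    return cauchy_ppf((x + 1.0) / 2.0)   # CDF of U(-1, 1) is (x + 1) / 2

def g(c):                                # Cauchy space -> U(-1, 1) latent
    return 2.0 * cauchy_cdf(c) - 1.0

def cauchy_linear(x1, x2, lam):
    return g((1 - lam) * g_inv(x1) + lam * g_inv(x2))

# Midpoints of two uniform draws remain uniform, as Observation 3.3 predicts:
rng = np.random.default_rng(0)
x1, x2 = rng.uniform(-1, 1, 200_000), rng.uniform(-1, 1, 200_000)
mid = cauchy_linear(x1, x2, 0.5)
print(np.quantile(mid, [0.25, 0.5, 0.75]))  # ≈ [-0.5, 0, 0.5]
```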
Observation 3.3. With the above assumptions, the Cauchy-linear interpolation satisfies property 4. (Originally referred to as distribution matched.)
Sketch of proof. Let λ ∈ [0, 1]. First observe that g⁻¹(Z^(1)) and g⁻¹(Z^(2)) are independent and distributed identically to C. Likewise, (1 − λ)g⁻¹(Z^(1)) + λg⁻¹(Z^(2)) ∼ C. By the assumption on g we have g((1 − λ)g⁻¹(Z^(1)) + λg⁻¹(Z^(2))) ∼ Z.
We might want to enforce the interpolation to have some other desired properties, for example to behave exactly as the spherical linear interpolation whenever the endpoints have equal norms. For that purpose we require additional assumptions. Let:
◦ Z be distributed isotropically,
◦ C be distributed according to the one-dimensional Cauchy distribution,
◦ g : R → (0, +∞) be a bijection such that g(C) is distributed identically to ‖Z‖ on (0, +∞).
Then we can modify the spherical linear interpolation formula to define what we call the spherical Cauchy-linear interpolation:
f_{x_1,x_2}(λ) = ( (sin[(1 − λ)Ω] / sin Ω) · x_1/‖x_1‖ + (sin[λΩ] / sin Ω) · x_2/‖x_2‖ ) · g((1 − λ)g⁻¹(‖x_1‖) + λg⁻¹(‖x_2‖)),
where Ω is the angle between vectors x_1 and x_2. In other words:
1. Interpolate the directions of the latent vectors using spherical linear interpolation.
2. Interpolate the norms using the Cauchy-linear interpolation from the previous section.
Again, with some additional assumptions we can define g as CDF⁻¹_C ∘ CDF_{‖Z‖}. For example, let Z be a D-dimensional normal distribution with zero mean and identity covariance matrix. Then ‖Z‖ ∼ √(χ²_D) and
CDF_{√(χ²_D)}(x) = CDF_{χ²_D}(x²) = (1/Γ(D/2)) · γ(D/2, x²/2)
for every x ≥ 0. Thus we set g(x) = (CDF⁻¹_C ∘ CDF_{χ²_D})(x²), with g⁻¹(x) = √((CDF⁻¹_{χ²_D} ∘ CDF_C)(x)).
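Where the incomplete gamma function is inconvenient, g can also be approximated empirically from a reference sample of prior norms. The following quantile-mapping sketch is our own approximation (not the authors' implementation) of the norm part of the spherical Cauchy-linear interpolation for the N(0, I_D) prior:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 100

# Empirical stand-in for g = CDF_C^{-1} ∘ CDF_{||Z||}: estimate the norm CDF
# of N(0, I_D) from a reference sample instead of the closed-form chi-square CDF.
ref = np.sort(np.linalg.norm(rng.standard_normal((100_000, D)), axis=1))

def norm_cdf(r):                         # empirical CDF of ||Z||
    return np.searchsorted(ref, r) / len(ref)

def norm_ppf(p):                         # empirical inverse CDF of ||Z||
    return np.quantile(ref, p)

def cauchy_cdf(x):
    return 0.5 + np.arctan(x) / np.pi

def cauchy_ppf(p):                       # clip to avoid tan(±pi/2) at the ends
    return np.tan(np.pi * (np.clip(p, 1e-6, 1.0 - 1e-6) - 0.5))

def interp_norm(r1, r2, lam):            # Cauchy-linear interpolation of norms
    c = (1 - lam) * cauchy_ppf(norm_cdf(r1)) + lam * cauchy_ppf(norm_cdf(r2))
    return norm_ppf(cauchy_cdf(c))

# Interpolated norms should match the prior norm distribution:
r1 = np.linalg.norm(rng.standard_normal((20_000, D)), axis=1)
r2 = np.linalg.norm(rng.standard_normal((20_000, D)), axis=1)
print(interp_norm(r1, r2, 0.5).mean())   # ≈ ref.mean() ≈ sqrt(D)
```

Combining this norm transport with spherical linear interpolation of the directions yields the full spherical Cauchy-linear interpolation.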
Observation 3.4. With the assumptions as above, the spherical Cauchy-linear interpolation satisfies property 4.
Sketch of proof.
We will use the fact that two isotropic probability distributions are equal if the distributions of their Euclidean norms are equal. The following holds:
◦ All the following random variables are independent: Z^(1)/‖Z^(1)‖, Z^(2)/‖Z^(2)‖, ‖Z^(1)‖, ‖Z^(2)‖.
◦ ‖Z^(1)‖ and ‖Z^(2)‖ are both distributed identically to ‖Z‖.
◦ Z^(1)/‖Z^(1)‖ and Z^(2)/‖Z^(2)‖ are both distributed uniformly on the sphere of radius 1.
Let λ ∈ [0, 1]. Note that
(sin[(1 − λ)Ω] / sin Ω) · Z^(1)/‖Z^(1)‖ + (sin[λΩ] / sin Ω) · Z^(2)/‖Z^(2)‖   (1)
is uniformly distributed on the sphere of radius 1, which is a property of spherical linear interpolation. The norm of f_{Z^(1),Z^(2)}(λ) is distributed according to g((1 − λ)g⁻¹(‖Z^(1)‖) + λg⁻¹(‖Z^(2)‖)), which is independent of (1). Thus we have shown that f_{Z^(1),Z^(2)}(λ) is isotropic.
For the equality of norm distributions we will use a property of the Cauchy-linear interpolation: g((1 − λ)g⁻¹(‖Z^(1)‖) + λg⁻¹(‖Z^(2)‖)) is distributed identically to ‖Z‖. Thus the norm of f_{Z^(1),Z^(2)}(λ) is distributed equally to ‖Z‖.
Figure 4 shows a comparison of Cauchy-linear and spherical Cauchy-linear interpolations on a 2D plane for data points sampled from different distributions. Figure 5 shows the smoothness of Cauchy-linear interpolation and a comparison between all the aforementioned interpolations. We also compare the data samples decoded from the interpolations by the DCGAN model trained on the CelebA dataset; results are shown in Figure 6.
Figure 4: Visual comparison of Cauchy-linear interpolation for points sampled from different distributions ((a) uniform, (b) Cauchy, (c) normal) and spherical Cauchy-linear interpolation ((d) normal).
Figure 5: Visual representation of interpolation property 1 shown for Cauchy-linear interpolation (a) and comparison between all considered interpolation methods for two points sampled from a normal distribution (b) on a 2D plane.
We will briefly list the conclusions of this chapter. Firstly, the linear interpolation is the best choice if one does not care about the fourth property, i.e. interpolation samples being distributed identically to the endpoints. Secondly, for every continuous distribution satisfying some additional assumptions we can define an interpolation that has a consistent distribution between endpoints and mid-samples, but which will not satisfy property 2, i.e. will not be the shortest path in Euclidean space. Lastly, there exist distributions for which the linear interpolation satisfies the fourth property, but those distributions cannot have finite mean.
To combine the conclusions from the last two chapters: in our opinion there is a clear direction in which one would search for prior distributions for generative models, namely choosing those distributions for which the linear interpolation would satisfy all four properties listed above. On the other hand, we have shown that if one would rather stick to the more popular prior distributions, it is fairly simple to define a non-linear interpolation that would have consistent distributions between endpoints and midpoints.
Figure 6: Different interpolations on the latent space of a GAN model trained on the standard normal distribution.
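For reference, the three closed-form interpolations compared in this chapter can be sketched as follows (hypothetical helpers of ours; the spherical formula assumes the endpoints are not antiparallel, where sin Ω vanishes):

```python
import numpy as np

def linear(x1, x2, lam):
    """f^L: straight line between the endpoints (properties 1-3)."""
    return (1 - lam) * x1 + lam * x2

def spherical_linear(x1, x2, lam):
    """f^SL (Shoemake, 1985): a geodesic when ||x1|| = ||x2||."""
    cos_omega = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    return (np.sin((1 - lam) * omega) * x1 + np.sin(lam * omega) * x2) / np.sin(omega)

def normalized(x1, x2, lam):
    """f^N (Agustsson et al., 2017): rescaled linear interpolation."""
    return ((1 - lam) * x1 + lam * x2) / np.sqrt((1 - lam) ** 2 + lam ** 2)

# Midpoint norms for two equal-norm orthogonal endpoints:
x1, x2 = np.array([3.0, 0.0]), np.array([0.0, 3.0])
for f in (linear, spherical_linear, normalized):
    print(f.__name__, np.linalg.norm(f(x1, x2, 0.5)))
# linear shrinks the norm to 3/sqrt(2); the other two keep it at 3.
```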
In this section we investigate the claim that in close proximity to the origin of the latent space, generative models will generate unrealistic or faulty data Kilcher et al. [2017]. We have tried different experimental settings and were somewhat unsuccessful in replicating this phenomenon. Results of our experiments are summarized in Figure 7. Even in higher latent space dimensionality D = 200, the DCGAN model trained on the CelebA dataset was able to generate face-like images, although with a high amount of noise. We investigated this result further and empirically concluded that the effect of filling the origin of the latent space emerges during late epochs of training. Figure 8 shows linear interpolations through the origin of the latent space throughout the training process.
Data samples generated from samples located strictly inside the high-probability-mass sphere may not be identically distributed as the samples used in training, but judging from the decoded results they seem to be on the manifold of the data. On the other hand, we observed that data generated using latent vectors with high norm, i.e. far outside the sphere, are unrealistic. This might be due to the architecture, specifically the exclusive use of the ReLU activation function. Because of that, input vectors with large norms will result in abnormally high activations just before the final saturating nonlinearity (usually the tanh or sigmoid function), which in turn will make the decoded images highly color-saturated. It seems unsurprising that the exact value of the biggest sensible norm of latent samples is related to the norms of latent vectors seen during training.
The only possible explanation of the fact that the model is able to generate sensible data from out-of-distribution samples is a very strong model prior, both the architecture and the training algorithm.
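The norm-scaling argument can be made concrete with a toy random network: a stack of ReLU layers is positively homogeneous, so scaling the latent input scales the pre-nonlinearity activations by the same factor. This toy sketch is ours and is not the DCGAN used in the experiments:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a generator body: random ReLU layers before a final tanh.
weights = [rng.standard_normal((64, 64)) * np.sqrt(2.0 / 64) for _ in range(4)]

def pre_tanh_activation(z):
    h = z
    for w in weights:
        h = np.maximum(w @ h, 0.0)      # ReLU preserves positive homogeneity
    return h

z = rng.standard_normal(64)
a1 = np.linalg.norm(pre_tanh_activation(z))
a2 = np.linalg.norm(pre_tanh_activation(10.0 * z))
print(a2 / a1)  # 10: activations grow linearly with the input norm,
                # so a saturating output nonlinearity clips large-norm samples
```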
We decided to empirically test the strength of this prior in the experiment described below.
We trained a DCGAN model on the CelebA dataset using a set of different noise distributions, all of which should suffer from the aforementioned empty region in the origin of the latent space. Afterwards, using those models, we generate images decoded from out-of-distribution samples. We would not expect to generate sensible images, as those latent samples should never have been seen during training. We visualize samples decoded from inside of the high-probability-mass sphere and linear interpolations traversing through it. We test a few different prior distributions: the normal distribution N(0, I), the uniform distribution on the hypercube U(−1, 1), the uniform distribution on the sphere S(0, 1), and the discrete uniform distribution on the set {−1, 1}^D. Experiment results are shown in Figure 9, with more in appendix D.
Figure 7: Linear interpolations through the origin of the latent space for different experimental setups: a) uniform noise distribution on [−1, 1]^D, b) uniform noise distribution on a sphere S^D, c) fully connected layers, d) normal noise distribution with latent space dimensionality D = 150, e) normal noise distribution with latent space dimensionality D = 200.
Figure 8: Emergence of sensible samples decoded near the origin of the latent space throughout the training process, demonstrated in interpolations between opposite points of the latent space.
It might be that the origin of the latent space is surrounded by the prior distribution, which may be the reason the model generates sensible data from it. Thus we decided to test if the model still works if we train it on an explicitly sparse distribution. We designed a pathological prior distribution in which, after sampling from a given distribution, e.g. N(0, I), we randomly draw D − K coordinates and set them all to zero.
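A sketch of this sparse prior (the helper name is ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sparse_normal(n, d, k):
    """Sample the pathological sparse prior: N(0, I_d) with d - k randomly
    chosen coordinates of every point zeroed out."""
    z = rng.standard_normal((n, d))
    for row in z:
        row[rng.choice(d, size=d - k, replace=False)] = 0.0
    return z

z = sparse_normal(5, 100, 50)
print((z != 0).sum(axis=1))  # at most 50 nonzero coordinates per point;
                             # norms concentrate near sqrt(50), not sqrt(100)
```

This is why dense test samples are rescaled by √(K/D) below: the sparse prior concentrates norms near √K rather than √D.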
Figure 11 shows samples from such a distribution in the 3-dimensional case. Again, we trained the DCGAN model and generated images using latent samples from the dense distribution, multiplying them beforehand by √(K/D) to keep the norms consistent with those used in training. Results are shown in Figure 10, with more in the appendix.
Lastly, we wanted to check if the model is capable of generating sensible data from regions of the latent space that are completely orthogonal to those seen during training. We created another pathological prior distribution in which, after sampling from a given distribution (e.g. N(0, I)), we set the D − K last coordinates of each point to zero. As before, we trained the DCGAN model and generated images using samples from the original distribution, but this time with the first K dimensions set to zero. Results are shown in Figure 12, with more in the appendix.
Figure 9: Images generated from a DCGAN model trained on a 100-dimensional normal distribution: (a) samples from the train distribution (used in training) and the test distribution (same as the train distribution but with lower variance), (b) linear interpolations between random points from the train distribution.
To conclude our experiments, we briefly remark on the results. For the first experiment, images generated from the test distribution are clearly meaningful despite being decoded from samples from low-probability-mass regions.
One thing to note is the fact that they are clearly less diverse than those from the training distribution.
The decoded images from our second experiment are in our opinion similar enough between the train and test distributions to conclude that the DCGAN model does not suffer from training on a sparse latent distribution and is able to generalize without any additional mechanisms.
Our last experiment shows that while the DCGAN is still able to generate sensible images from regions orthogonal to the space seen during training, those regions still impact the generative power and may lead to unrealistic data if they increase the activations in the network enough.
We observed that problems with a hollow latent space and linear interpolations might be caused by stopping the training early or by using a weak model architecture. This leads us to the conclusion that one needs to be very careful when comparing the effectiveness of different latent probability distributions and interpolation methods.
We investigated the properties of multidimensional probability distributions in the context of latent noise distributions of generative models. In particular, we looked for distribution-interpolation pairs where the distributions of interpolation endpoints and midpoints are identical.
We have shown that using the D-dimensional Cauchy distribution as the latent probability distribution makes linear interpolations between any number of latent points satisfy that consistency property. We have also shown that for popular priors with finite mean, it is impossible for linear interpolations to satisfy the above property. We argue that one can fairly easily find a non-linear interpolation that will satisfy this property, which makes a search for such an interpolation less interesting. Those results are formal and should hold for every generative model with a fixed distribution on the latent space. Although, as we have shown, the Cauchy distribution comes with a few useful theoretical properties, it is still perfectly fine to use the normal or uniform distribution, as long as the model is powerful enough.
We also observed empirically that DCGANs, if trained long enough, are capable of generating sensible data from out-of-distribution latent samples. We have tested several pathological cases of latent priors to give a glimpse of what the model is capable of. At this moment we are unable to explain this phenomenon and point to it as a very interesting direction for future work.
Figure 10: Images generated from a DCGAN model trained on a sparse 100-dimensional normal distribution with K = 50: (a) samples from the train distribution (used in training) and the test distribution (a dense normal distribution with adjusted norm), (b) linear interpolations between random points from the train distribution.
Figure 11: Visualization of the sparse training distribution for D = 3, K = 1, brown points being sampled from the train distribution, green from the test distribution.
Figure 12: Images decoded from the DCGAN model trained on a pathological
References
Gaussian distributions are soap bubbles. Accessed: 2018-05-22.

Eirikur Agustsson, Alexander Sage, Radu Timofte, and Luc Van Gool. Optimal transport maps for distribution preserving operations on latent spaces of generative models. arXiv preprint arXiv:1711.01970, 2017.

Georgios Arvanitidis, Lars Kai Hansen, and Søren Hauberg. Latent space oddity: on the curvature of deep generative models. arXiv preprint arXiv:1710.11379, 2017.

Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Yannic Kilcher, Aurelien Lucchi, and Thomas Hofmann. Semantic interpolation in implicit models. arXiv preprint arXiv:1710.11381, 2017.

Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Ken Shoemake. Animating rotation with quaternion curves. In ACM SIGGRAPH Computer Graphics, volume 19, pages 245–254. ACM, 1985.

Tom White. Sampling generative networks: Notes on a few effective techniques. arXiv preprint arXiv:1609.04468, 2016.

Appendices
A Experimental setup
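The generator described below can be summarized by walking through its tensor shapes. This is our own plain-Python sketch; the kernel size of 5 (with padding 2 and output padding 1), the reshaping of the 8192-unit linear output into a 4×4 map with 512 channels, and the 64×64 RGB output resolution are all assumptions, since the text does not state them explicitly:

```python
# Shape walk-through of a DCGAN-style generator: a linear layer with 8192
# units followed by four stride-2 transposed convolutions with 256, 128,
# 64, and 3 filters, matching the setup in this appendix.

def conv_transpose_out(size, kernel=5, stride=2, padding=2, output_padding=1):
    """Spatial output size (per side) of a 2-D transposed convolution."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

latent_dim = 100
linear_units = 8192                 # assumed to be reshaped to 4 x 4 x 512
size = 4
channels = linear_units // (size * size)

shapes = [(channels, size, size)]
for filters in (256, 128, 64, 3):   # last layer uses tanh, the rest ReLU
    size = conv_transpose_out(size)
    shapes.append((filters, size, size))

for shape in shapes:
    print(shape)                    # (512, 4, 4) -> ... -> (3, 64, 64)
```

Each stride-2 layer doubles the spatial resolution, so four layers take the 4×4 map to 64×64, which is consistent with the CelebA images being resized before training.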
All experiments are run using the DCGAN model. The generator network consists of a linear layer with 8192 neurons, followed by four transposed convolution layers, each using square filters and a stride of 2, with the number of filters per layer being 256, 128, 64, and 3. All layers use the ReLU activation, except for the output layer, which uses tanh. The discriminator's architecture mirrors that of the generator, with the single exception of using leaky ReLU instead of vanilla ReLU in all but the last layer. No batch normalization is used in either network. The Adam optimizer is used, with a learning rate of e− and momentum set to . ; a batch size of 64 is used throughout all experiments. Unless explicitly stated otherwise, the latent space dimension is 100 and the noise is sampled from the multidimensional normal distribution N(0, I). For the CelebA dataset we resize the input images to × . The code to reproduce all our experiments is available at: coming soon!

B Cauchy distribution - samples and interpolations
Figure 13: Generated images from samples from the Cauchy distribution, with occasional "failed" images from the tails of the distribution.

Figure 14: Generated images from samples from the Cauchy distribution with different latent space dimensionalities.

Figure 15: Linear interpolations between random points on a GAN trained on the Cauchy distribution.

Figure 16: Linear interpolations between opposite points on a GAN trained on the Cauchy distribution.

Figure 17: Linear interpolations between hand-picked points from the tails of the Cauchy distribution.
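The consistency property behind these interpolations is easy to verify numerically: a linear combination (1 − λ)x1 + λx2 of independent standard Cauchy samples is Cauchy with scale (1 − λ) + λ = 1, i.e. standard Cauchy again, whereas a normal prior's midpoints shrink toward the origin. A minimal numerical check (our own sketch, not the authors' code; one coordinate suffices since the D-dimensional prior is i.i.d. per coordinate):

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 200_000, 0.5

# Independent standard Cauchy endpoints and their linear interpolation.
x1 = rng.standard_cauchy(n)
x2 = rng.standard_cauchy(n)
mid = (1 - lam) * x1 + lam * x2

# The standard Cauchy 75th percentile is tan(pi/4) = 1; it should match
# for endpoints and midpoints alike.
q_end = np.quantile(x1, 0.75)
q_mid = np.quantile(mid, 0.75)
print(q_end, q_mid)          # both close to 1.0

# Contrast: normal midpoints have std sqrt((1-lam)^2 + lam^2) ~ 0.707,
# i.e. they are drawn toward the low-probability-mass region at the origin.
g_mid = (1 - lam) * rng.standard_normal(n) + lam * rng.standard_normal(n)
print(np.std(g_mid))
```

Quantiles are used instead of moments because the Cauchy distribution has no finite mean or variance, which is exactly why it escapes the Central Limit Theorem argument in the main text.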
C More Cauchy-linear and spherical Cauchy-linear interpolations
Figure 18: Cauchy-linear interpolations between opposite points on a GAN trained on the Normal distribution.

Figure 19: Cauchy-linear interpolations between random points on a GAN trained on the Normal distribution.

Figure 20: Spherical Cauchy-linear interpolations between random points on a GAN trained on the Normal distribution.
D More experiments with hollow latent space
Figure 21: Images generated from a DCGAN model trained on a 100-dimensional uniform distribution on the hypercube [−1, 1]^D. Train samples (from the distribution used in training) and test samples (same as the train distribution but multiplied by α < 1) (a), and interpolations between random points from the training distribution (b).

Figure 22: Images generated from a DCGAN model trained on a 100-dimensional uniform distribution on a sphere S. Train samples (from the distribution used in training) and test samples (same as the train distribution but multiplied by α < 1) (a), and interpolations between random points from the training distribution (b).

Figure 23: Images generated from a DCGAN model trained on a 100-dimensional discrete distribution on {−1, 1}. Train samples (from the distribution used in training) and test samples (same as the train distribution but multiplied by α < 1
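Why multiplying by α < 1 probes a "hollow" region can be seen from norm concentration. For the 100-dimensional normal prior used in most experiments, latent norms cluster tightly around √D = 10, so scaled-down test samples occupy a shell the training prior essentially never reaches; a quick numerical check (our own sketch, with α = 0.5 as an illustrative value):

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, alpha = 100_000, 100, 0.5

z = rng.standard_normal((n, D))
norms = np.linalg.norm(z, axis=1)

# In D = 100 the norms concentrate around sqrt(D) = 10 with std ~ 0.71 ...
print(norms.mean(), norms.std())

# ... so the fraction of training samples with norm below alpha * sqrt(D)
# is effectively zero: the scaled test shell is never seen in training.
frac_below = np.mean(norms < alpha * np.sqrt(D))
print(frac_below)
```

The same concentration argument applies, with different constants, to the uniform hypercube and discrete {−1, 1}-type priors in Figures 21 and 23.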