A Spectral Regularizer for Unsupervised Disentanglement
Aditya Ramesh
Youngduck Choi
Yann LeCun

Abstract
A generative model with a disentangled representation allows for control over independent aspects of the output. Learning disentangled representations has been a recent topic of great interest, but it remains poorly understood. We show that even for GANs that do not possess disentangled representations, one can find curved trajectories in latent space over which local disentanglement occurs. These trajectories are found by iteratively following the leading right-singular vectors of the Jacobian of the generator with respect to its input. Based on this insight, we describe an efficient regularizer that aligns these vectors with the coordinate axes, and show that it can be used to induce disentangled representations in GANs, in a completely unsupervised manner.
1. Introduction
Rapid progress has been made in the development of generative models capable of producing realistic samples from distributions over complex, high-resolution natural images (Karras et al. (2017), Brock et al. (2018)). Despite the success of these models, it is unclear how one can achieve control over the data-generation process when labeled instances are not available. Perturbing an individual component of a latent variable typically results in an unpredictable change to the output. When the model has a disentangled representation, these changes become interpretable, and each component of the latent variable affects a distinct attribute. So far, the problem of learning disentangled representations remains poorly understood. Most approaches for doing this have focused on VAEs (Kingma & Welling (2013), Rezende et al. (2014)), which produce blurry samples in practice. The few approaches that have been developed for GANs, such as InfoGAN (Chen et al., 2016), have had comparably limited success.

OpenAI. Continuation of work done at New York University. Department of Computer Science, Yale University; Department of Computer Science, New York University. Correspondence to: Aditya Ramesh.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
Figure 1: Trajectory obtained by following the leading eigenvector of M_z(z) from a fixed embedding z in the latent space of a GAN trained on the LSUN Bedrooms dataset. The camera angle is varied smoothly, while the content of the bedroom is preserved. The architecture and training procedure are as described in (Mescheder et al., 2018). Videos of the trajectories for the first four eigenvectors can be viewed at this URL: https://drive.google.com/open?id=1_SxlvcjakzkL8vNY4hporwtvg8mLQw4S

Learning disentangled representations is of interest because they allow us to gauge the ability of generative models to benefit downstream tasks. For example, a robotics application may require a vision model that reliably determines the orientation of an object, subject to changes in viewpoint, lighting conditions, and intra-class variation in appearance. A generative model with independent control over the full extent of each of these attributes may help to achieve the level of robustness that is required for these applications. Moreover, the ability to synthesize concepts in ways that are not present in the original data (e.g., rotation of an object to a position not encountered during training) is a useful benchmark for reasoning and out-of-distribution generalization. Oftentimes, the goal of training a generative model is to observe that this type of generalization occurs.

Following (Higgins et al., 2016), we say a generative model G : Z → X, where Z ⊂ R^m and X ⊂ R^n, possesses a disentangled representation when it satisfies two important properties. The first property is independence of the components of the latent variable. Roughly speaking, this means that a perturbation to any component of a latent variable should result in a change to the output that is distinct, in some sense, from the changes resulting from perturbations to the other components. For example, if the first component of the latent variable controls hair length, then the other components should not influence this attribute. The second property is interpretability of the changes in X resulting from these perturbations. Suppose that the process of sampling from the distribution p_data modeled by G can be realized by a simulator that is parameterized over a list of independent, scalar-valued factors or attributes. These attributes might correspond to concepts such as lighting, azimuth, gender, age, and so on. This property is met when G(z + αe_i) results in a change along exactly one of these attributes, for each i ∈ [1, m], where e_i is the i-th standard basis vector and α > 0. Quantifying the extent to which this property holds is generally challenging.

Classical dimensionality reduction techniques, such as PCA and metric MDS, can be used to obtain a latent representation satisfying the first property. For instance, the former ensures that changes to the output resulting from changes to distinct components of the latent variable are orthogonal, which is a particularly restrictive form of distinctness. These techniques are guaranteed to faithfully represent the underlying structure only when the data occupy a linear subspace. For many applications of interest, such as image modeling, the data manifold could be a highly curved and twisted surface that does not satisfy this assumption.
The Swiss Roll dataset is a simple example for which this is the case: two points with small Euclidean distance could have large geodesic distance on the data manifold. Since both techniques measure similarity between each pair of points using Euclidean distance, distant outputs in X could get mapped to nearby embeddings in Z. Thus, the second property often fails to hold in practice.

Deep generative models have the potential to learn representations that satisfy both of the required properties for disentanglement. Our work focuses on the case in which G is a GAN generator. Let J_G(z) ∈ R^{n×m} be the Jacobian of G evaluated at the latent variable z ∈ Z. Our main contributions are as follows:

1. We show that the leading right-singular vectors of J_G(z) can be used to obtain a local disentangled representation about a neighborhood of z;
2. We show that following the path determined by a leading right-singular vector in the latent space of a GAN yields a trajectory along which local changes to the output are interpretable; and
3. We formulate a regularizer that induces disentangled representations in GANs, by aligning the top k right-singular vectors v_1, ..., v_k with the first k coordinate directions e_1, ..., e_k.
2. Related Work
To date, the two most successful approaches for unsupervised learning of disentangled representations are the β-VAE (Higgins et al., 2016) and InfoGAN (Chen et al., 2016). The former proposes increasing the weight β for the KL-divergence term in the objective of the VAE (Kingma & Welling (2013), Rezende et al. (2014)), which is normally set to one. This weight controls the tradeoff between minimizing reconstruction error, and minimizing the KL-divergence between the approximate posterior and the prior of the decoder. The authors observe that as the KL-divergence is reduced, more coordinate directions in the latent space of the decoder end up corresponding to disentangled factors. To measure the extent to which this occurs, they describe a disentanglement metric score, which we use in Section 6. Follow-up work (Burgess et al., 2018) analyzes the β-VAE from the perspective of information channel capacity, and describes a principled way of controlling the tradeoff between reconstruction error and disentanglement. Several variants of the β-VAE have also since been developed (Kim & Mnih (2018), Chen et al. (2018), Esmaeili et al. (2018)).

InfoGAN (Chen et al., 2016) augments the GAN objective (Goodfellow et al., 2014) with a term that maximizes the mutual information between the generated samples and a small subset of the latent variables. This is done by means of an auxiliary classifier that is trained along with the generator and the discriminator. Adversarial training offers the potential for a high degree of realism that is difficult to achieve with models like VAEs, which are based on reconstruction error. However, InfoGAN finds fewer disentangled factors than VAE-based approaches on datasets such as CelebA (Liu et al., 2015) and 3D Chairs (Aubry et al., 2014), and offers limited control over each latent factor (e.g., maximum rotation angle for azimuth). The mutual information regularizer also harms sample quality. Hence, the development of an unsupervised method for learning disentangled representations with GANs, while meeting or exceeding the quality of those found by VAE-based approaches, remains an open problem.

Our work makes an important step in this direction. In contrast to previous approaches, which are based on information theory, we leverage the spectral properties of the Jacobian of the generator. This new perspective not only allows us to induce high-quality disentangled representations in GANs, but also to find curved paths over which disentanglement occurs, for GANs that do not possess disentangled representations.
3. Identifying Local Disentangled Factors
We begin by describing how the singular value decomposition of J_G(z) can be used to define a local generative model about z that satisfies the first property of disentangled representations. In particular, we show that the left-singular vectors of J_G(z) form an orthonormal set of directions from G(z) along which the magnitudes of the instantaneous changes in G are maximized. Since perturbations from z along the right-singular vectors result in changes from G(z) along the left-singular vectors, the right-singular vectors result in distinct changes to the output of G. This simple relationship allows us to define a local generative model that satisfies the independence property.

First, we show that perturbations from z along the right-singular vectors maximize the magnitude of the instantaneous change in G(z). Given an arbitrary vector v ∈ R^m, the directional derivative

$$\lim_{\epsilon \to 0} \frac{G(z + \epsilon v) - G(z)}{\epsilon} = J_G(z)\, v \tag{1}$$

measures the instantaneous change in G resulting from a perturbation along v from z. The magnitude of this change is given by

$$\|J_G(z)\, v\| = \big(v^t J_G(z)^t J_G(z)\, v\big)^{1/2} =: \big(v^t M_z(z)\, v\big)^{1/2} =: n_z(v, z), \tag{2}$$

which is a seminorm involving the positive semidefinite matrix M_z(z) := J_G(z)^t J_G(z) ∈ R^{m×m}. The unit-norm perturbation from z that maximizes n_z(v, z) is given by

$$v_1 := \operatorname*{arg\,max}_{v \in S^{m-1}} n_z(v, z) = \operatorname*{arg\,max}_{v \in S^{m-1}} v^t J_G(z)^t J_G(z)\, v, \tag{3}$$

where S^{m−1} ⊂ R^m is the unit sphere. This is the first eigenvector of M_z(z), which coincides with the first right-singular vector of J_G(z). It follows from the singular value decomposition of J_G(z) that the first left-singular vector is given by u_1 = σ_1^{−1} J_G(z) v_1, where σ_1 is the first singular value. Hence, a perturbation from z along v_1 maximizes the magnitude of the instantaneous change in G(z), and this change occurs along u_1.

Next, we consider the unit-norm perturbation orthogonal to v_1 that maximizes n_z(v, z). It is given by

$$v_2 := \operatorname*{arg\,max}_{v \in S^{m-1} \cap\, \operatorname{span}(v_1)^\perp} v^t J_G(z)^t J_G(z)\, v. \tag{4}$$

This is the second eigenvector of M_z(z), which coincides with the second right-singular vector of J_G(z). As before, we get u_2 = σ_2^{−1} J_G(z) v_2, where σ_2 is the second singular value. So a perturbation from z along v_2 results in an instantaneous change in G(z) along u_2. Continuing in this way, we consider the k-th unit-norm perturbation orthogonal to v_1, ..., v_{k−1} that maximizes n_z(v, z), for each k ∈ [2, r], where r := rank(M_z(z)). This shows that the right-singular vectors of J_G(z) maximize the magnitude of the instantaneous change in G(z), and these changes occur along the corresponding left-singular vectors.

Now, we use the right-singular vectors of J_G(z) to define a local generative model about z. Consider the function

$$\bar{G}_z : \mathbb{R}^r \to \mathbb{R}^n, \qquad \bar{G}_z : \alpha \mapsto G\Big(z + \sum_{i \in [1, r]} \alpha_i v_i\Big).$$

The components of α control perturbations along orthonormal directions, and these directions also result in orthonormal changes to G(z). Hence, Ḡ_z satisfies the first property for a generative model to possess a disentangled representation, but only about a neighborhood of z. Figure 13 in Appendix F investigates whether Ḡ_z also satisfies the second property: interpretability of changes to individual components of α. We can see that perturbations along the leading eigenvectors of M_z(z), especially the principal eigenvector, often result in the most drastic changes. These changes are interpretable, and tend to make modifications to isolated attributes of the face. To see this in more detail, we consider the top two rows of subfigure (g). Movement along the first two eigenvectors changes hair length and facial orientation; movement along the third eigenvector decreases the length of the bangs; movement along the fourth and fifth eigenvectors changes background color; and movement along the sixth and seventh eigenvectors changes hair color.

Figure 2: Trajectories obtained by following the first three eigenvectors of M_z(z) from a fixed embedding z ∈ R^3 in the latent spaces of two GANs with identical architectures, trained on the dSprites dataset (Matthey et al., 2017). Subfigure (a) shows plots of the trajectories {γ_k} for k ∈ [1, 3], and subfigure (b) shows the outputs of G at iterates of each trajectory γ_k, for k ∈ [1, 3], from top to bottom, respectively. Subfigures (c) and (d) show the same information for an alignment-regularized GAN (k = 3); the trajectories are now axis-aligned, as expected. Details regarding the model architecture are given in Appendix C.
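To make the construction concrete, the following is a minimal NumPy sketch (not the paper's implementation) of the local model Ḡ_z for a toy generator. The two-layer generator, the finite-difference Jacobian, and all names here are illustrative assumptions; in practice the Jacobian would be accessed through automatic differentiation, as described in Appendix A.

import numpy as np

def numerical_jacobian(G, z, eps=1e-5):
    # Estimate J_G(z) by central finite differences (illustration only).
    m, n = z.size, G(z).size
    J = np.zeros((n, m))
    for i in range(m):
        e = np.zeros(m)
        e[i] = eps
        J[:, i] = (G(z + e) - G(z - e)) / (2 * eps)
    return J

def local_model(G, z):
    # Build G_bar_z : alpha -> G(z + sum_i alpha_i v_i) from the
    # right-singular vectors of J_G(z).
    J = numerical_jacobian(G, z)
    _, sigma, Vt = np.linalg.svd(J, full_matrices=False)
    r = int(np.sum(sigma > 1e-8))      # r = rank(M_z(z))
    V = Vt[:r].T                       # columns are v_1, ..., v_r
    return (lambda alpha: G(z + V @ alpha)), V

# Toy generator: a fixed random two-layer network mapping R^4 to R^16.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(32, 4)), rng.normal(size=(16, 32))
G = lambda z: W2 @ np.tanh(W1 @ z)

z = rng.normal(size=4)
G_bar, V = local_model(G, z)
# The instantaneous output changes J v_i are mutually orthogonal, so the
# Gram matrix (JV)^t (JV) is approximately diagonal.
J = numerical_jacobian(G, z)
print(np.round((J @ V).T @ (J @ V), 3))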
4. Finding Quasi-Disentangled Paths
Generative models known to possess disentangled representations, such as β-VAEs (Higgins et al., 2016), allow for continuous manipulation of attributes via perturbations to individual coordinates. Starting from a latent variable z ∈ Z, we can move along the path γ : t ↦ z + te_i in order to vary a single attribute of G(z), while keeping the others held fixed. GANs are not known to learn disentangled representations automatically. Nonetheless, the previous section shows that the local generative model Ḡ_z does possess a disentangled representation, but only about a neighborhood of the base point z. We explore whether it is possible to extend this local model to obtain disentanglement along a continuous path from z. To do this, we construct a trajectory γ_k in latent space, and examine its image t ↦ G(γ_k(t)), by repeatedly following the k-th leading eigenvector of M_z(γ_k(t)), where γ_k(0) := z. The procedure used to do this is given by Algorithm 1.

We first test the procedure on a toy example for which it is possible to explicitly plot the trajectories. We use the dSprites dataset (Matthey et al., 2017), which consists of 64 × 64 white shapes on black backgrounds. Each shape is completely determined by six attributes: symbol (square, ellipse, or heart), scale (6 values), rotation angle from 0 to 2π (40 values), and x- and y-positions (30 values each). We trained a GAN on this dataset using a latent variable size of three, fewer than the six latent factors that determine each shape. Figure 2 shows that outputs of the generator along these trajectories vary locally along only one or two attributes at a time. Along the first trajectory γ_1, the generator first decreases the scale of the square, then morphs it into a heart, increases the scale again, and finally begins to rotate it. Similar comments apply to the other two trajectories, γ_2 and γ_3.

Next, we test the procedure on a GAN trained on the CelebA dataset (Liu et al., 2015). Figure 3(a) shows the trajectories γ_1 starting at four fixed embeddings z_1, ..., z_4. Although the association between the ordinal k of the eigenvector v_k and the attribute of the image being varied is not consistent throughout latent space, local changes still tend to occur along only one or two attributes at a time. Figure 3(b) shows the trajectories γ_1, γ_2, γ_3, and γ_5, all starting from the same fixed embedding z_5. As is the case for the dSprites dataset, we can see that trajectories γ_k for distinct k tend to effect changes to G(z) along distinct attributes. These results suggest that, along trajectories from z determined by a leading eigenvector of M_z(z), changes in the output tend to occur along isolated attributes.
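For illustration, the path-following procedure (formalized as Algorithm 1 in the next section) can be sketched in a few lines of NumPy. The generator G and the Jacobian routine jac are assumed to be supplied (for example, the toy constructions from the previous sketch); the sign-consistency check and the decay follow the description of Algorithm 1.

import numpy as np

def trace_eigenpath(G, z0, k, jac, alpha=0.05, rho=0.9, N=100):
    # Follow the k-th leading eigenvector of M_z(z) = J^t J from z0.
    # k is 1-indexed, matching the paper; jac(G, z) returns J_G(z).
    z, w_prev, path = z0.copy(), None, []
    for _ in range(N):
        J = jac(G, z)
        M = J.T @ J                            # M_z(z)
        _, eigvecs = np.linalg.eigh(M)         # eigenvalues in ascending order
        w = eigvecs[:, -k]                     # k-th leading eigenvector
        if w_prev is not None:
            if np.dot(w_prev, w) < 0:          # prevent backtracking
                w = -w
            w = rho * w_prev + (1 - rho) * w   # decay smooths the trajectory
        z = z - alpha * w
        w_prev = w
        path.append(z.copy())
    return path

In the experiments that follow, the full Jacobian is not formed; the eigenvectors are instead obtained from matrix-vector products with M_z(z), computed by automatic differentiation, as detailed in Appendix A.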
5. Aligning the Local Disentangled Factors
The β-VAE (Higgins et al., 2016) is known to learn disentangled representations, so that traveling along paths of the form

$$\gamma_j : t \mapsto z + t e_j, \tag{5}$$

for certain coordinate directions e_j, produces changes to isolated attributes of G(z). In the previous section, we saw that such paths still exist for GANs, but they are not simply straight lines oriented along the coordinate axes (see Figure 2). We develop an efficient regularizer that encourages these paths to take the form of Equation 5, based on alignment of the top k eigenvectors of M_z(z) with the first k coordinate directions e_1, ..., e_k.

Figure 3: Trajectories found by following leading eigenvectors from five fixed embeddings z_1, ..., z_5, using Algorithm 1 (ρ = 0). (a) Trajectories corresponding to the principal eigenvector for embeddings z_1, ..., z_4, from top to bottom, respectively. The first trajectory varies azimuth; the second, gender; the third, hair length; and the fourth, hair color. (b) Trajectories corresponding to the first, second, third, and fifth leading eigenvectors of embedding z_5, from top to bottom, respectively. The first trajectory primarily varies azimuth; the second, gender; the third, age; and the fourth, hair color and presence of facial hair. Details regarding the model architecture are given in Appendix C.

Algorithm 1 Procedure to trace the path determined by the k-th leading eigenvector.
Require: mv : R^m × R^m → R^m, (z, v) ↦ M_z(z)v is a function that computes matrix-vector products with the implicitly-defined matrix M_z(z) ∈ R^{m×m}.
Require: z_0 ∈ R^m is the embedding from which to begin the trajectory.
Require: k ∈ [1, m] is the ordinal of the eigenvector to trace, with k = 1 corresponding to the leading eigenvector.
Require: α > 0 is the step size.
Require: ρ ∈ [0, 1) is the decay factor.
Require: N ≥ 1 is the required number of steps in the trajectory.
procedure TraceEigenpath(mv, z_0, k, α, ρ, N)
    z ← z_0
    for i ∈ [1, N] do
        M_z(z) ← EvaluateNormalJacobian(mv, z)    ▷ Details in Appendix A
        V, D, V^t ← SVD(M_z(z))
        w_i ← v_k    ▷ Take the k-th eigenvector
        if i ≥ 2 then
            if ⟨w_{i−1}, w_i⟩ < 0 then
                w_i ← −w_i    ▷ Prevent backtracking by ensuring that ∠(w_{i−1}, w_i) ≤ π/2
            w_i ← ρ w_{i−1} + (1 − ρ) w_i    ▷ Apply decay to smooth the trajectory
        z_i ← z_{i−1} − α w_i
    return {z_1, . . . , z_N}

Before proceeding, we describe a useful visualization technique to help measure the extent to which this happens. Let M_z(z) = V D V^t be the eigendecomposition of M_z(z), where V is an orthogonal matrix whose columns are eigenvectors, and D the diagonal matrix of nonnegative eigenvalues, sorted in descending order. Now we define Ṽ : z ↦ V to be the function that maps z to the corresponding eigenvector matrix V, and let

$$F := \mathbb{E}_{z \sim p_z}\, \tilde{V}(z) \circ \tilde{V}(z), \tag{6}$$

where p_z is the prior over Z, and '∘' denotes the Hadamard product. If the k-th column of F is close to a one-hot vector, with values close to zero everywhere except at entry j, then we know that on average, the k-th leading eigenvector of M_z(z) is aligned with e_j. A heatmap generated from this matrix therefore allows us to gauge the extent to which each eigenvector v_k is aligned with the coordinate direction e_k. Figure 5 shows that this does not happen automatically for a GAN, even when it is trained with a small latent variable size. Interestingly, eigenvector alignment does occur for β-VAEs. Appendix B explores this connection in more detail.

We begin by considering the case where we only seek to align the leading eigenvector v_1 ∈ R^m with the first coordinate direction e_1 ∈ R^m. A simple way to do this is to obtain an estimate v̂_1 for v_1 using T power iterations, and then renormalize v̂_1 ∈ R^m to a unit vector. We can then maximize the value of the first component of the elementwise squared vector v̂_1 ∘ v̂_1, and minimize the values of the remaining components. Using the mask s_1 := (−1, 1, ..., 1)^t ∈ R^m, we define the regularizer R_1 : R^m → R,

$$R_1(z) := \sum_{i \in [1, m]} (s_1 \circ \hat{v}_1 \circ \hat{v}_1)_i. \tag{7}$$

Since v̂_1 is constrained to unit norm, this regularizer is bounded.
It can be incorporated into the loss for the generator using an appropriate penalty weight λ > 0.

Next, we consider the case where we would like to align the first two leading eigenvectors v_1, v_2 ∈ R^m with the first two coordinate directions e_1, e_2 ∈ R^m. One potential approach is to first compute an estimate v̂_1 of v_1 using T power iterations, as before. Then, we could apply a modified version of the power method to obtain an estimate v̂_2 for v_2, in which we project the result of each power iteration onto span(v̂_1)^⊥ using the projection P_1 := I − v̂_1 v̂_1^t. There are two problems with this approach. First, it could be inaccurate: unless ‖v_1 − v̂_1‖ < τ for sufficiently small τ > 0, which may require a large number of power iterations, P_1 will not be an accurate projection onto span(v_1)^⊥. Error in approximating v_1 would then jeopardize the approximation to v_2. Second, the approach is inefficient. We can only run the power method to estimate v_2 after we have already obtained an estimate for v_1. If we use T power iterations to estimate each eigenvector, then estimating the first k eigenvectors will require a total of kT power iterations. This is too slow to be practical.

Our specific application of the power method enables an optimization that allows the k power iterations to be run simultaneously. Once we apply the regularizer to the GAN training procedure, the first eigenvector v_1 will quickly align with the first coordinate direction e_1. We therefore assume that v_1 = e_1. This assumption would imply that span(v_1)^⊥ = span(e_2, ..., e_m), so applying P_1 would amount to zeroing out the first component of v̂_2 after each power iteration. Since P_1 would no longer depend on v̂_1, we can run the power iterations for v_1 and v_2 in parallel. To formally describe this, we let c_p := 1 ∈ R^p be the constant vector of ones, and let

$$M_2 := \begin{pmatrix} c_m & c_m - e_1 \end{pmatrix}. \tag{8}$$

Given a matrix V̂_t^{(2)} ∈ R^{m×2} whose columns are the current estimates for v_1 and v_2, respectively, we can describe the power iterations for v_1 and v_2 using the recurrence

$$\hat{V}^{(2)}_{t+1} := M_2 \circ \big(M_z(z)\, \hat{V}^{(2)}_t\big). \tag{9}$$

Now, let V̂^{(2)} ∈ R^{m×2} be the final estimate for v_1 and v_2. To implement the regularizer, we let s_2 := (1, −1, 1, ..., 1)^t ∈ R^m, S_2 := (s_1 s_2) ∈ R^{m×2}, and define

$$R_2(z) := \sum_{\substack{i \in [1, m] \\ j \in [1, 2]}} \big(S_2 \circ \hat{V}^{(2)} \circ \hat{V}^{(2)}\big)_{i,j}. \tag{10}$$

It is straightforward to generalize this approach to the case where we seek to align the first k eigenvectors v_1, ..., v_k with e_1, ..., e_k. For each eigenvector v_i, with i ∈ [2, k], we assume that the eigenvectors v_1, ..., v_{i−1} are already aligned with e_1, ..., e_{i−1}. The projections onto span(e_1)^⊥, span(e_1, e_2)^⊥, ..., span(e_1, ..., e_{i−1})^⊥ can then be implemented using columns 2, 3, ..., i, respectively, of the mask M_k ∈ R^{m×k}. This mask, which is a generalization of the one defined by Equation 8, is given by

$$\mathbb{R}^{m \times k} \ni M_k := \begin{pmatrix} c_k & c_k - e_1 & c_k - (e_1 + e_2) & \cdots & c_k - (e_1 + \cdots + e_{k-1}) \\ c_{m-k} & c_{m-k} & c_{m-k} & \cdots & c_{m-k} \end{pmatrix}. \tag{11}$$

The resulting procedure to estimate the leading k eigenpairs is described by Algorithm 2.

Algorithm 2 Procedure to estimate the top k eigenpairs of M_z(z).
Require: mv : R^m → R^m is a function that computes matrix-vector products with the implicitly-defined matrix M_z(z) ∈ R^{m×m}.
Require: V_0 ∈ R^{m×k} is a matrix whose columns are the initial estimates for the eigenvectors.
Require: T ≥ 1 is the required number of power iterations.
Require: ε > 0 is a small constant that guards against division by numbers close to zero.
procedure EstimateLeadingEigenpairs(mv, V_0, T)
    Let M_k ∈ R^{m×k} be given by Equation 11
    V_0 ← M_k ∘ V_0    ▷ '∘' denotes Hadamard product
    for i ∈ [1, T] do
        V_i ← M_k ∘ mv(V_{i−1})
        Λ_i ← diag(ColumnNorms(V_i))
        V_i ← V_i (Λ_i + εI)^{−1}    ▷ Renormalize columns
    return Λ_T, V_T
procedure ColumnNorms(A)    ▷ A ∈ R^{p×q}
    return (‖a_1‖, ..., ‖a_q‖)    ▷ a_i ∈ R^p is the i-th column of A
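The masked, parallel power iteration can be sketched in NumPy as follows. This is an illustrative reading of Algorithm 2, where mv is assumed to apply M_z(z) to each column of its argument (for example, via the Jacobian-vector products of Appendix A), and the mask follows Equation 11.

import numpy as np

def make_mask(m, k):
    # M_k from Equation 11: column j zeroes out the first j - 1 components.
    M = np.ones((m, k))
    for j in range(1, k):
        M[:j, j] = 0.0
    return M

def estimate_leading_eigenpairs(mv, V0, T, eps=1e-8):
    # Run T masked power iterations for all k eigenvector estimates at once.
    # mv maps an (m, k) matrix to M_z(z) applied to each of its columns.
    m, k = V0.shape
    Mk = make_mask(m, k)
    V = Mk * V0
    for _ in range(T):
        V = Mk * mv(V)                         # masked power step (Equation 9)
        lams = np.linalg.norm(V, axis=0)       # column norms
        V = V / (lams + eps)                   # renormalize columns
    return lams, V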
Figure 4 shows that the runtime of Algorithm 2 scales linearly with respect to the number of eigenvectors k and the number of power iterations T. Next, we generalize Equation 10 in order to describe how to evaluate the regularizer R_k. Let s_p ∈ R^m be given by (s_p)_i = −1 if i = p and 1 otherwise, and define

$$S_k := \begin{pmatrix} -1 & 1 & \cdots & 1 \\ 1 & -1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & -1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} s_1 & \cdots & s_k \end{pmatrix} \in \mathbb{R}^{m \times k}. \tag{12}$$

Algorithm 3 shows how S_k is used with the result of Algorithm 2 to evaluate R_k. Finally, Figure 2(c) shows that incorporating the alignment regularizer into the generator loss successfully aligns the trajectories produced by Algorithm 1 with the coordinate axes.

Algorithm 3 Procedure to evaluate the alignment penalty.
1: Require: k ∈ [1, m] is the number of leading eigenvectors to align with e_1, ..., e_k.
2: Require: mv, T are as defined in Algorithm 2.
3: procedure EvaluateAlignmentRegularizer(k, mv, T)
4:     Let S_k be given by Equation 12
5:     α ← 2/(k(k + 1))
6:     A ← diag(α · (k, k − 1, ..., 1))
7:     S_k ← S_k A    ▷ Reweight columns to prioritize alignment of leading eigenvectors
8:     V_0 ← RandomRademacher(m, k)
9:     Λ̂, V̂ ← EstimateLeadingEigenpairs(mv, V_0, T)
10:    return Sum(S_k ∘ V̂ ∘ V̂)
11: procedure RandomRademacher(p, q)
12:    return A ∈ R^{p×q}, where a_ij = 1 with probability 1/2 and a_ij = −1 with probability 1/2

If the alignment regularizer is implemented exactly as described by Equation 10, it will fail to have the intended effect. The reason for this has to do with the assumption behind the optimization used to run the k power iterations in parallel. Before attempting to align v_i with e_i, we assume that v_1, ..., v_{i−1} are already aligned with e_1, ..., e_{i−1}. When this assumption fails to hold, the projections computed using the columns of M_k will no longer be valid. Figure 5(a) shows that the matrix F does not have a diagonal in its top-left corner, which is what we would expect to see if v_1, ..., v_k were aligned with e_1, ..., e_k. Fortunately, there is a simple fix that remedies the situation. We would like to encourage the optimizer to prioritize alignment of v_i with e_i over alignment of v_{i+1}, ..., v_k with e_{i+1}, ..., e_k, for all i ∈ [1, k − 1]. A simple way to do this is to multiply the i-th column of S_k by a weight of (k − i + 1)α. We choose α based on the condition that these weights sum to one, i.e.,

$$\sum_{i \in [1, k]} i\alpha = 1, \quad \text{implying that} \quad \alpha = \frac{2}{k(k + 1)}. \tag{13}$$

This reweighting scheme is implemented in lines 5–7 of Algorithm 3. Figure 5(b) confirms that this modification induces the desired structure in the top-left corner of F.

Figure 4: Median relative cost per generator update as a function of the number of eigenvectors k to align, and the number of power iterations T to use for Algorithm 3. The cost is measured relative to the time required to make one RMSProp (Tieleman & Hinton, 2012) update to the parameters of a DCGAN generator. Complete details regarding the model architecture are given in Appendix C. Medians were computed using the initial update times; the median absolute deviations are small enough as to be indistinguishable from the medians on the plot.

Figure 5: Comparison of the top-left corner of the matrix F (Equation 6) for alignment-regularized GANs (k = 8, T = 8), trained with reweighting (left) and without reweighting (right) of the columns of the matrix S_k defined by Equation 12. Both GANs were trained on the CelebA dataset (Liu et al., 2015) with latent variable size 128. See Appendix C for complete details regarding the model architecture and training procedure.
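Putting the pieces together, the following NumPy sketch evaluates the reweighted alignment penalty in the manner of Algorithm 3, reusing estimate_leading_eigenpairs from the previous sketch; the sign matrix follows Equation 12 and the column weights Equation 13.

import numpy as np

def alignment_penalty(mv, m, k, T, rng=None):
    # Evaluate R_k: encourage the i-th leading eigenvector of M_z(z) to
    # align with e_i, prioritizing the leading eigenvectors.
    rng = rng or np.random.default_rng()
    S = np.ones((m, k))                        # S_k from Equation 12:
    S[np.arange(k), np.arange(k)] = -1.0       # -1 at entry (i, i), +1 elsewhere
    alpha = 2.0 / (k * (k + 1))                # Equation 13
    S = S * (alpha * np.arange(k, 0, -1))      # weight (k - i + 1) * alpha on column i
    V0 = rng.choice([-1.0, 1.0], size=(m, k))  # random Rademacher initialization
    _, V = estimate_leading_eigenpairs(mv, V0, T)
    return np.sum(S * V * V)

During GAN training, this value would be added to the generator loss with an appropriate penalty weight λ > 0, as described above for R_1.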
6. Results
We first make a quantitative comparison between our approach and previous methods that have been used to obtain disentangled representations. This requires us to measure the extent to which the second property of disentangled representations – namely, interpretability of changes resulting from perturbations to individual coordinates of a latent variable – holds. Suppose that we had knowledge of the ground-truth latent factors for the dataset, along with a simulator that can synthesize new outputs given assignments to these latent factors.
Then we could generate pairs of outputs, such that the outputs in each pair differ only along a single attribute. Let (x_1, x_2) be one such pair, and z_1 := G^{−1}(x_1) and z_2 := G^{−1}(x_2) the corresponding latent variables for the generator G : R^m → R^n. If G satisfies the second property, then we would expect z_1 and z_2 to be approximately equal along all components, except the one corresponding to the attribute that was varied. Hence, |z_1 − z_2| should be a one-hot vector in expectation. (Higgins et al., 2016) propose an evaluation metric based on this idea. It involves training a linear classifier to predict the index of the latent factor that was varied in order to generate each pair. At each step of training for the classifier, we sample a batch of input-target pairs in accordance with the procedure described in (Higgins et al., 2016), and update the classifier using the cross-entropy loss.

Table 1: Comparison of disentanglement metric scores for our method to those reported in (Higgins et al., 2016). Rows include the ground truth (100), raw pixels (45.75), InfoGAN, β-VAE, and our alignment-regularized GAN (k = 6). Details regarding the model architecture and training procedure are given in Appendix C.

Application of the evaluation metric to GANs is complicated by the fact that a direct procedure to invert the generator is usually not available. Models such as the β-VAE consist of an encoder that effectively functions as an inverse for the generative model, so this is not an issue. For the purpose of evaluation, we also train an encoder to invert the fixed generator, after GAN training has finished. This additional training procedure uses a standard autoencoding loss, and the details are specified in Appendix D. Table 1 compares our approach to the ones evaluated in (Higgins et al., 2016) on the dSprites dataset (Matthey et al., 2017), which we describe in Section 4. As stated in Appendix C, we added noise to both the real and fake inputs of the discriminator in order to stabilize GAN training on this dataset, for which the pixels are binary-valued. Incorporating this modification into the training procedure for InfoGAN may improve the score reported by (Higgins et al., 2016). Our approach achieves a score competitive with that of DC-IGN, which makes use of supervised information. We note that the encoder trained to invert the generator does not always succeed in finding a latent variable that yields an accurate reconstruction of the input. This may lead to a reduction in the disentanglement score, a problem that does not affect the other approaches, for which an accurate inference procedure is available.

Next, we make a qualitative comparison between our approach and the β-VAE. We train a series of GANs on the CelebA dataset (Liu et al., 2015), with k ∈ {8, 16, 32} for Algorithm 3 and varying values for the alignment regularizer weight λ. We also train a series of β-VAEs with a range of values for β. The results for our method with k = 32 are shown in Figure 7, and the results for β-VAE with β = 8 in Figure 8. Figure 6 shows the top-left corners of the matrix F given by Equation 6, for three configurations of our approach with k ∈ {8, 16, 32}. For the configuration with k = 32, the penalty weight was not sufficient to produce a clear diagonal structure in the top-left corner of the matrix F.
Nonetheless, this configuration resulted in the largest number of disentangled factors, and the results are shown in Figure 7. The best results for the configurations with k ∈ {8, 16} are shown in Appendix E. We found that the β-VAE configuration with β = 8 resulted in the largest number of disentangled factors, while still maintaining sufficiently low reconstruction error so as to keep the sample quality acceptable. These results are shown in Figure 8. In addition to better sample quality, our approach is able to learn concepts such as different kinds of hair styles (coordinates 9 and 10 in Figure 7) that are not modeled by the β-VAE.

Even with relatively large values for the penalty weight, the heatmaps shown in Figure 6 contain nonzero entries above and below the diagonal. This can result in some degree of leakage of the attribute controlled by one coordinate into the next. To see this in more detail, we examine the disentangled factors found by the configuration with k = 16, which are shown in Figure 12. Several groups of coordinates control overlapping attributes: some coordinates all involve changes in hair darkness; others all involve changes in gender; several all involve smiling; and two both involve changes in the location of the hair partition. This is a limitation of our current approach, and modifications to the implementation of the alignment regularizer in Algorithm 3 may help to mitigate this leakage. We plan to investigate such improvements in future work.
7. Conclusion
Our work approaches the problem of learning disentangled representations from the perspective of eigenvector alignment. We develop a novel regularizer which, when incorporated into the GAN objective, induces disentangled representations of quality comparable to those obtained by VAE-based approaches (Higgins et al. (2016), Kim & Mnih (2018), Chen et al. (2018), Esmaeili et al. (2018)), without the need to introduce auxiliary models into the training procedure (Chen et al., 2016).
This approach is also not specific to the GAN framework (Goodfellow et al., 2014), and could potentially be applied to autoregressive models, such as those used to generate text and audio. We believe this is an important direction for future investigation. So far, two different perspectives for viewing disentanglement have been proposed: maximizing mutual information and eigenvector alignment. An investigation into the relationship between the two could further our understanding of what disentanglement is and why it occurs.

Figure 6: Comparison of the top-left corners of the matrix F (Equation 6) for alignment-regularized GANs with k ∈ {8, 16, 32}: (a) k = 8; (b) k = 16; (c) k = 32. All GANs were trained with latent variable size 128. Details regarding the model architecture and training procedure are given in Appendix C.

Acknowledgements
We would like to thank Ryan-Chris Moreno-Vasquez and Emily Denton for early discussions that led us to think carefully about the paths determined by the leading eigenvectors of M_z(z). We are also grateful for the suggestions provided by Mikael Henaff, Alec Radford, Yura Burda, and Harri Edwards, which improved the quality of our presentation.

Figure 7: Disentanglement results for an alignment-regularized GAN (k = 32) with latent variable size 128 on the CelebA dataset. Details regarding the model architecture and training procedure are given in Appendix C. Results with additional samples for each coordinate can be found at this URL: https://drive.google.com/open?id=1E2TneLSAdyFYgN4GUzYKnHDYABo-G8RK

Coordinate    Description
1             Background darkness
2             Azimuth
3             Bangs
4             Gender, bangs, and smiling
5             Smiling
6             Hair color
7             Hair color and hair style
8             Lighting color and bangs
9             Hair color and hair style
10            Hair color and hair style
11            Jawline
12            Smiling and bangs
13            Background color
14            Age and hairline
15            Age
17            Location of hair partition
18            Lighting and skin tone
20            Mouth open
21            Mouth open

Figure 8: Disentanglement results for β-VAE (β = 8) with latent variable size 32 on the CelebA dataset. Details regarding the model architecture and training procedure are given in Appendix C. Results with additional samples for each coordinate can be found at this URL: https://drive.google.com/open?id=1Zu5H_M0dOuE2NYeOJ9WZFSux4U5qM5Pq

Coordinate    Description
1             Azimuth
2             Background color
3             Location of hair partition
4             Skin color
5             Hair direction
8             Background color
12            Hair length
14            Jawline
19            Skin tone
20            Hair style
22            Sunglasses
23            Gender
25            Background darkness
26            Hairline and skin tone
28            Lighting direction and rotation
32            Hair size, color, and smiling
References
Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pp. 265–283, 2016.

Aubry, M., Maturana, D., Efros, A. A., Russell, B. C., and Sivic, J. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3762–3769, 2014.

Baydin, A. G., Pearlmutter, B. A., Radul, A. A., and Siskind, J. M. Automatic differentiation in machine learning: A survey. Journal of Machine Learning Research, 18(153):1–153, 2017.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. arXiv preprint arXiv:1804.03599, 2018.

Chen, T. Q., Li, X., Grosse, R., and Duvenaud, D. Isolating sources of disentanglement in variational autoencoders. arXiv preprint arXiv:1802.04942, 2018.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2172–2180, 2016.

Esmaeili, B., Wu, H., Jain, S., Bozkurt, A., Siddharth, N., Paige, B., Brooks, D. H., Dy, J., and van de Meent, J.-W. Structured disentangled representations. stat, 1050:29, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. 2016.

Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

Kim, H. and Mnih, A. Disentangling by factorising. arXiv preprint arXiv:1802.05983, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proceedings of the International Conference on Computer Vision (ICCV), 2015.

Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually converge? In International Conference on Machine Learning, pp. 3478–3487, 2018.

Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.

Radford, A., Metz, L., and Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901–909, 2016.

Tieleman, T. and Hinton, G. Lecture 6.5—RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.

Townsend, J. A new trick for calculating Jacobian vector products. https://j-towns.github.io/2017/06/12/A-new-trick.html, 2017.

Xiang, S. and Li, H. On the effects of batch and weight normalization in generative adversarial networks. stat, 1050:22, 2017.
A. Forward- and Reverse-Mode Automatic Differentiation
The entries of J_G are not explicitly stored in memory: it is an implicitly-defined matrix. As such, information about J_G must be accessed via automatic differentiation (AD), of which there are two kinds: forward-mode and reverse-mode. We briefly describe them here, and refer the reader to (Baydin et al., 2017) for a more comprehensive survey.

Suppose that we are given a differentiable function f : R^m → R^n corresponding to a feedforward neural network with L layers, and let G := (V, E) be its representation as a computation graph. For each k ∈ [1, L], let R_k be the set of nodes corresponding to the k-th layer, and let r_k : R^{α_{k−1}} → R^{α_k} be the function computed by the nodes in R_k, with α_0 := m and α_L := n. Then f = r_L ∘ r_{L−1} ∘ ··· ∘ r_2 ∘ r_1. Finally, let x̄ ∈ R^m denote the variable for the input to f, and b̄_{k−1} ∈ R^{α_{k−1}} the variable for the input to r_k, with b̄_0 := x̄. Given a vector v ∈ R^m, forward-mode AD computes Jv. It works according to the recurrence

$$v_k := \left(D_{\bar{x}}\, (r_k \circ \cdots \circ r_1)\big|_{\bar{x}=x}\right) v = \left(D_{\bar{b}_{k-1}} r_k\big|_{\bar{b}_{k-1}=b_{k-1}}\right)\left(D_{\bar{x}}\, (r_{k-1} \circ \cdots \circ r_1)\big|_{\bar{x}=x}\right) v = \left(D_{\bar{b}_{k-1}} r_k\big|_{\bar{b}_{k-1}=b_{k-1}}\right) v_{k-1},$$

where k ∈ [1, L] and v_0 := v. After step k, we will have obtained the product of the Jacobian of r_k ∘ ··· ∘ r_1 with respect to x with v; after step L, we will have obtained the desired product Jv.

Forward-mode AD computes Jv far more efficiently than the approach of first evaluating J, and subsequently multiplying by v. Suppose for simplicity that α_k = n for all k ∈ [0, L], so that J ∈ R^{n×n}. To compute J independently, we would use the chain rule, which gives

$$D_{\bar{x}} f\big|_{\bar{x}=x} = \left(D_{\bar{b}_{L-1}} r_L\big|_{\bar{b}_{L-1}=b_{L-1}}\right)\left(D_{\bar{b}_{L-2}} r_{L-1}\big|_{\bar{b}_{L-2}=b_{L-2}}\right)\cdots\left(D_{\bar{b}_0} r_1\big|_{\bar{b}_0=x}\right).$$

Each layer r_k, for k ∈ [1, L], must compute its n × n Jacobian, and each layer after the first must multiply its Jacobian with the n × n Jacobian of the preceding layers with respect to x. This process is described by the recurrence

$$D_{\bar{x}}\, (r_k \circ \cdots \circ r_1)\big|_{\bar{x}=x} = \left(D_{\bar{b}_{k-1}} r_k\big|_{\bar{b}_{k-1}=b_{k-1}}\right)\left(D_{\bar{x}}\, (r_{k-1} \circ \cdots \circ r_1)\big|_{\bar{x}=x}\right).$$

Assuming that Θ(1) operations are required to compute each element of a Jacobian, the total number of operations required is proportional to N := n² + (L − 1)(n² + n³) ∈ Θ(Ln³), despite the fact that J has only n² elements. On the other hand, the k-th step of the forward-mode recurrence requires Θ(n²) operations, so computing Jv only requires Θ(Ln²) operations. Henceforth, we will measure operation count in terms of invocations to the AD engine, rather than by counting elementary operations.

Of the two kinds of AD, it is reverse-mode that finds the most use in machine learning. Given a vector w ∈ R^n, it computes J^t w. Most applications involve minimizing a scalar-valued loss function with respect to a vector of parameters, which corresponds to the case where n = 1. This special case is otherwise known as backpropagation. Unlike forward-mode, which begins at the first layer and ends at the last, reverse-mode begins at the last layer and ends at the first. It works according to the recurrence

$$w_k^t := w^t \left(D_{\bar{b}_{k-1}}\, (r_L \circ \cdots \circ r_k)\big|_{\bar{b}_{k-1}=b_{k-1}}\right) = w^t \left(D_{\bar{b}_k}\, (r_L \circ \cdots \circ r_{k+1})\big|_{\bar{b}_k=b_k}\right)\left(D_{\bar{b}_{k-1}} r_k\big|_{\bar{b}_{k-1}=b_{k-1}}\right) = w_{k+1}^t \left(D_{\bar{b}_{k-1}} r_k\big|_{\bar{b}_{k-1}=b_{k-1}}\right),$$

where k ranges from L to 1, and w_{L+1} := w. After L steps, we will have obtained the desired product J^t w. Machine learning frameworks typically expose the full interface to the reverse-mode AD engine, rather than specializing to the case of backpropagation. E.g., in TensorFlow (Abadi et al., 2016), one can change the value of w for tf.gradients from a vector of ones to a custom value specified by the grad_ys parameter.

If desired, we can compute the entire Jacobian using either type of AD. By running forward-mode with v = e_i for each i ∈ [1, m], we can form J column-by-column, using m total invocations to AD. Similarly, by running reverse-mode with w = e_i for each i ∈ [1, n], we can form J row-by-row using n total invocations to AD. If n > m, the former is typically faster; otherwise, the latter is preferable.

Figure 9: Comparison between the average squared eigenvector matrices F (Equation 6) for a GAN generator with latent variable size 16 (left) and the decoder from a β-VAE (β = 8) with latent variable size 32 (right), both trained on the CelebA dataset. Complete architecture and training details are provided in Appendix C. The matrix F for the VAE decoder contains several columns that are close to one-hot vectors. By contrast, the same matrix for the GAN generator exhibits little structure, despite the fact that it is trained with a small latent variable size.

Listing 1: TensorFlow implementation of Jacobian-vector operations.

import tensorflow as tf

def forward_gradients(ys, xs, d_xs):
    v = tf.placeholder_with_default(tf.ones_like(ys), shape=ys.get_shape())
    g = tf.gradients(ys, xs, grad_ys=v)
    return tf.gradients(g, v, grad_ys=d_xs)

def j_v(ys, xs, vs):
    return forward_gradients(ys, xs, vs)

def jt_v(ys, xs, vs):
    return tf.gradients(ys, xs, vs)

def jt_j_v(ys, xs, vs):
    jv = j_v(ys, xs, vs)
    return tf.gradients(ys, xs, jv)

def j_jt_v(ys, xs, vs):
    jt_v_ = jt_v(ys, xs, vs)
    return forward_gradients(ys, xs, jt_v_)
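As a hypothetical usage example (assuming TensorFlow 1.x and a toy one-layer generator; all shapes and names are illustrative), the functions in Listing 1 can be composed to evaluate the matrix-vector product M_z(z)v = J^tJv needed by the power iterations:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x assumed

m, n = 8, 64
z = tf.placeholder(tf.float32, shape=[m])     # latent variable
vec = tf.placeholder(tf.float32, shape=[m])   # direction in latent space
W = tf.get_variable('W', shape=[n, m])        # toy one-layer 'generator'
x = tf.tanh(tf.reshape(tf.matmul(W, tf.expand_dims(z, 1)), [n]))

mzv = jt_j_v(x, z, vec)  # M_z(z) v = J^t J v, built from Listing 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(mzv, {z: np.random.randn(m), vec: np.eye(m)[0]})

Note that jt_j_v composes the double-backward trick (one forward-mode product implemented with two reverse-mode passes) with one further reverse-mode pass, so no native forward-mode support is needed.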
We note that while AD offers a relatively efficient approach for evaluating J, doing so at each iteration of optimization becomes impractical.

Many popular machine learning frameworks (e.g., TensorFlow) do not implement forward-mode natively, since it is seldom used for machine learning. Surprisingly, this is not a limitation: reverse-mode can be used to implement forward-mode. Given a differentiable function f : R^m → R^n, we can compute w^t J for a given vector w ∈ R^n using reverse-mode AD. Treating the input to f as a constant, we can regard w^t J as a function g : R^n → R^m, w ↦ w^t J. The derivative of g with respect to w is given by J^t, and so another application of reverse-mode AD allows us to compute v^t J^t = (Jv)^t. Hence, reverse-mode AD can be used to implement forward-mode AD. This trick was first described by (Townsend, 2017). We provide a TensorFlow implementation of the procedures to compute Jv, J^t v, J^t Jv, and JJ^t v in Listing 1.

B. Eigenvector Alignment for β-VAEs

The results from Section 4 suggest that alignment of the eigenvectors of M_z(z) with the coordinate axes might be sufficient to induce disentanglement in the latent representation of a generative model. The β-VAE is known to learn such a representation when β is increased, so that the KL-divergence between the approximate posterior and the prior of the decoder is made sufficiently small. Suppose that a β-VAE exhibits disentanglement along the j-th coordinate direction in latent space.
Then paths of the form γ_j : t ↦ z + te_j will produce changes to G(z) along isolated factors of variation. If such a path coincides with the trajectory found by Algorithm 1 for the k-th leading eigenvector throughout latent space, then this eigenvector must be aligned with the j-th coordinate axis. Hence, we would expect the k-th column of the matrix F given by Equation 6 to be a one-hot vector close to e_j. Figure 9 shows that, in fact, several columns of F have high similarity to coordinate directions.

Next, we investigate whether a column of F having high similarity with a coordinate direction e_j implies that disentanglement actually occurs along γ_j : t ↦ z + te_j. Figure 10 shows that, with the exception of three coordinates having similarity greater than 0.35, this turns out not to be the case. In other words, several of the eigenvectors naturally align with coordinate directions during training, but disentanglement does not reliably occur along all of them. More work needs to be done in order to better understand the relationship between eigenvector alignment and disentangled representations.

Figure 10: Investigation of whether the matrix F can be used to determine which directions in latent space correspond to disentangled factors for a β-VAE. For each coordinate direction i, we compute the maximum value along the i-th row of the matrix F shown in Figure 9(b). Then, we plot a point indicating whether or not disentanglement occurs along this direction, as determined by visual inspection. The directions that do result in disentanglement are shown in Figure 8.

Table 2: GAN and β-VAE architectures used for all figures and tables. Note 1: weight normalization (Salimans & Kingma, 2016) was applied to both the generator and the discriminator, with the scale g fixed to one for the discriminator. Note 2: spectral weight normalization (Miyato et al., 2018) was applied to both the encoder and the decoder, with learned scales for both. Note 3: Gaussian noise was added to both real and fake inputs to the discriminator.

Figure                        DCGAN Base Feature Map Count    Latent Variable Size    Notes
13(c)                         64                              32                      1
13(e)                         64                              64                      1
13(g)                         64                              128                     1
13(d), 5, 4, 6(a), 6(b), 7    64                              128                     1
3, 11, 12                     128                             128                     1
13(f)                         64                              256                     1
13(h)                         64                              512                     1
2                             64                              3                       1, 3
1                             64                              10                      1, 3
9(a)                          64                              16                      1
9(b), 10                      64                              32                      2
C. Generator Architectures and Training Procedure

All models in this work are based on the DCGAN (Radford et al., 2015) architecture. Table 2 describes the architectures of the generator and discriminator (in the case of GANs) and of the encoder and decoder (in the case of VAEs) associated with each figure and table. All models use the translated PReLU activation function (Xiang & Li, 2017); the ReLU leaks are learned, and clipped to the interval [0, 1] after each parameter update. The only data preprocessing we applied was to scale the pixel values to the interval [0, 1]. The GANs were trained using the original, non-saturating GAN loss described in (Goodfellow et al., 2014) with a multivariate normal prior. Both the generator and the discriminator were trained using RMSProp (Tieleman & Hinton, 2012). The VAEs were trained using a Gaussian likelihood model for the decoder, and the log-diagonal covariance parameterization for the encoder. The fixed, per-pixel standard deviation of the decoder was chosen such that, disregarding constant terms, the log-likelihood corresponds to the reconstruction error, normalized by the product of the number of channels and the number of pixels. To optimize the evidence lower bound, we used Adam (Kingma & Ba, 2014). The KL-divergence weight β was increased linearly from zero to its final value over an initial portion of training.
We applied the function x ↦ (tanh(x) + 1)/2 to the outputs of both the GAN generators and the VAE decoders; this ensures that the output pixel values are in the interval [0, 1]. To predict the mean and log-diagonal covariance with the VAE encoders, we apply the translated ReLU activation function (Xiang & Li, 2017) to the final convolutional features, followed by two fully-connected layers, one for each statistic.

D. Implementation Details for Disentanglement Metric Score

Algorithm 4 Procedure to generate a batch for the disentanglement metric classifier.
Require: n_inst ≥ 1 is the number of aligned pairs used to create each instance in the batch.
Require: n_batch ≥ 1 is the batch size.
Require: Enc : R^n → R^m is the encoder functioning as the inverse of the generator G.
    c ← (3, 6, 40, 30, 30)    ▷ Ranges for the five attributes that determine each shape
procedure SampleShape(i, v)    ▷ Sample a shape with attribute i fixed to value v
    for j ∈ [1, 5] do
        u_j ← RandomUniform(1, c_j)
    u ← (u_1, u_2, u_3, u_4, u_5)
    u_i ← v
    return MakeShape(u)
procedure MakeInstance(n_inst, Enc)
    i ← RandomUniform(2, 5)    ▷ Sample the index of the attribute to fix
    v ← RandomUniform(1, c_i)    ▷ Sample a value for the fixed attribute
    z ← 0 ∈ R^m
    for k ∈ [1, n_inst] do
        z_1 ← Enc(SampleShape(i, v))
        z_2 ← Enc(SampleShape(i, v))
        z ← z + |z_1 − z_2|
    return z/n_inst, i
procedure MakeBatch(n_inst, n_batch, Enc)
    inputs ← ∅
    targets ← ∅
    for k ∈ [1, n_batch] do
        x, y ← MakeInstance(n_inst, Enc)
        inputs ← inputs ∪ {x}
        targets ← targets ∪ {y}
    return inputs, targets
Algorithm 4 describes the procedure used to generate the batches used to train and evaluate the classifier for the disentanglement metric (Higgins et al., 2016). We use n_inst = 64, and update the classifier using SGD with Nesterov accelerated gradient. We train the classifier for a fixed number of parameter updates, and evaluate its performance on held-out instances, as reported by (Higgins et al., 2016). To invert the pretrained GAN generator, we train a VAE encoder with twice the base feature map count of the generator. The details for this training procedure are identical to those for regular VAE training described in Appendix C, except that the generator is not updated, and the KL-divergence weight β is set to zero. In other words, we use a standard autoencoding loss with a fixed decoder.
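A Python sketch of the batch-generation procedure (Algorithm 4) is given below; make_shape (rendering a sprite from its attribute vector) and the encoder enc are stand-ins for components described above, and the attribute ranges follow Section 4.

import numpy as np

C = np.array([3, 6, 40, 30, 30])  # ranges of the five dSprites attributes

def sample_shape(make_shape, i, v, rng):
    # Sample a shape with attribute i held fixed to value v.
    u = np.array([rng.integers(1, c + 1) for c in C])
    u[i] = v
    return make_shape(u)

def make_instance(enc, make_shape, n_inst, rng):
    # Average |z1 - z2| over n_inst pairs that agree only on one attribute;
    # return the instance together with the fixed attribute's index.
    i = rng.integers(1, 5)          # attribute to fix (0-indexed; excludes symbol)
    v = rng.integers(1, C[i] + 1)   # value for the fixed attribute
    z = 0.0
    for _ in range(n_inst):
        z1 = enc(sample_shape(make_shape, i, v, rng))
        z2 = enc(sample_shape(make_shape, i, v, rng))
        z = z + np.abs(z1 - z2)
    return z / n_inst, i

def make_batch(enc, make_shape, n_inst, n_batch, rng=None):
    rng = rng or np.random.default_rng()
    pairs = [make_instance(enc, make_shape, n_inst, rng) for _ in range(n_batch)]
    inputs, targets = zip(*pairs)
    return np.stack(inputs), np.array(targets)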
E. Additional GAN Disentanglement Results
Figure 11: Disentanglement results for an alignment-regularized GAN (k = 8). Details regarding the model architecture and training procedure are given in Appendix C. Results with additional samples for each coordinate can be found at this URL: https://drive.google.com/open?id=1Wnd5jIxopFsBRlylMUN2HfWhICXGiU3u

Coordinate    Description
1             Background darkness and hair color
2             Azimuth, lighting, and hair color
3             Hairline and hair color
4             Azimuth
5             Shadow
6             Smiling, age, skin tone, gender
7             Smiling, age, jawline
8             Jawline and hairstyle
Figure 12: Disentanglement results for an alignment-regularized GAN (k = 16). Details regarding the model architecture and training procedure are given in Appendix C. Results with additional samples for each coordinate can be found at this URL: https://drive.google.com/open?id=1BM-P-hMF7sV_0smNFD_iTAKpq6FeNsCo

Coordinate    Description
1             Background and hair darkness
2             Azimuth and hair darkness
3             Hair length, hair darkness, and lighting
4             Smiling and gender
5             Hair length, hair darkness, and gender
6             Smiling, bangs, and gender
7             Smiling and hairstyle
8             Smiling, jawline, and glaring expression
9             Smiling, bangs, and mouth open
10            Hairline
11            Raised eyebrows and skin tone
12            Raised eyebrows and location of hair partition
13            Raised eyebrows and location of hair partition
14            Lighting color
F. Supplementary Figures
Figure 13: Effect of perturbing a latent variable z along the leading eigenvectors of M_z(z). Subfigures (a) and (b) show the top eigenvalues of M_z(z_i) at fixed embeddings z_i, with small eigenvalues omitted: (a) log spectra for CelebA models with n_f = 64 and m varied; (b) log spectra for LSUN Bedroom models with n_f = 256 and m varied. Subfigures (c)–(h) compare the effects of perturbations along random directions to perturbations along leading eigenvectors: (c) CelebA (m = 32, n_f = 64); (d) LSUN Bedroom (m = 128, n_f = 64); (e) CelebA (m = 64, n_f = 64); (f) LSUN Bedroom (m = 256, n_f = 64); (g) CelebA (m = 128, n_f = 64); (h) LSUN Bedroom (m = 512, n_f = 64); each panel shows two embeddings. Each subfigure consists of two stacked two-row grids. The leftmost images of each grid are identical and equal to G(z_i). The first row shows the effect of perturbing z_i along 13 directions sampled uniformly from the sphere of radius ε, while the second row shows the effect of perturbing z_i along the first 13 leading eigenvectors of M_z(z_i), by the same distance ε.