Using Deep LSD to build operators in GANs latent space with meaning in real space
J. Quetzalcóatl Toledo-Marín* and James A. Glazier
Biocomplexity Institute, Indiana University, Bloomington, IN 47408, USA
Luddy School of Informatics, Computing and Engineering, IN 47408, USA
(Dated: February 11, 2021)
* [email protected]

Generative models rely on the key idea that data can be represented in terms of latent variables which are uncorrelated by definition. Lack of correlation is important because it suggests that the latent space manifold is simpler to understand and manipulate. Generative models are widely used in deep learning, e.g., variational autoencoders (VAEs) and generative adversarial networks (GANs). Here we propose a method to build a set of linearly independent vectors in the latent space of a GAN, which we call quasi-eigenvectors. These quasi-eigenvectors have two key properties: i) they span all of latent space, ii) a set of these quasi-eigenvectors map to each of the labeled features one-on-one. We show that in the case of MNIST, while the number of dimensions in latent space is large by construction, 98% of the data in real space map to a sub-domain of latent space of dimensionality equal to the number of labels. We then show how the quasi-eigenvectors can be used for Latent Spectral Decomposition (LSD), which has applications in denoising images and in performing matrix operations in latent space that map to feature transformations in real space. We show how this method provides insight into the latent space topology. The key point is that the set of quasi-eigenvectors form a basis set in latent space and each direction corresponds to a feature in real space.
I. INTRODUCTION
Generative models (GMs) are a class of Machine Learning (ML) model which excel in a wide variety of tasks [1]. The optimization of a GM finds a function $G$ that maps a set of latent variables in latent space to a set of variables in real space representing the data of interest (e.g., images, music, video, etc.), i.e., $G: \mathbb{R}^M \to \mathbb{R}^d$ where $d > M \gg 1$. When building a GM, we first define the support of the latent variables, then obtain the function $G$ by optimizing a loss function. The choice of loss function depends on the application, e.g., maximum log-likelihood is common in Bayesian statistics [2], the Kullback–Leibler divergence is common for variational autoencoders (VAEs) [3], and the Jensen-Shannon entropy and the Wasserstein distance are common for generative adversarial networks (GANs) [4]. Latent variables have a simple distribution, often a separable distribution (i.e., $P(\{z_i\}_{i=1}^M) = \prod_{i=1}^M P(z_i)$). Thus, when we fit a latent variable model to a data set, we are finding a description of the data in terms of "independent components" [2]. Often the latent representation of data lives in a simpler manifold than the original data while preserving relevant information. For instance, Ref. [5] proposes a time-frequency representation of a signal that allows the reconstruction of the original signal, which relies on what they define as "consensus". Their proposed method generates sharp representations for complex signals. Trained deep neural networks can also function as surrogate propagators for the time evolution of physical systems [1].

While the latent variables are constructed to be independent identically distributed (i.i.d.) random variables, training entangles these latent variables. Latent variable disentanglement is an active area of research employing a wide variety of methods. For instance, in Ref. [6], the authors train a GAN including the generator's Hessian as a regularizer in the loss function, leading, in optimum conditions, to linearly independent latent variables, where each latent variable independently controls the strength of a single feature. Ref. [7] constructs a set of quantized vectors in the latent space using a VAE, known as a vector quantized variational autoencoder (VQ-VAE). Each quantized vector highlights a specific feature of the data set. This approach has been used in OpenAI's Jukebox [8]. A major drawback of these approaches is the lack of freedom in relating specific features in real space with specific latent space directions. This can be overcome by conditionalizing the generative model [9]. However, conditionalization can reduce the latent space smoothness and interpolation capacity, since the condition is usually enforced by means of discrete vectors as opposed to a continuous random latent vector.

Here we propose a method to relate a specific chosen labeled feature with specific directions in latent space such that these directions are linearly independent. Having a set of linearly-independent latent vectors associated with specific labeled features allows us to define operators that act on latent space (e.g., a rotation matrix) and correspond to feature transformations in real space. For instance, suppose a given data set in real space corresponds to the states of a molecular dynamics simulation, i.e., $|x_i\rangle \to |x(t_i)\rangle$, and suppose $|x(t_i)\rangle = G|z_i\rangle$ and $|x(t_i + \Delta t)\rangle = G|z_j\rangle$. How can we construct an operator in latent space, $O_{\Delta t}$, such that $|z_j\rangle = O_{\Delta t}|z_i\rangle$? For the construction to be possible, the operator $G$ must be locally linear. Furthermore, in order to build the operator $O$, we need a basis that spans latent space. While such linearity might seem counterintuitive given how NNs work, growing evidence suggests it holds in practice.
For instance, there is an ongoing debate on how deep a NN should be to perform a specific task. Moreover, an equivalence between deep NNs and shallow wide NNs has been proposed [10]. For at least one image-related GAN, simple vector arithmetic in latent space leads to feature transformations in real space (e.g., removal of sunglasses, change in hair color, gender, etc.) [11]. However, we still do not understand how specific features in real space map to latent space, how these features are arranged in latent space (latent space topology), or why some GANs behave like linear operators. The latent representation of data with a given labeled feature forms a cluster. However, the tools employed to show this clustering effect quite often consist of a dimensional reduction, e.g., t-SNE [12] collapses the latent representation into two or three dimensions. Other methods include principal component analysis, latent component analysis and independent component analysis [2, 13, 14]. Our method does not collapse or reduce the latent space, allowing us to inspect latent space topology by spanning all latent space directions. We demonstrate the method by applying it to MNIST.

FIG. 1. Schematic of spaces and operators. $P$ is an operator in real space that evolves the state $|x(t_i)\rangle$ to $|x(t_i + \Delta t)\rangle$. $E$ is an Encoder and $G$ is the Generator that maps latent variables to real space. $O$ is an operator in latent space. The black arrow shows the time propagation done by applying the operator $P$ to $|x(t_i)\rangle$, which yields $|x(t_i + \Delta t)\rangle$. The blue arrows show the path where the data $|x(t_i)\rangle$ gets encoded into latent space, $|z_i\rangle$, then the operator $O$ is applied to the latent vector, yielding a new latent vector $|z_j\rangle$. Finally, the new latent vector gets decoded and yields $|x(t_i + \Delta t)\rangle$.

In the next section we introduce our mathematical methods and notation. In Section III we apply the method to the MNIST data set. In Section IV we show how we can use this method to understand the topology of the latent space. In Section V we apply this method to denoise images. In Section VI we show how we can perform matrix operations in latent space which map to image transformations in real space.
II. METHOD
Assume a vector space which we call real space and denote the vectors in this space $|x\rangle$, with $|x\rangle \in \mathbb{R}^d$. Assume a set $\{|x_i\rangle\}_{i=1}^N$, which we call the dataset, with $N$ the dataset size. Similarly, we assume a vector space, which we call the latent space, and denote these vectors $|z\rangle$, with $|z\rangle \in \mathbb{R}^M$ (in general, $M \le d$). We also consider three deep neural networks, a Generator $G$, an Encoder $E$ and a Classifier $C$. We can interpret $G$ as a projector from latent space to real space, i.e., $|x_i\rangle = G|z_i\rangle$, and interpret $E$ as the inverse of $G$. However, given the architecture of variational autoencoders, notice that if $|z_a\rangle = E|x_i\rangle$ and $|z_{a'}\rangle = E|x_i\rangle$, in general $|z_a\rangle \ne |z_{a'}\rangle$, since these vectors are i.i.d. vectors sampled from a Gaussian distribution with mean and standard deviation dependent on $|x_i\rangle$ [3]. Finally, the Classifier projects real-space vectors into the label space, i.e., $|y_k\rangle = C|x_i\rangle$, where $|y_k\rangle \in L$ and $L$ denotes the label space. We assume that each vector $|y_k\rangle$ is a one-hot vector. The length of $|y_k\rangle$ equals the number of labels $|L| = l$ and $k = 1, \ldots, l$. Henceforth, we assume that $l < M$.

We define $\{|\xi_i\rangle\}_{i=1}^M$ to be a set of basis vectors in latent space such that $\langle\xi_i|\xi_j\rangle = C\delta_{ij}$. Henceforth we call the set of basis vectors $\{|\xi_i\rangle\}_{i=1}^M$ the quasi-eigenvectors, since they form a basis and each one represents a feature state in latent space. Notice that we can define the operator $A = \sum_{j=1}^M |\xi_j\rangle\langle\xi_j|$, which implies $A|\xi_i\rangle = C|\xi_i\rangle$. Any vector in latent space can be expressed as a linear superposition of these quasi-eigenvectors, viz.,
$$|z\rangle = \sum_{j=1}^{M} c_j\,|\xi_j\rangle, \qquad (1)$$
where $|c_i| = |\langle\xi_i|z\rangle|$ is the amplitude of $|z\rangle$ with respect to $|\xi_i\rangle$ and gives a measure of $|z\rangle$'s projection onto the quasi-eigenvector $|\xi_i\rangle$. Constructing a set of basis vectors is straightforward. However, we wish each labeled feature to correspond one-to-one with a quasi-eigenvector. Since we are assuming that $l < M$, there will be a set of quasi-eigenvectors that do not correspond to any labeled feature.

To obtain a set of orthogonal quasi-eigenvectors, we use the Gram-Schmidt method. Specifically:

1. We train the GAN, using the training set $\{|x_i\rangle\}_{i=1}^N$, as in Ref. [4].

2. We train the Classifier independently, using the training set.

3. We train a VAE using the trained Generator as the decoder. We also use the Classifier to classify the output of the VAE. We include in the loss function a regularizer $\lambda \cdot \mathcal{L}_{\mathrm{class}}$, where $\lambda$ is a hyperparameter and $\mathcal{L}_{\mathrm{class}}$ denotes the Classifier's loss function. At this stage, we only train the Encoder, keeping the Generator and Classifier fixed.

4. Define $n$ to be an integer such that $M = n \times l$. Then, for each label, we allocate $n$ sets of latent vectors and denote them $|z^k_{\alpha,i}\rangle$, where $\alpha$ denotes the label, $i = 1, \ldots, n$ and $k = 1, \ldots, V$. Here $V$ is the number of elements (latent vectors) in each set corresponding to the pair $(i, \alpha)$. We build these sets $\{|z^k_{\alpha,i}\rangle\}$ in two ways:
   (a) Using the training set, we encode each vector $|x_i\rangle \to |z_i\rangle = E|x_i\rangle$, then we decode the latent vector, i.e., $|z_i\rangle \to |x_i\rangle = G|z_i\rangle$, and then we classify the output, i.e., $|x_i\rangle \to |y_i\rangle = C|x_i\rangle$. For each label, there is a set of latent vectors.
   (b) We generate random latent vectors and map them to their labels using the Generator and the Classifier as in 4(a).
   We denote as $V$ the number of latent vectors in each set.

5. We take the average over $V$ for each set of latent vectors $\{|z^k_{\alpha,i}\rangle\}_{k=1}^V$ and denote that average $|\eta\rangle_{\alpha,i}$, i.e.,
$$|\eta\rangle_{\alpha,i} = \frac{1}{V}\sum_{j=1}^{V} |z^j_{\alpha,i}\rangle. \qquad (2)$$
Since the latent vectors are sampled from a multivariate Gaussian distribution, the average $|\eta\rangle_{\alpha,i}$ is finite and unbiased. By defining operators in latent space in terms of outer products of the $|\eta\rangle_{\alpha,i}$ vectors, these latent space operators will have encoded in them the set of latent vectors $|z^k_{\alpha,i}\rangle$.

6. To impose orthogonality, we use the Gram-Schmidt method. Thus, from the vectors $|\eta\rangle_{\alpha,i}$ we generate a set of quasi-eigenvectors $|\xi\rangle_{\alpha,i}$, i.e.,
$$|\xi\rangle_{1,1} = |\eta\rangle_{1,1}, \qquad (3)$$
$$|\xi\rangle_{1,2} = |\eta\rangle_{1,2} - \frac{{}_{1,2}\langle\eta|\xi\rangle_{1,1}}{{}_{1,1}\langle\xi|\xi\rangle_{1,1}}\,|\xi\rangle_{1,1}, \qquad (4)$$
$$\vdots \qquad (5)$$
$$|\xi\rangle_{l,n} = |\eta\rangle_{l,n} - \sum_{\alpha=1}^{l-1}\sum_{i=1}^{n-1} \frac{{}_{l,n}\langle\eta|\xi\rangle_{\alpha,i}}{{}_{\alpha,i}\langle\xi|\xi\rangle_{\alpha,i}}\,|\xi\rangle_{\alpha,i}, \qquad (6)$$
such that
$$ {}_{\alpha,i}\langle\xi|\xi\rangle_{\beta,j} = C\,\delta_{\alpha\beta}\,\delta_{ij}. \qquad (7)$$
In Eq. (7), $C$ is the value of the norm.

The set of quasi-eigenvectors $\{|\xi\rangle_{\alpha,i}\}_{\alpha=1,i=1}^{l,n}$ spans the latent space and, as we will show, a subset of them map to specific features. The key point is that the set of quasi-eigenvectors form a basis set in latent space and each direction corresponds to a feature in real space. This structure allows us to give a better topological description of latent space, i.e., how labeled features map to latent space, similar to how molecular configurations map to the energy landscape [15]. In addition, we can use the set of quasi-eigenvectors as tools for classification, denoising and topological transformations. We demonstrate these applications in the next section using the MNIST dataset.
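To make steps 5 and 6 concrete, the sketch below averages the latent vectors of each (label, set) pair and orthogonalizes the averages with standard Gram-Schmidt. The function name, the array layout of `zsets`, and the convention of rescaling every quasi-eigenvector to $\langle\xi|\xi\rangle = C$ are illustrative assumptions and are not taken from the code of Ref. [18].

```julia
# Sketch of steps 5-6 (illustrative names and array layout).
# zsets[:, k, a, i] holds the k-th latent vector |z^k_{α,i}⟩ of the set for
# label α = a and set index i, built as in steps 4(a)/4(b).
using LinearAlgebra, Statistics

function quasi_eigenvectors(zsets::Array{Float64,4}; C = size(zsets, 1))
    M, V, l, n = size(zsets)
    @assert M == l * n
    # Step 5: per-set averages |η⟩_{α,i}, stored as columns in label-major order.
    η = reduce(hcat, [vec(mean(zsets[:, :, a, i]; dims = 2)) for a in 1:l for i in 1:n])
    # Step 6: Gram-Schmidt on the columns of η, rescaled so that ⟨ξ|ξ⟩ = C (Eq. 7).
    ξ = zero(η)
    for j in 1:M
        v = η[:, j]
        for p in 1:j-1
            v -= (dot(ξ[:, p], η[:, j]) / dot(ξ[:, p], ξ[:, p])) * ξ[:, p]
        end
        ξ[:, j] = sqrt(C) * v / norm(v)
    end
    return ξ
end
```

For the MNIST setup of the next section ($M = 100$, $l = 10$, $n = 10$, $V = 5000$), `zsets` would be a $100 \times 5000 \times 10 \times 10$ array and the returned matrix holds the $M$ quasi-eigenvectors $|\xi\rangle_{\alpha,i}$ as columns.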
III. MODEL

We trained a GAN, a Classifier and a VAE using the MNIST dataset, which has 60k and 10k one-channel images in the training and test set, respectively, with dimensions 28 × 28 pixels. We fixed the batch size to 25 and the number of epochs to 500 during all training runs. We trained the GAN using the training set, used the Jensen-Shannon entropy as the loss function [4] and the ADAM optimizer with hyperparameters $\eta = 0.\ldots$, $\beta_1 = 0.\ldots$, $\beta_2 = 0.999$ for both the Generator and the Discriminator, fixed the latent space dimensionality to $M = 100$, and sampled the random latent vectors from a multivariate Gaussian distribution centered at the origin with standard deviation equal to 1 in all $M$ dimensions. Independently, we trained a Classifier on the training set, used cross-entropy as the loss function and a softmax as the activation function in the last layer, with the ADAM optimizer with hyperparameters $\eta = 3\cdot10^{-\ldots}$, $\beta_1 = 0.\ldots$, $\beta_2 = 0.99$. The accuracy of the classifier on the test set reached ≈ 98.9%. We then trained the Encoder as described in step 3, with the regularizer hyperparameter set to $\lambda = 100$. During the training of the Encoder, we kept both the Generator and the Classifier fixed. In Fig. 2 we show the training results. To train the NNs we used Flux [16] in Julia [17] and the code can be found in Ref. [18].

FIG. 2. A batch from a) the dataset, b) the same dataset encoded and decoded using the Generator as Decoder and c) random latent vectors given as input to the Generator.

The latent space dimension is $M = 100$, while the number of labels is $|L| = 10$. Thus, following step 4, for each label we generated $n = M/|L|$ sets of latent vectors, each set containing $V = 5000$ latent vectors. In Fig. 3(a) we show sample latent vectors for several labels, projected into real space through the Generator. Then we take the average over each set as in step 5. We checked that the average and standard deviation over each of the entries in the set of vectors $\{|\eta\rangle_{\alpha,i}\}_{\alpha,i}$ converge. Interestingly, when taking the average over the set of latent vectors corresponding to a label and projecting back to real space, the label holds. For instance, in Fig. 3(b) we show the projected image of the average over $V$ for each set of latent vectors $\{|z^k_{\alpha,i}\rangle\}_{k=1}^V$ in the case where the latent vectors were obtained following step 4(a), whereas Fig. 3(c) corresponds to the case following step 4(b). We have also plotted the probability density function (PDF) per label in latent space for both cases and added a Gaussian distribution with mean and standard deviation equal to 0 and 1, respectively, for reference. Notice that the PDFs in Fig. 3(b) are shifted away from the Normal distribution, whereas in Fig. 3(c) all PDFs are bounded by the Normal distribution, because latent vectors generated directly from latent space are, by definition, sampled from a multivariate Gaussian distribution with mean and standard deviation equal to 0 and 1, respectively. On the contrary, encoding real-space vectors yields Gaussian vectors overall (i.e., the PDF over all latent vectors over all labels yields a Gaussian distribution, by definition), but the mean and standard deviation can differ from 0 and 1 [3]. Step 4(a) gives robustness to this method and step 4(b) allows us to generate as many latent vectors as we want with a specific label. Since the latent space dimension is $M = 100$, we need $M$ averaged latent vectors $|\eta\rangle_{\alpha,i}$ to generate $M$ orthogonal latent vectors. Since the number of labels is $\alpha = \{0, \ldots, |L| - 1\}$, then $n = 10$. To this end, we generate one set (i.e., $i = 1$) following step 4(a) and nine sets (i.e., $i = 2, 3, \ldots, n$) following step 4(b).
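For concreteness, the sketch below implements one possible version of step 4(b): sample random latent vectors, decode them with the Generator, label the outputs with the Classifier, and keep $V$ vectors per label. The callables `G` and `C` stand for the trained Generator and Classifier; their names and signatures are placeholders rather than the interface of the code in Ref. [18].

```julia
# Step 4(b) sketch: build one set of V labeled latent vectors per label.
# G(z) maps an M-dimensional latent vector to an image; C(x) returns a
# probability vector over the l labels (both are placeholder callables).
function labeled_latent_sets(G, C; M = 100, l = 10, V = 5000)
    sets = [Vector{Vector{Float64}}() for _ in 1:l]
    while any(length(s) < V for s in sets)
        z = randn(M)                      # z ~ N(0, I_M), as during GAN training
        label = argmax(C(G(z)))           # predicted label, index in 1:l
        length(sets[label]) < V && push!(sets[label], z)
    end
    return sets                           # sets[a] holds V latent vectors for label a
end
```

Repeating this $n - 1$ times gives the sets $i = 2, \ldots, n$, which together with the encoded training-set vectors from step 4(a) ($i = 1$) form the full collection $\{|z^k_{\alpha,i}\rangle\}$ used in step 5.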
Fig. 4(a) shows the projection to real space of all the $|\eta\rangle_{\alpha,i}$ vectors, while Fig. 4(c) shows the inner product ${}_{\alpha,i}\langle\eta|\eta\rangle_{\alpha',i'}$ as a heatmap, which shows they are non-orthogonal. At this point, we have $M$ vectors $|\eta\rangle_{\alpha,i}$ in latent space that i) are composed of the sum of $V$ latent vectors and ii) each map to a specific feature. However, these vectors are not orthogonal. Using the Gram-Schmidt method described in step 6, we obtain a set of vectors $|\xi\rangle_{\alpha,i}$ in latent space such that i) each $|\xi\rangle_{\alpha,i}$ vector encodes $V$ latent vectors, ii) each $|\xi\rangle_{\alpha,i}$ vector maps to a specific labeled feature (see Fig. 4(b)), and iii) the $|\xi\rangle_{\alpha,i}$ vectors are orthogonal, as shown in Fig. 4(d). Since the Generator was trained using random vectors sampled from a multivariate Gaussian distribution centered at zero with standard deviation 1, the value of the norm of any random latent vector will be $\langle z|z\rangle \approx M$. Therefore, we fixed the norm of the quasi-eigenvectors to be $C = M$ (see Eq. (7)).

Notice that while the non-orthogonal vectors $|\eta\rangle_{\alpha,i}$ for the MNIST GAN project sharp images of easily-identifiable numbers in real space, not all quasi-eigenvectors project to images of numbers in real space. Only a few of the $M$ linearly-independent directions in latent space (≈ 20) project to images of numbers in real space. We will show how to apply this property of the quasi-eigenvectors to the MNIST test set to classify images in latent space and to denoise real-space images. We also show how to build a rotation operator in latent space that generates feature transformations in real space.

FIG. 3. a) Samples of latent vectors $|z^k_{\alpha,i}\rangle$, for labels $\alpha = 0, \ldots$, such that $G|z^k_{\alpha,i}\rangle$ yields images of numbers with label $\alpha$. We show 1200 latent vectors per label, projected into real space. Averaging over the latent vectors per label yields $|\eta\rangle_{\alpha,i}$. b) Left panel: decoded latent vectors $|\eta\rangle_{\alpha,i}$, obtained as described in step 4(a). Right panel: the histogram for each label is Gaussian with non-zero mean. c) Left panel: decoded latent vectors $|\eta\rangle_{\alpha,i}$, obtained as described in step 4(b). Right panel: the histogram for each label is Gaussian with zero mean.
FIG. 4. a) Projection to real-space images of the latent vectors $\{|\eta\rangle_{\alpha,i}\}$ obtained as described in step 5. b) Projection to real-space images of the quasi-eigenvectors $\{|\xi\rangle_{\alpha,i}\}$ obtained as described in step 6. The $\alpha$ index corresponds to the label (row) while the $i$ index corresponds to the set (column). c) The inner product of the vectors $\{|\eta\rangle_{\alpha,i}\}$. d) The inner product of the quasi-eigenvectors $\{|\xi\rangle_{\alpha,i}\}$.

IV. USING LSD AS A CLASSIFIER IN LATENT SPACE
We can express any latent vector $|z\rangle$ in terms of the quasi-eigenvectors, viz.,
$$|z\rangle = \sum_{k=1}^{M} c_k\,|\xi_k\rangle, \qquad (8)$$
where the coefficients $c_k$ are given by
$$c_k = \langle\xi_k|z\rangle / C. \qquad (9)$$
Similar to principal component analysis, we are interested in how much information about an image is encoded in the quasi-eigenvector with the largest amplitude $|c_i|$. We encode images from the MNIST test set into latent space, then express the latent vectors in terms of the quasi-eigenvectors (we call this expression latent spectral decomposition or LSD) and find the maximum amplitude $|c_i|$ for each latent vector. Recall that the amplitude $|c_i|$ is a measure of the projection of the latent vector onto the quasi-eigenvector $|\xi_i\rangle$. Thus, the largest amplitude corresponds to the quasi-eigenvector that contributes the most to the latent vector. Since the quasi-eigenvectors are associated with labeled features in real space, we use the largest amplitude as a way to classify the image. Fig. 5(a) shows a sample batch of 25 images. The blue dots correspond to the true labels (see y axis), while the green (red) dots correspond to the case where the label associated with the quasi-eigenvector with the largest amplitude is the correct (incorrect) label. In this batch, only batch elements 9 and 22 have true labels that do not agree with the label of the quasi-eigenvector with the largest amplitude. Since each time the Encoder encodes an image it generates a new random latent vector, we could obtain a different outcome for batch elements 9 and 22, as well as for the rest of the batch elements, in each trial. For this reason, we perform an ensemble average over 20 trials. For each trial we take the whole MNIST test set and compute the accuracy of the latent spectral decomposition (LSD) classifier (see red dots in Fig. 5(b)). We also computed the accuracy when the test set is encoded through the Encoder, then decoded through the Generator and finally classified (see blue dots in Fig. 5(b)). We have included the accuracy of the trained Classifier in Fig. 5(b) as an upper bound. While the trained Classifier has an accuracy of ≈ 98.9%, the LSD classifier correctly labels ≈ 92% of the test-set images.
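The classification rule can be written in a few lines: encode an image, compute the LSD coefficients of Eq. (9), and return the label associated with the quasi-eigenvector of largest amplitude. The encoder handle `E`, the label-major column layout of `ξ` (as in the earlier sketch), and the mapping from column index to label are illustrative assumptions, not the interface of Ref. [18].

```julia
# LSD classifier sketch: predict the label of an image x from the
# quasi-eigenvector with the largest amplitude |c_k| (Eqs. 8-9).
using LinearAlgebra

function lsd_classify(E, ξ::Matrix{Float64}, x; C = size(ξ, 1), n = 10)
    z = E(x)                         # encode the image into latent space
    c = (ξ' * z) ./ C                # coefficients c_k = ⟨ξ_k|z⟩ / C
    k = argmax(abs.(c))              # index of the largest amplitude
    return (k - 1) ÷ n               # label in 0:9 for label-major columns
end
```

Because the Encoder samples a new latent vector each time, repeated calls on the same image can give different predictions, which is why the accuracy in Fig. 5(b) is reported as an ensemble average over trials.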
What happens if we sort the amplitudes, $|c_1| > |c_2| > \ldots > |c_M|$, and ask for the position of the ground-truth label? As previously mentioned, in 92% of the cases the ground-truth label corresponds to the first position (i.e., $|c_1|$). In 5% of the cases the ground-truth label corresponds to the second-largest amplitude (i.e., $|c_2|$). In Fig. 5(c) we have plotted the cumulative probability of the ground-truth label being in any of the first $n$ positions. The dashed red line corresponds to the trained Classifier accuracy. Notice that the probability of the label being in position 1, 2, 3 or 4 of the LSD equals the accuracy of the trained classifier, i.e., in 98.9% of the MNIST test-set images the ground-truth label is associated with a quasi-eigenvector whose coefficient is either $c_1$, $c_2$, $c_3$ or $c_4$. In this sense, it is possible that even when the amplitude of the quasi-eigenvector associated with the ground-truth label is not the largest one, but rather the 2nd or 3rd largest one, then $|c_1| \gtrsim |c_2|$ or $|c_1| \gtrsim |c_2| \gtrsim |c_3|$. To test this idea, in Fig. 6 we have plotted the normalized amplitude (i.e., $|c_i|/\max\{|c_j|\}$) vs the rank (i.e., sorted amplitudes from largest to smallest) for all images in the test set. Fig. 6(a) corresponds to the images where the LSD amplitude of the quasi-eigenvector associated with the ground-truth label is the largest, whereas in Fig. 6(b) and (c) the amplitude is the 2nd largest or 3rd largest, respectively. Given the large dataset, in Fig. 6(d), (e) and (f) we have plotted the PDFs of the 2nd-, 3rd- and 4th-largest amplitudes for each of plots Fig. 6(a), (b) and (c). To be clear, from Fig. 6(a), (b) and (c) we generated PDFs for the second-, third- and fourth-largest amplitudes in each plot and show the PDFs in Figs. 6(d), (e) and (f), respectively. Notice that when the largest amplitude corresponds to the ground-truth label (Fig. 6(a)), the second-, third- and fourth-largest amplitude PDFs are centered below 0.6 (Fig. 6(d)). When the second-largest amplitude corresponds to the ground-truth label (Fig. 6(b)), the PDF of the second-largest amplitude is shifted towards 1, while the PDFs of the third- and fourth-largest amplitudes are centered below 0.7 (Fig. 6(e)). Finally, in the case where the third-largest amplitude corresponds to the ground-truth label (Fig. 6(c)), the PDFs of the second- and third-largest amplitudes are shifted towards 1, while the PDF of the fourth-largest amplitude is centered below 0.7 (Fig. 6(f)).

FIG. 5. a) A batch of the MNIST test set classified by LSD using the largest amplitude. The largest amplitude $|c_i|$ corresponds to the quasi-eigenvector $|\xi_i\rangle$ that contributes the most to the latent vector $|z\rangle$, and a subset of the quasi-eigenvectors map to each label one-on-one. The Y axis corresponds to the label, the X axis to the image in the batch. Blue dots, ground truth. Green (red) dots correspond to the case(s) where the label associated with the quasi-eigenvector with the highest amplitude is the correct (incorrect) label. b) Accuracy for different trials using the MNIST test set. The green curve is the Classifier's accuracy (≈ 98.9%); the blue and red dots show the encode-decode-classify accuracy and the LSD-classifier accuracy, respectively. c) Cumulative probability of the ground-truth label being in any of the $n$ first largest amplitudes (X axis). For $n = 1$ the probability is 92%. The probability of the ground-truth label being one of the labels with the 4 largest amplitudes is ≈ 98.9%.

The previous results give us a broad picture of latent space topology: the labeled features project to well-defined compact domains in latent space. Let us now consider how we can use this information to denoise images.
V. DENOISING WITH LSD

The main issue when reducing noise in images is distinguishing noise from information. In this sense, a reliable denoiser has to learn what is noise and what is not. One reason deep generative models are promising for denoising data is that, in optimum conditions, the GM has learned the exact data distribution. Of course, if the data set has noise, the GM will also learn the noise embedded in the data set. However, by sampling the latent space we may find regions where the signal-to-noise ratio is sufficiently large. For large $M$, this sampling is computationally expensive. To avoid this cost, we propose to use LSD as a denoiser.

FIG. 6. LSD normalized amplitude ranking for the cases where the true label corresponds to the largest-amplitude quasi-eigenvector a), the second-largest amplitude b) and the third-largest amplitude c). Probability density function of the second-, third- and fourth-largest LSD normalized amplitude when the true label corresponds to the largest amplitude d), the second-largest amplitude e) and the third-largest amplitude f).

Recall that in the previous section we showed that with ≈ 98% probability the information needed to assign a label to the image is stored in either the first-, second-, third- or fourth-largest amplitude of the LSD. Therefore, we propose that once the test set is encoded into latent space, we decompose the latent vector in terms of the quasi-eigenvectors and drop the contribution from quasi-eigenvectors with low amplitudes. In Fig. 7 we show the results of this truncation for 125 random sample images. In Fig. 7(a) we describe how to understand these images. Fig. 7(b) shows 5 columns, where each column has 25 rows and each row has 7 images. In each row, the first image corresponds to the ground-truth image, and the second image is the image decoded from all 100 LSD components of the ground-truth image. The third, fourth, fifth, sixth and seventh images are the images decoded after truncating the expansion at 1, 2, 3, 4 and 10 LSD components of the ground-truth image. In this method, denoising maintains the identity of the labeled feature in the image, e.g., each row shows different representations of the same number. In most cases in Fig. 7, the denoised image looks clearer and sharper. However, sometimes the LSD components project back to the wrong number. Since we can consider as many LSD components as the dimension of the latent space, even if taking the first $n$ LSD components yields the wrong number, taking the first $n + 1$ LSD components could yield the correct number. In the previous section we showed that using only the first 4 LSD components gave us a 98.9% chance of obtaining the right number.
VI. OPERATIONS IN LATENT SPACE
Here we explore how to build operators in latent space that yield feature transformations in real space. Having a set of orthogonal vectors that span latent space allows us to perform most operations in latent space as a series of rotations, since we can express the operator as a superposition of outer products of the quasi-eigenvectors. If we construct a rotation matrix, $R$, in latent space, we can then recursively apply $R$ to a set of encoded images. After each iteration we project the output to real space to see the effect of the latent-space rotation. We can define a projection operator $B_{\xi_i,\xi_j}$ such that
$$B_{\xi_i,\xi_j} = \frac{1}{\langle\xi_i|\xi_i\rangle}\,|\xi_j\rangle\langle\xi_i|. \qquad (10)$$
This operator projects from $|\xi_i\rangle$ to $|\xi_j\rangle$, i.e., $B_{\xi_i,\xi_j}|\xi_k\rangle = \delta_{\xi_i,\xi_k}|\xi_j\rangle$, where $\delta_{\xi_i,\xi_k}$ denotes the Kronecker delta function. Similarly, we define the operator $R_{\xi_i,\xi_j}(\Delta\theta, \theta)$ as
$$R_{\xi_i,\xi_j}(\Delta\theta, \theta) \propto \bigl(\cos(\theta + \Delta\theta)|\xi_i\rangle + \sin(\theta + \Delta\theta)|\xi_j\rangle\bigr)\bigl(\langle\xi_i|\cos(\theta) + \langle\xi_j|\sin(\theta)\bigr), \qquad (11)$$
which projects from $\cos(\theta)|\xi_i\rangle + \sin(\theta)|\xi_j\rangle$ to $\cos(\theta + \Delta\theta)|\xi_i\rangle + \sin(\theta + \Delta\theta)|\xi_j\rangle$.

FIG. 7. a) Image of the number 5 taken from the MNIST test set. The first image corresponds to the ground truth (GT), the second image corresponds to the projected image of the 100 LSD components, and the third, fourth, fifth, sixth and seventh images correspond to the projected images from the sum of one, two, three, four and ten LSD components, respectively. b) 125 samples from the MNIST test set. Each sample is a row with seven images as shown in a).

Starting from a set of images with label zero, we first encoded them to latent space, then we applied the rotation operator $R$ recursively, as follows. First, we perform the rotation from the quasi-eigenvector associated with label 0 to the quasi-eigenvector associated with label 1, viz., $R_{\xi_{\alpha=0,i=1},\,\xi_{\alpha=1,i=1}}(\Delta\theta, \theta)$. Then, we perform a rotation from the quasi-eigenvector associated with label 1 to the quasi-eigenvector associated with label 2, viz., $R_{\xi_{\alpha=1,i=1},\,\xi_{\alpha=2,i=1}}(\Delta\theta, \theta)$, and repeat mutatis mutandis until we reach the quasi-eigenvector associated with label $\alpha = 9$. To keep the individual rotations in latent space small (and maintain the local linearity of the transforms), we fixed the rotation step size to a small value, $\Delta\theta \approx \pi/\ldots$, and after each rotation we normalized the latent vector $|z\rangle$ by $\sqrt{\langle z|z\rangle / M}$. After each iteration, we project the latent vector into real space. In Fig. 8 we show this projection for a set of sample images. Notice how the numbers transform from 0 to 9. In principle, we could rotate through any other set of sequential features in this way. The key idea is that, having a set of quasi-eigenvectors that span latent space, each mapping to a specific label, we can define a metric in latent space giving the distance between the latent-space representations of the labels. The procedure is summarized in Algorithm 1.
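As an illustration, the sketch below applies a rotation by $\Delta\theta$ in the plane spanned by two quasi-eigenvectors and then rescales the result to $\langle z|z\rangle = M$, as in the text. Note that this is a full plane (Givens-like) rotation, which acts like Eq. (11) on the component of $|z\rangle$ lying in the $(\xi_i, \xi_j)$ plane while leaving the orthogonal component unchanged; this choice, like the function name, is an assumption for illustration and not necessarily the operator used in Ref. [18].

```julia
# Rotation sketch: rotate a latent vector by Δθ in the plane spanned by
# quasi-eigenvectors ξi and ξj, then rescale so that ⟨z|z⟩ = M.
using LinearAlgebra

function rotate_latent(z::Vector{Float64}, ξi::Vector{Float64}, ξj::Vector{Float64}, Δθ)
    ei, ej = ξi / norm(ξi), ξj / norm(ξj)        # unit directions of the rotation plane
    R = I + (cos(Δθ) - 1) * (ei * ei' + ej * ej') +
        sin(Δθ) * (ej * ei' - ei * ej')          # plane rotation by Δθ
    z_rot = R * z
    return sqrt(length(z)) * z_rot / norm(z_rot) # enforce ⟨z|z⟩ = M
end
```

Applying `rotate_latent` repeatedly with the consecutive pairs $(\xi_{\alpha,1}, \xi_{\alpha+1,1})$ and a small $\Delta\theta$ reproduces the label-to-label sweep of Algorithm 1 and Fig. 8.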
Algorithm 1: Latent space rotation pseudocode.

    initialization:
        |z⟩ = E|x⟩                                       (initial condition)
        Δθ = π/…                                         (rotation step)
        α = 0                                            (initial label)
        i = 1                                            (set index)
    for α in {0, 1, 2, ..., 8} do
        for r in {1, 2, ...} do
            |z⟩ = R_{ξ_{α,i}, ξ_{α+1,i}}(r·Δθ, (r−1)·Δθ) |z⟩   (rotation)
            |z⟩ = |z⟩ / √(⟨z|z⟩/M)                             (norm)
            |x⟩ = G|z⟩                                         (projection to real space)
        end
    end

VII. CONCLUSIONS

We have shown that it is possible to build a set of orthogonal vectors (quasi-eigenvectors) in latent space that both span latent space and map to specific labeled features. These orthogonal vectors reveal the latent space topology. We found that for MNIST, almost all the images in the data set map to a small subset of the dimensions available in latent space. We have shown that we can use these quasi-eigenvectors to reduce noise in data. We have also shown that we can perform matrix operations in latent space that map to feature transformations in real space.

FIG. 8. Ten latent vectors projected into real space after each iteration in which the latent vectors are rotated by an angle Δθ ≈ π/…, i.e., the images transform from the number 0 to the number 1 and then from 1 to 2, until reaching the number 9.

On the one hand, the deeper the NN, the better its capacity to learn complex data, and as depth increases, the non-linearity increases as well. From catastrophe theory [19], we know that in non-linear dynamical systems small perturbations can be amplified, leading to bifurcation points and to completely different solution families of these non-linear dynamical systems. On the other hand, the results in Ref. [11] suggest a different picture with what the authors call vector arithmetic. This arithmetic occurs in latent space. Similarly, there is an ongoing debate on how deep a NN should be to perform a specific task. In addition, an equivalence between deep NNs and shallow wide NNs has been proposed [10]. Our work contributes to this discussion of the emergent effective linearity of NNs as transformations. While the NNs we used are intrinsically non-linear, they exhibit local linearity over a region of interest in latent space. This subspace maps to labeled features. In this sense, we say the non-linear NNs are effectively linear over the domain of interest. We have shown this for MNIST successfully. From an application standpoint, mapping to dominant quasi-eigenvectors could be useful for medical imaging, diagnosis and prognosis if, e.g., the labels denoted the severity of a disease; or for predicting new materials if the labels denoted specific material features or external physical parameters.
ACKNOWLEDGEMENTS
This research was supported in part by Lilly Endowment, Inc., through its support for the Indiana University Pervasive Technology Institute. This work is partially supported by the Biocomplexity Institute at Indiana University, National Science Foundation grant 1720625 and National Institutes of Health grant NIGMS R01 GM122424.

[1] F. Noé, A. Tkatchenko, K.-R. Müller, and C. Clementi, Annual Review of Physical Chemistry, 361 (2020).
[2] D. J. MacKay, Information Theory, Inference and Learning Algorithms (Cambridge University Press, 2003).
[3] D. P. Kingma, (2017).
[4] I. Goodfellow, arXiv preprint arXiv:1701.00160 (2016).
[5] T. J. Gardner and M. O. Magnasco, Proceedings of the National Academy of Sciences, 6094 (2006).
[6] W. Peebles, J. Peebles, J.-Y. Zhu, A. Efros, and A. Torralba, arXiv preprint arXiv:2008.10599 (2020).
[7] A. Razavi, A. van den Oord, and O. Vinyals, in