Sets of autoencoders with shared latent spaces
Vasily Morzhakov

November 7, 2018
Abstract
Autoencoders learn latent models of input data. Recent works have shown that they also estimate the probability density function of the input, which makes it possible to apply Bayesian decision theory. If we obtain a latent model of the input data for each class, or for selected points in the space of parameters in a parameter estimation task, we can estimate likelihood functions for those classes or points in parameter space. We show how a set of autoencoders solves the recognition problem. Each autoencoder describes its own model, or context; the latent vector that presents input data in the latent space may be called the treatment in that context. Sharing the latent spaces of the autoencoders gives a very important property: the ability to separate treatment and context when the input information is processed through the set of autoencoders. There are two remarkable and most valuable results of this work: a mechanism that shows a possible way of forming abstract concepts, and a way of reducing dataset size during training. These results are confirmed by the tests presented in the article.
1 Introduction

Classification and parameter estimation tasks have a strict probabilistic background: Bayesian decision theory and the Bayesian estimator. Originally, artificial neural networks (ANNs) were based on probabilistic models, but nowadays the best applied solutions usually don't estimate the distribution of the data. This behavior brings some fundamental problems that can't be eliminated without changing the approach:

1. Adversarial patches [6] change the output of convolutional neural networks. Uncontrollable backpropagation optimization of millions of parameters doesn't keep networks stable even in close proximity to points presented in the training set.
2. It is extremely difficult to locate the source of a problem in a trained neural network (NN) and eliminate it by partial retraining. The network works as a black box.
3. Multiple interpretations are not allowed, and the statistical nature of the input data is ignored.

Figure 1: An example of the adversarial attack from the article [6]
2 Estimation of the probability density function

The probability density function (PDF) plays a very important role in Bayesian decision theory. A PDF of the input data is required to calculate the a posteriori risk when choosing the best decision in classification tasks.

Figure 2: An example of multiple interpretations in vision: "old or young lady"

It turns out that autoencoders are very suitable for estimating the PDF. This can be explained in the following way: the training dataset defines the density of the input data, meaning the more training samples are placed around a point in the input space, the better the autoencoder will reconstruct the input there. Besides, there is a latent vector in the bottleneck of the autoencoder; if an input vector is projected to an area of the latent space that was not involved in training, then this input vector is unlikely.

There are a few works where the connection between autoencoders and PDFs is proved theoretically: Alain, G. and Bengio, Y., "What regularized autoencoders learn from the data generating distribution", 2013 [1]; Kamyshanska, H., "On autoencoder scoring", 2013 [2]; and Im, D. J., Belghazi, M. I., and Memisevic, R., "Conservativeness of Untied Auto-Encoders", 2016 [3].

In this article another approach to estimating a PDF is shown, which gives clues to constructing sets of autoencoders with a shared latent space. We consider an arbitrary autoencoder x* = f(g(x)), where g(x) is the encoder and f(z) is the decoder.
The encoder projects the input space to the corresponding latent space. The probability density function for x ∈ X and z ∈ Z is equal to:

p(x) = ∫_Z p(x|z) p(z) dz    (1)

Our goal is to obtain a relationship between p(x) and p(z), because p(z) will be the issue related to this research in the following.

For simplicity, let's assume that after autoencoder training the input vector x is presented as x = f(g(x)) + n, where n is Gaussian noise; that is, there is a latent model that we were able to obtain, for example, by training a multi-layer neural network. Then the distribution of the noise n = f(z) − x with deviation σ is:

p(n) = const × exp(−(x − f(z))ᵀ(x − f(z)) / 2σ) = p(x|z)    (2)

Here (x − f(z))ᵀ(x − f(z)) is the distance between x and its reprojection through the latent space back to X. This distance reaches its minimum value at some point z*. The partial derivative of the exponent's argument in equation (2) is zero along each direction z_i, where the z_i are axes in Z:

0 = (∂f(z*)/∂z_i)ᵀ (x − f(z*)) + (x − f(z*))ᵀ ∂f(z*)/∂z_i

(∂f(z*)/∂z_i)ᵀ (x − f(z*)) is a scalar, therefore:

0 = (∂f(z*)/∂z_i)ᵀ (x − f(z*))    (3)

Choosing the point where the distance ‖x − f(z)‖ reaches its minimum is founded on the weight optimization of the autoencoder.
While training, the least-squares (L2) loss between input and output is minimized over all training samples:

min_θ ‖x − f_θ(g_θ(x))‖, ∀x ∈ X_train

where θ are the autoencoder's weights. After successful training, the optimized weights bring g(x) close to z*, and we can consider g(x) an estimate of z*.

We can also present f(z) through the first Taylor term around z* in (2):

f(z) = f(z*) + ∇f(z*)(z − z*) + o(z − z*)

so equation (2) becomes:

p(x|z) ≈ const × exp(−((x − f(z*)) − ∇f(z*)(z − z*))ᵀ((x − f(z*)) − ∇f(z*)(z − z*)) / 2σ)
= const × exp(−(x − f(z*))ᵀ(x − f(z*)) / 2σ) × exp(−(∇f(z*)(z − z*))ᵀ(∇f(z*)(z − z*)) / 2σ) × exp((x − f(z*))ᵀ∇f(z*)(z − z*) / σ)

Note that the last factor is equal to 1, according to equation (3). The first factor does not depend on z, so it can be brought outside the integral sign. Another assumption we make is that p(z) is a smooth function, so it can be replaced by p(z*) around z*.

After making all these assumptions, the integral (1) can be estimated as:

p(x) = const × p(z*) × exp(−(x − f(z*))ᵀ(x − f(z*)) / 2σ) ∫_Z exp(−(z − z*)ᵀ W(x)ᵀW(x)(z − z*) / 2) dz, z* = g(x)

where W(x) = ∇f(z*)/√σ. The last integral is the n-dimensional Euler–Poisson integral:

∫_Z exp(−(z − z*)ᵀ W(x)ᵀW(x)(z − z*) / 2) dz = √(det(W(x)ᵀW(x)/π))

Finally, the distribution p(x) has the following approximation:

p(x) = const × exp(−(x − f(z*))ᵀ(x − f(z*)) / 2σ) × p(z*) × √(det(W(x)ᵀW(x)/π)), z* = g(x)    (4)

We have shown that the input data distribution p(x) can be estimated as a product of three factors:

1. the distance between the input vector and its reconstruction;
2. the distribution p(z) at the projected point z* = g(x);
3.
the integral value, which is calculated directly from the autoencoder's weights.

Despite all the assumptions we have made, the result still makes sense and has the right to exist.

3 Set of autoencoders
The idea of using a set of autoencoders for choosing the most likely solution in classification is not new [2]. As we showed above, the distribution p_class(x) can be estimated by a trained autoencoder for each class; p_class(x) corresponds to the likelihood function. Then, according to Bayesian decision theory, we simply choose argmax_class p_class(x).

Thanks to (4), we can estimate log p_class(x) for each class:

log p(x) = const₁ + const₂ × (−(x − f(z*))ᵀ(x − f(z*))) + log(p(z*) √(det(W(x)ᵀW(x)/π))), z* = g(x)    (5)

It is possible to form vectors from all the terms in equation (5) for each autoencoder and to tune const₁ and const₂.
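The decision rule above can be sketched in a few lines of numpy. This is a simplified illustration, not the article's implementation: the score keeps only the reconstruction-distance term of (5) (the p(z*) and determinant terms are dropped for brevity), and f, g, and sigma are placeholder decoder, encoder, and deviation.

```python
import numpy as np

def score(x, f, g, sigma=1.0):
    """Partial log-likelihood score in the spirit of eq. (5): the negative
    squared reconstruction distance scaled by the noise deviation.
    The p(z*) and determinant terms of (5) are omitted in this sketch."""
    z = g(x)          # treatment: projection into the latent space
    r = x - f(z)      # reconstruction residual
    return -float(r @ r) / (2.0 * sigma)

def classify(x, autoencoders, sigma=1.0):
    """Bayesian decision: pick the class whose autoencoder assigns
    the highest estimated likelihood to the input."""
    scores = [score(x, f, g, sigma) for (f, g) in autoencoders]
    return int(np.argmax(scores))
```

For instance, with two toy linear autoencoders that each reconstruct only one axis of the plane, an input lying on the first axis is assigned to the first class.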
The most novel part of this research is sharing the latent space between autoencoders. Training an autoencoder, we try to obtain a "treatment" of the input data that describes it sufficiently for reconstruction. When a set of autoencoders is used, this treatment may be the same for different autoencoders. For example, in a computer vision task we can define each autoencoder for a corresponding orientation. Then a point in the latent space of an autoencoder defines the properties of an object in the field of view. Those properties are actually the same for all orientations, so their presentation in the latent spaces of different autoencoders should be the same. There are no obvious consequences of this proposal, but it helps to improve the estimation of p(z) in (4) and to receive more abstract concepts from the input data. The distribution p(z) becomes shared for all autoencoders, and all samples we have in all contexts are projected into the same space Z. This also provides transfer of samples into different contexts, which allows one-shot learning.
4 Substitutability of classification tasks and parameter estimation tasks
We still haven't discussed how we choose the number of autoencoders, or the number of contexts. It seems there is no strict rule. We can train an autoencoder for a class or for a value in some parameter space. An example clarifies this. Let's consider the task of face recognition. Input images are human faces; then two different approaches are allowed:

1. The context is the face orientation. In this case reconstruction of an input image requires a "treatment" that is a face-identification code. During training we need to show the same face from different directions, "freezing" its latent code.
2. The context is the face's identity. In this case reconstruction of an input image requires the face's orientation. During training we show different faces from the same direction.

An optimal Bayesian decision will be chosen with regard to the face's orientation in the first case, and with regard to the face's identity in the second.
Besides regular autoencoder training, we need to make the latent space shared between the autoencoders. In order to receive tied latent spaces, the same essence should be demonstrated in different contexts. For example, speaking of MNIST images, cross-training can be presented schematically as in figure 3.

Figure 3: Scheme of cross-training

First, the encoder of one autoencoder encodes an image; then the received latent code is decoded by the decoder of another autoencoder. This pipeline translates an image from one context to another. Regular self-training steps for each autoencoder alternate with cross-training steps. In the end, all autoencoders receive a shared latent space; in other words, the "treatment" becomes tied across different contexts.
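The alternation of self-training and cross-training steps can be sketched with tiny linear autoencoders in numpy. All shapes, names, and the learning rate here are illustrative assumptions, not the article's setup; the point is only the shape of the update.

```python
import numpy as np

# Two toy linear autoencoders: encoder matrices enc[i] (k x d) and
# decoder matrices dec[i] (d x k), randomly initialized.
rng = np.random.default_rng(0)
d, k, lr = 4, 2, 0.05
enc = [rng.normal(size=(k, d)) * 0.1 for _ in range(2)]
dec = [rng.normal(size=(d, k)) * 0.1 for _ in range(2)]

def step(x_a, x_b, i, j):
    """One training step: encode x_a with autoencoder i, decode with
    autoencoder j, and descend on ||dec_j(enc_i(x_a)) - x_b||^2.
    With i == j and x_a == x_b this is a regular self-training step;
    with i != j it is a cross-training step on a pair of samples that
    show the same treatment in two different contexts."""
    z = enc[i] @ x_a                      # shared latent code (treatment)
    r = dec[j] @ z - x_b                  # (cross-)reconstruction residual
    g_dec = np.outer(r, z)                # dL/d dec_j (factor 2 folded into lr)
    g_enc = np.outer(dec[j].T @ r, x_a)   # dL/d enc_i
    dec[j] -= lr * g_dec
    enc[i] -= lr * g_enc
    return float(r @ r)
```

Alternating these steps on paired samples drives the cross-reconstruction loss down, which is exactly the sense in which the latent spaces become shared.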
Let's consider a sample based on the MNIST dataset. It demonstrates the principles of training sets of autoencoders with a shared latent space. In this sample two effects will be shown: forming an abstract concept of a "cube" and one-shot learning.

Images of digits are sprayed on cube edges. The autoencoders reconstruct edges in this sample; the contexts are the orientations of those edges.

Figure 4: Cube samples

Figure 5 shows a "zero" digit in 100 orientational contexts; the first 34 of them correspond to different orientations of a side edge and the other 76 to orientations of the top edge.

Figure 5: The same digit in different contexts

It is proposed that each of these 100 images has the same "treatment", i.e. the same value of its latent vector. Arbitrary pairs of them are used for cross-training.

Following the proposed cross-training approach, the autoencoders were successfully trained: a latent code produced by one autoencoder may be decoded by another, and a sensible translation is reconstructed. Figure 6 shows the result of context transferring: an input image is encoded by the 10th encoder and decoded by the other autoencoders.

Figure 6: Context transferring

Thus, a latent representation (treatment) of an input image may be reconstructed in any of the trained contexts.
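The transfer shown in figure 6 can be mimicked with a toy numpy sketch, assuming the simplest imaginable contexts: 2-D rotations whose encoders and decoders are exact inverses. All names here are illustrative stand-ins, not the article's trained networks.

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def make_context(theta):
    """A toy 'context': observations are rotated by theta. The decoder f
    renders a canonical treatment in this orientation; the encoder g maps
    an observation back to the canonical treatment."""
    R = rot(theta)
    return (lambda z: R @ z), (lambda x: R.T @ x)   # decoder f, encoder g

f0, g0 = make_context(0.0)           # context 0: upright
f1, g1 = make_context(np.pi / 2)     # context 1: rotated 90 degrees

x_in_ctx0 = f0(np.array([1.0, 0.0])) # the object as seen in context 0
z = g0(x_in_ctx0)                    # shared treatment, context-free
x_in_ctx1 = f1(z)                    # the same object re-rendered in context 1
```

Encoding in one context and decoding in another reproduces the same treatment under a different orientation, which is the whole point of tying the latent spaces.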
In equation (4) the deviation σ appears; it was chosen as a constant for all components of the input vector. However, if some components have no connection with the latent models, the deviation will be significantly higher for those components. σ is in the denominator, which means the higher the deviation of a residual component, the less it contributes to the probability estimate. This can be considered as partial masking of the inputs. In the shown example of cube edges, the masks are obvious. The approach was simplified a bit: the residual between an input vector and its reconstruction is multiplied by the mask. In general, a more precise deviation estimation is required.

Figure 7: Masks for cube edges
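The simplification described above can be written down directly; this is a minimal sketch, with the mask assumed to be a 0/1 vector marking components that belong to the context (e.g. pixels on the cube edge).

```python
import numpy as np

def masked_score(x, x_rec, mask, sigma=1.0):
    """Reconstruction score with per-component masking: components
    outside the mask (mask == 0) are treated as having effectively
    infinite deviation, so, following the article's simplification,
    the residual is simply multiplied by the mask."""
    r = (x - x_rec) * mask
    return -float(r @ r) / (2.0 * sigma)
```

A consequence worth noting: arbitrary noise in the masked-out components does not change the score at all.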
10 Idea of one-shot learning
In some cases it's reasonable to operate with latent codes ignoring the context. This allows us to recognize a pattern in different contexts even if it was demonstrated in only one of them. For one-shot learning it's enough to show a new pattern once. For the purpose of testing, let's try to distinguish two different patterns that were not presented in the MNIST dataset:

Figure 8: Two new patterns

These signs are shown in one orientation; after that, a hyperplane between the two points that correspond to the treatments of the two signs should separate these signs in other orientations. It's important to notice that the autoencoders are not trained on the new signs. Thanks to the variety of digits in MNIST, it is possible to project the new signs into the latent space and project them back to the image space in different orientations. For example, the "V" sign is decoded in the following way:

Figure 9: "V" sign reconstructions in different contexts

Finally, successful separation was obtained. Showing just one sample per class allowed us to recognize the signs in all 100 contexts.

Figure 10: Distance distribution

It's also possible to visualize the result on separate images:

Figure 11: Distance distribution

This approach is similar to the ideas of transfer learning for one-shot learning proposed in [7].
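The separating hyperplane between the two stored treatments can be built in one line of algebra: it is the perpendicular bisector of the segment between the two latent codes. A minimal numpy sketch, with all names illustrative:

```python
import numpy as np

def one_shot_classifier(z_a, z_b):
    """Build a classifier from ONE stored treatment per class: the
    separating hyperplane is the perpendicular bisector of the segment
    z_a..z_b. Returns a function mapping a treatment to class 0
    (the z_a side) or class 1 (the z_b side)."""
    w = z_b - z_a                    # hyperplane normal
    b = 0.5 * (z_a + z_b) @ w        # offset: plane passes through the midpoint
    return lambda z: int(z @ w - b > 0)
```

Treatments of the same sign projected from other contexts then fall on the same side of this plane, which is how one sample per class suffices.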
11 Idea of abstract concepts
The latent representation (treatment) can be ignored in some cases, so that just the context-likelihood vector is passed to a higher processing level. Let's consider samples of context-likelihood vectors for cubes in the same orientation but with different signs sprayed on their edges (digits 5 and 9). These vectors are shown as 10×10 maps:

Figure 12: Likelihood context maps

The maps are almost identical despite the different textures on the cube edges. It means that the likelihood vectors allow us to formulate a new concept: the 3D cube. So we need to train one more level with just one autoencoder that is responsible for reconstructing different orientations of 3D cubes. Put simply, the next level will force the 3D cube to rotate. Thanks to ignoring the latent code, it's possible to train this new concept showing just one digit on the cube edges. So, the experiment is constructed in the following way:

1. The training dataset contains images of cubes with rotation angles from 0 to 90 degrees. The same digit 5 is sprayed on all edges.
2. The context-likelihood vector is passed to the next level, where only one autoencoder, responsible for the cube model, is placed.
3. Then back projection is implemented. For points in the latent space of the second level it is possible to decode the reconstructed context-likelihood vector, add latent codes on the first level, and reconstruct an image with a rotated cube on it.

The training dataset consisted of 54321 images; here are some examples:

Figure 13: Cubes that were rotated from 0 to 90 degrees

The autoencoder on the second level had only one component in its latent vector, because it was known that we were dealing with only one degree of freedom in rotation.
After training, this component can be varied from 0 to 1 and the context-likelihood vector decoded:

Figure 14: Likelihood context map's variety

Then this vector is translated back to the first level, local maxima are chosen, and an arbitrary latent code of the edge texture is taken. Let it be the code of a digit 3. Changing the latent component on the second level, we receive the following reconstructed images:

Figure 15: Back projection with the changed latent code

We can also generate (imagine) a new cube that was not presented in the training data, with a "V" sign on its edges:

Figure 16: Back projection with the changed latent code

Thus, we actually have an autoencoder that is responsible for a new cube concept on the second level of processing. In practice, this is a very important mechanism in recognition tasks: abstract concepts on higher levels allow addressing of ambiguities.
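The back-projection pipeline of step 3 can be outlined in numpy. Every function here is a hypothetical stand-in (the trained second-level decoder and the first-level decoders are assumed, not reproduced): the second-level latent component is decoded into a context-likelihood vector, the most likely context is picked, and that context's first-level decoder renders the chosen texture code.

```python
import numpy as np

def back_project(t, z_texture, decode_level2, level1_decoders):
    """Back projection through the two-level scheme (a sketch).
    t: the single second-level latent component (rotation phase, 0..1).
    decode_level2: maps t to a context-likelihood vector.
    level1_decoders: one decoder per context; the most likely context's
    decoder renders the chosen edge-texture code z_texture."""
    likelihoods = decode_level2(t)       # second-level decoding
    ctx = int(np.argmax(likelihoods))    # choose the likelihood maximum
    return level1_decoders[ctx](z_texture)
```

With a fabricated second-level decoder that peaks at a context index proportional to t, varying t sweeps the rendering through the contexts, i.e. rotates the imagined cube.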
12 Conclusions
A mathematical approach is proposed for estimating the likelihood functions of different models that describe the input data, where the models are assumed to be autoencoders. The following ideas concerning these autoencoders are proposed: their latent spaces can be shared, and the latent codes correspond to the treatment of information. It is also established that the autoencoders, or latent models, are themselves contexts of that treatment.

It was shown that autoencoders are equal to or better than fully-connected neural networks for recognition tasks on the MNIST dataset.

An effect of one-shot learning is demonstrated. It is based on separating treatment and context, and on transfer learning.

The ability to form new concepts is shown on a sample where a 3D cube concept was received in a two-level processing scheme.
13 Source codes
Link to all source codes of the experiments described in the article: https://gitlab.com/Morzhakov/SharedLatentSpace

References

[1] Alain, G. and Bengio, Y. What regularized autoencoders learn from the data generating distribution. 2013.
[2] Kamyshanska, H. and Memisevic, R. On autoencoder scoring. 2013.
[3] Im, D. J., Belghazi, M. I., and Memisevic, R. Conservativeness of Untied Auto-Encoders. 2016.
[4] Morzhakov, V. and Redozubov, A. An Artificial Neural Network Architecture Based on Context Transformations in Cortical Minicolumns. 2017.
[5] Redozubov, A. Holographic Memory: A novel model of information processing by Neural Microcircuits. Springer Series in Cognitive and Neural Systems, Volume 11, 2017, 271-295.
[6] Yuan, X., He, P., Zhu, Q., and Li, X. Adversarial Examples: Attacks and Defenses for Deep Learning. 2018.
[7] Duan, Y., Andrychowicz, M., Stadie, B., Ho, J., Schneider, J., Sutskever, I., Abbeel, P., and Zaremba, W. One-Shot Imitation Learning. 2017.
[8] Gaussian Integral. http://en.wikipedia.org/wiki/Gaussian integral