StarNet: Gradient-free Training of Deep Generative Models using Determined System of Linear Equations
Amir Zadeh, Santiago Benoit, Louis-Philippe Morency
LTI, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
{abagherz,sbenoit,morency}@cs.cmu.edu
Abstract
In this paper we present an approach for training deep generative models based solely on solving determined systems of linear equations. A network that uses this approach, called a StarNet, has the following desirable properties: 1) training requires no gradients, since the solution of a system of linear equations is not stochastic; 2) it is highly scalable when solving the systems of linear equations w.r.t. the latent codes, and similarly w.r.t. the parameters of the model; and 3) it gives desirable least-squares bounds for the estimation of the latent codes and network parameters within each layer.
Generative modeling requires learning a latent space and a decoder that maps latent samples to an output manifold. We study the two cases of inverse-funnel feedforward decoders and convolutional decoders through the lens of a system of linear equations. Our proposed algorithm is called StarNet.

Systems of linear equations are a well-studied area of linear algebra, with algorithms ranging from elimination methods to solutions based on the matrix pseudoinverse. The benefits of the StarNet learning process are as follows:

• No gradients are required to solve a system of linear equations. Therefore, training can be done at a lower computational cost and with fewer hyperparameters. Training deep neural networks often requires calculating the gradient of an objective function (e.g. log likelihood) w.r.t. the parameters of a model. Gradient calculation is computationally heavy, and often scales poorly since batch updates are serial throughout an epoch.

• Certain algorithms for solving a system of linear equations are highly scalable. Unlike batch learning, where each datapoint waits its turn within the epoch, a system of linear equations offers a more scalable framework. Certain steps of the StarNet learning algorithm scale linearly up to as many computational nodes as there are datapoints in a dataset.

• There is strong theoretical backing for the solution of a system of linear equations. It is imperative that deep models have explainable behavior, during both learning and application. Learning via a system of linear equations yields residuals for each layer, which can be used to identify and better approach important issues in machine learning such as mode collapse. Furthermore, for each datapoint, an estimate of how well each layer fits that datapoint can be measured, giving a better understanding of the intrinsic success or failure cases of deep learning.

A system of linear equations offers unique (exact or approximate) solutions if it is determined (well-determined or over-determined). This paper presents such conditions for two commonly used decoder types: feedforward and convolutional. These conditions are not strong, and a significant portion of currently used decoders satisfy them. Our StarNet learning algorithm progressively trains a deep neural network from the last layer to the first. It follows a coordinate descent method to reach a plateau in solving the system of linear equations w.r.t. both the latent codes and the parameters of the model, at each layer. Most of the trained models in this paper reach convergence in only a few epochs - a significant boost over gradient-based models. Because the latents of a StarNet are not restricted (e.g. by an encoder as in an Auto-Encoder or PCA), even a single-layer StarNet offers remarkable generation capabilities.

Generative modeling is a fundamental research area in machine learning. Notable related works are outlined below. In a neural framework, Auto-Encoders (AE) and their variational implementation (VAE [4]) have been commonly used for learning based on a reconstruction objective. GANs [3] deviate from this framework by optimizing an objective that pushes two networks towards an equilibrium where a generator's output is indistinguishable from the real generative distribution.
An encoderless application of VI to generative modeling, Variational Auto-Decoders (VAD [10]) use only a generator but with an objective similar to the VAE - with the added benefit of robustness to noise and missing data. All of these approaches rely on the computation of gradients during training (with GAN and VAD commonly requiring the same at test time to invert the generative model, i.e. for posterior inference). Flow-based models use invertible mappings between inputs and outputs of the same dimension; convolutional variants are limited to 1×1 convolutions [5]. StarNet does not require the same dimensions for input and output, and is not restricted to 1×1 convolutions. However, similar to the latter, StarNet does exhibit certain mild limitations when solving for CNNs.

Aside from deep techniques, Principal Component Analysis (PCA [9]) is a popular method for learning low-dimensional representations of data. PCA is identical to a linear auto-encoder with a least-squares reconstruction objective. The PCA algorithm does not require gradient computations. It differs from StarNet in that the feedforward operation of PCA is restricted to a linear encoder, while StarNet imposes no restrictions on the latent space. In simple terms, StarNet has no direct mapping between inputs and latents; the latents are recovered using the pseudoinverse of the model weights.

StarNet is a method for training certain neural decoders without gradient computations. We cover two major types in this paper: 1) feedforward StarNet, and 2) convolutional StarNet. For each type, we outline the necessary conditions required for successful training and inference.
We first start with the feedforward StarNet, which uses a feedforward decoder. The following is the operation of the $i$-th layer of a neural network:

$$h_i = a(W_i \cdot h_{i-1}) \quad (1)$$

with $h_i \in \mathbb{R}^{d_i}$ as output, $h_{i-1} \in \mathbb{R}^{d_{i-1}}$ as input, $W_i \in \mathbb{R}^{d_i \times d_{i-1}}$ as an affine map, and $a$ as a non-linear activation function. Note that, for simplicity of the equations later, the formula accounts for the bias as a constant dimension appended to every $h_{i-1}$ and a corresponding row added to $W_i$. The operation above is unfolded $K$ times for a deep neural decoder with $K$ layers.

Let $X = \{x^{(n)}\}_{n=1}^{N}$ be a given i.i.d. dataset. Essentially, in any generative or unsupervised framework, the hope is that $h^{(n)}_{i=K}$ is as close as possible to the respective $x^{(n)}$.

Theorem 1. $\forall i$, Equation 1 is solvable w.r.t. $h_{i-1}$ as long as $d_{i-1} \leq d_i$ (i.e. the decoder is an inverse funnel). This is coupled with $a^{-1}$ as the inverse of the activation.

Proof 1. The condition is required for a well-determined or over-determined system of linear equations w.r.t. $h_{i-1}$. Let $\hat{h}_i$ be the non-activated neural responses in Equation 1; naturally, $\hat{h}_i = a^{-1}(h_i)$. For each $x^{(n)}$ we have the following:

$$W_i \cdot h^{(n)}_{i-1} = \hat{h}^{(n)}_i \quad (2)$$

The above is a system of linear equations with a unique solution in the case of $d_{i-1} = d_i$ and a least-squares solution in the case of $d_{i-1} < d_i$. The solution is exactly:

$$h^{*}_{i-1} = (W_i^T W_i)^{-1} W_i^T \, \hat{h}_i \quad (3)$$

$h^{*}_{i-1}$ is the least-squares solution (hence the name StarNet), and $(W_i^T W_i)^{-1} W_i^T$ is the pseudoinverse of the weights of the $i$-th layer. It is important that the initialization of $W_i$ does not produce rank deficiency; therefore, random initialization or its orthogonal projection (the product of a decomposition) may be used. Subsequently, the non-activated responses needed to recurse into the previous layer are

$$\hat{h}^{*}_{i-1} = a^{-1}(h^{*}_{i-1}) \quad (4)$$

Commonly used activation functions are invertible, except for ReLU in the negative part of its domain. Alternatives such as leaky ReLU can be used instead.
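To make the latent-recovery step concrete, the following is a minimal NumPy sketch of Equations 2-4 for a single layer. This is an illustration rather than the authors' released code; the helper names, the use of `np.linalg.lstsq`, and the leaky-ReLU slope of 0.5 (the value reported to work best later in the paper) are assumptions.

```python
import numpy as np

def inv_leaky_relu(h, slope=0.5):
    """Inverse of leaky ReLU: divide the negative part by the slope."""
    return np.where(h < 0, h / slope, h)

def solve_latents(W, H, slope=0.5):
    """Recover the inputs h_{i-1} of layer i from its activated outputs h_i (Eqs. 2-3).

    W : (d_i, d_{i-1}) weight matrix of layer i (bias column included).
    H : (N, d_i) activated outputs of layer i over the dataset.
    Returns the (N, d_{i-1}) least-squares estimate of the inputs.
    """
    H_hat = inv_leaky_relu(H, slope)            # undo the activation: h_hat_i = a^{-1}(h_i)
    # Least-squares solution of W h_{i-1} = h_hat_i for all datapoints at once,
    # equivalent to applying the pseudoinverse (W^T W)^{-1} W^T.
    H_prev, *_ = np.linalg.lstsq(W, H_hat.T, rcond=None)
    return H_prev.T
```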
Lemma 1
The above will not have a unique solution if $d_{i-1} > d_i$, since $\mathrm{nullity}(W_i) > 0$. Therefore, the architecture cannot go lower in dimensions without the risk of rank deficiency. If it does, strong assumptions on the observations $x^{(n)}$ have to be made, such as the presence of a particular structure or certain simplicities [1], often covered in compressed sensing [2]. In the scope of this paper, since the majority of neural decoders are inverse funnels, we leave the study of rank-deficient weights to the next edition of this preprint.
Lemma 2
Since samples drawn from $p(x)$ are assumed i.i.d., the solution to Equation 2 can be distributed across $N$ independent computations, therefore allowing for linear job distribution over computational nodes.
Lemma 3
The solution of Equation 1 for a datapoint can be seen as the latent inference process (similar, e.g., to an encoder's feedforward pass), which is part of both training and testing. It can be applied recursively from higher to lower layers, as a succession of solving Equation 1 for layers $K-1, K-2, \cdots, 1$ and applying the inverse of the activation $a$ at each layer. Therefore, the complexity of inference in StarNet remains linear w.r.t. the number of layers $K$. This is an improvement over encoderless gradient-based approaches that require a large number of epochs (e.g. approximate posterior inference in VAD).
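Reusing the `solve_latents` helper from the sketch above, the recursive inference of Lemma 3 amounts to a single loop over the layers. This is again a hypothetical sketch; it assumes every layer shares the same invertible (leaky-ReLU) activation.

```python
def infer_latents(weights, X, slope=0.5):
    """Infer the lowest-layer latent codes for a dataset X by solving layers K..1.

    weights : list [W_1, ..., W_K] of per-layer weight matrices.
    X       : (N, d_K) observed datapoints, treated as the outputs h_K.
    """
    H = X
    for W in reversed(weights):          # layers K, K-1, ..., 1
        H = solve_latents(W, H, slope)   # one linear-system solve per layer
    return H                             # latent codes; cost is linear in K (Lemma 3)
```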
Lemma 4
If Equation 2 is over-determined, then a norm of the residuals of each equation (i.e. of each datapoint when solving for $h_{(\cdot)}$) can be used as a measure of confidence, for detecting outliers, or for detecting mode collapse. An in-distribution datapoint with a high residual norm signals inference suboptimality given the layer parameters. Similarly, a datapoint may be judged to belong to a certain distribution using a probability field defined over such a measure.
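A small sketch of the residual-based measure of Lemma 4, under the same assumptions as the previous snippets: the per-datapoint residual norm of the over-determined system flags points the layer fits poorly.

```python
def residual_norms(W, H, slope=0.5):
    """Per-datapoint residual norm ||W h*_{i-1} - a^{-1}(h_i)||_2.

    Large values can signal outliers, mode collapse, or inference suboptimality
    for in-distribution datapoints (Lemma 4).
    """
    H_hat = inv_leaky_relu(H, slope)
    H_prev = solve_latents(W, H, slope)
    return np.linalg.norm(H_prev @ W.T - H_hat, axis=1)
```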
Lemma 5
A layer with $d_i = d_{i-1}$ is not a useful neural construct for StarNet. Such a layer collapses an affine transformation to a degenerate identity projection. If a solution lies within $\mathbb{R}^{d_{i-1}}$, then Equation 3 will find it, regardless of the depth of similarly shaped layers projecting into that same space.
Theorem 2
Equation 1 can also be solved w.r.t. $W_i$ iff $d_{i-1} \leq N$; essentially, there need to be more datapoints than the number of input neurons in each individual layer of the network.
Proof 2
Let $W_{i,j}$ be the $j$-th row of the weights of the $i$-th layer (the row that produces the $j$-th dimension in the output of the layer), and let $H_{i-1}$ be the matrix of inputs (over the entire dataset) to the $i$-th layer. Equation 1 can then be written as:

$$a^{-1}(H_{i,j})^T = H_{i-1}^T \cdot W_{i,j}^T \quad (5)$$

where $H_{i,j}$ is the $j$-th row of the matrix of the $i$-th layer's outputs over the dataset.
Figure 1: The operations of a conv-unpool layer with an unpooling of 2 × 2. Essentially, a local patch (right) is created by reshaping the responses of the previous layer (left).

Hence we obtain a new system of linear equations with exactly $N$ equations and $d_{i-1}$ unknowns. The same equation can be written for each row of $W_i$ individually, leading to a solution for $W_i$ that is scalable across as many nodes as there are rows in $W_i$. A solution to this system can similarly be obtained by calculating the pseudoinverse of the matrix of knowns, or by elimination approaches.
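The weight solve of Equation 5 admits an equally compact sketch. As before this is an assumption-laden illustration: NumPy's least-squares routine solves all rows of $W_i$ in one call (each column of the right-hand side corresponds to one output neuron and could be dispatched to a separate node), and the optional equation subsampling anticipates Lemma 8 below.

```python
def solve_weights(H_prev, H, slope=0.5, n_sample=None, rng=None):
    """Solve H_{i-1}^T W_{i,j}^T = a^{-1}(H_{i,j})^T for every row j of W_i (Eq. 5).

    H_prev : (N, d_{i-1}) inputs to layer i over the dataset.
    H      : (N, d_i) activated outputs of layer i over the dataset.
    Returns a (d_i, d_{i-1}) least-squares estimate of W_i.
    """
    H_hat = inv_leaky_relu(H, slope)
    if n_sample is not None:                     # optional subsampling of equations
        rng = rng or np.random.default_rng()
        idx = rng.choice(len(H_prev), size=n_sample, replace=False)
        H_prev, H_hat = H_prev[idx], H_hat[idx]
    W_T, *_ = np.linalg.lstsq(H_prev, H_hat, rcond=None)
    return W_T.T
```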
Lemma 6
Since the network is required to follow an inverse funnel (see Proof 1), $d_{K-1} < N$ is the only condition that needs to be checked in addition to the previous conditions.

Convolution is essentially a local affine map applied along patches of an input. Similar to the feedforward StarNet, a system of linear equations can be solved for convolutional decoders. In order to keep the system of linear equations determined w.r.t. the latents, we also discuss combining the convolution and unpooling operations into a single conv-unpool layer. A conv-unpool layer generates neighboring pixels of a large response map from channels of a smaller response map. Figure 1 shows the operations of a conv-unpool layer.
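As an illustration of how the unpooling half of a conv-unpool layer rearranges channels into space (our reading of Figure 1, not the authors' implementation), a pixel-shuffle style reshape can be written as follows.

```python
def unpool(H, u=2):
    """Rearrange channel blocks into u x u spatial neighbourhoods.

    H : (N, c, w1, w2) response map.  Every group of u*u channels becomes a
        u x u patch, producing an (N, c // (u*u), u*w1, u*w2) map as in Figure 1.
    """
    n, c, w1, w2 = H.shape
    H = H.reshape(n, c // (u * u), u, u, w1, w2)
    H = H.transpose(0, 1, 4, 2, 5, 3)   # interleave the u x u block with the spatial grid
    return H.reshape(n, c // (u * u), u * w1, u * w2)
```

Because this rearrangement is only a permutation of indices, the conv-unpool layer remains a local affine map, so the system-of-equations view of the previous section carries over.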
Theorem 3
The system of linear equations for a conv-unpool layer with a stride of 1 remains determined w.r.t. the latents of the $(i-1)$-th layer as long as $c_i \leq c_{i-1} \cdot u_{i-1}$; with $c_i$, $c_{i-1}$ being the number of channels in the $i$-th and $(i-1)$-th layers of conv-unpool operations, and $u_{i-1}$ the unpooling shape in both image directions (here assumed to be square, but it can also be rectangular).
Lemma 7
The above condition can be restrictive for the border convolutions. Performing full convolutions helps the solution remain determined for all the inputs (even at the very borders of the input response map).
Theorem 4
The same system of linear equations can be solved w.r.t. the convolutional parameters as long as there are fewer parameters in a single convolutional kernel than the total number of convolutions performed with that kernel over the entire dataset. This is a somewhat relaxed condition, since layers tend to be much larger than the shape of a single convolutional kernel.
Lemma 8
When solving for the convolutional parameters, the number of equations is extremely large compared to the number of unknowns (several million equations for a single convolution kernel over the MNIST dataset, against only a handful of unknowns). This is because the same kernels are applied over different patches. Naturally, calculating the pseudoinverse may become computationally exhaustive very quickly as the size of the input images increases. We provide two practical solutions for this: 1) we can solve the system via a subset of the datapoints or equations. Naturally, this puts an emphasis on which datapoints are chosen; hence, the subset should be reflective of the larger system of equations. In practice, random datapoint or patch selection across the entire dataset works well, as shown in Figure 8. 2) To get a more precise approximation of the convolutional weights, we can distribute the over-determined equations across multiple computational nodes. Each computational node still solves an over-determined system, but a smaller one than the main system defined over the entire dataset. The final weights are subsequently the average of the solutions found by the nodes.

Figure 2: Best viewed zoomed in and in color. Generation results for MNIST, Fashion-MNIST, SVHN and CIFAR10 using feedforward StarNet with two layers. (a) denotes the target, (b) is the reconstruction after solving for the second layer of 256 neurons, (c) is the reconstruction after solving for the first layer of 128 neurons.
Training StarNet follows a coordinate descent framework. We start from the last layer of the network, where $h^{(n)}_K = x^{(n)}$. We iteratively solve the system of linear equations w.r.t. the latents, and subsequently w.r.t. the weights of the model, until convergence is reached. The inverse activation is applied after training the layer to reach $h^{*(n)}_{K-1}$. The same approach is repeated for each layer, until we reach the first-layer latents $h^{*(n)}_1$. Test-time inference is identical to the layer-wise training, except that the system of equations in each layer is only solved for the latents and never for the weights (unless the purpose is finetuning). Code is provided under https://github.com/A2Zadeh/StarNet.
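Putting the pieces together, the following is a minimal sketch of the layer-wise coordinate-descent loop described above, reusing the hypothetical `solve_latents` and `solve_weights` helpers from the earlier snippets (how the output layer's activation is handled is our assumption).

```python
def train_starnet(weights, X, epochs=2, slope=0.5):
    """Layer-wise, gradient-free training of a feedforward StarNet.

    weights : list [W_1, ..., W_K], randomly initialised and full rank.
    X       : (N, d_K) dataset, used as the last-layer targets h_K.
    """
    H = X
    for i in reversed(range(len(weights))):                # layers K, K-1, ..., 1
        for _ in range(epochs):
            H_prev = solve_latents(weights[i], H, slope)   # solve w.r.t. the latents (SL)
            weights[i] = solve_weights(H_prev, H, slope)   # solve w.r.t. the weights (SW)
        H = solve_latents(weights[i], H, slope)            # targets for the next layer down
    return weights, H                                      # trained weights, first-layer latents
```

At test time the same loop is run with the `solve_weights` step skipped, matching the inference procedure described above.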
In this section we discuss the experiments and the methodology. We first start by outlining the datasets, and subsequently discuss the training procedure and model architectures. We experiment with the following four datasets.
MNIST, Fashion-MNIST:
MNIST [7] is a collection of handwritten digits between [0, 9]. Fashion-MNIST is a dataset similar to MNIST, except for fashion items. Both datasets are baselines for image generation. The images are monochromatic, with a shape of 28 × 28.

SVHN:
SVHN (Street View House Numbers) is a collection of real-world images of house numbers, obtained from Google Street View images [8]. One variant of this dataset includes bounding boxes around each digit in a house number, while the other variant (used in this paper) separates all digits individually into 32 × 32 images, with digits in [0, 9], similar to MNIST.

CIFAR10:
CIFAR10 is a collection of 32 × 32 color images belonging to 10 different classes, corresponding to different animals and vehicles [6]. The full dataset consists of 6,000 images per class, 60,000 images in total.

The training methodologies for the feedforward StarNet and the convolutional StarNet are as follows:
Feedforward StarNet:
We train feedforward networks with two layers, of 128 and 256 neurons. The final dimension for MNIST and Fashion-MNIST is (28 × 28), while for SVHN and CIFAR10 it is (3 × 32 × 32). The activation is leaky ReLU, with the negative slope as a hyperparameter discussed in Section 5.

Fashion-MNIST: https://github.com/zalandoresearch/fashion-mnist

Figure 3: Best viewed zoomed in and in color. Generation results for MNIST, Fashion-MNIST, SVHN and CIFAR10 using convolutional StarNet with two layers. (a) denotes the target, (b) is the reconstruction after solving for the second conv-unpool layer, (c) is the reconstruction after solving for the first conv-unpool layer.

Convolutional StarNet:
We use a conv-unpool operation, with an unpooling of 2 in each direction. For all trained models, there are two layers of convolution, with the shape × . For MNIST and Fashion-MNIST, the first layer has kernels, and the second layer has . For CIFAR10 and SVHN, we have kernels in the first layer, and kernels in the second. All trained models use full convolution to ensure that the system remains determined w.r.t. the border latents (see Lemma 7). The dimensions chosen for the above feedforward and convolutional StarNet models follow the conditions required in Section 3; therefore, no degenerate identity solution can be learned.

The comparison Auto-Encoder (AE) baseline uses a decoder with the same architecture. The encoder has the same architecture as the decoder (no weights are shared between the encoder and decoder). Default Adam parameters are used for training with a learning rate of e− . Batch size is set to .

The generations for the MNIST, Fashion-MNIST, SVHN and CIFAR10 datasets are shown in Figure 2 for feedforward StarNet, and in Figure 3 for convolutional StarNet. While the trained models use no gradients, the generations show high fidelity to the ground truth. We show the generations based on each layer, and further discuss the results of single-layer generation in Section 6.

We discuss some of the important findings of our experiments in this section.
Performance of a Single Layer Network:
A unique distinction between methods that use a free-form parameterization of the latents (i.e. StarNet, VAD) and methods that use deep layers for inference is that a single-layer network is the most successful in generation. This is because the latents are not controlled by an underlying neural structure during the optimization. We posit that a neural architecture imposes a distribution over its range, whereas free-form optimization allows the latents to travel anywhere within the latent manifold. Therefore, free-form latents can learn the same output as a deep network (or a better one), if doing so indeed satisfies the training objective.
Convergence Rate:
On the experimental datasets, StarNet reaches a performance plateau mostly within a few epochs. Figure 6 shows the elastic (i.e. $|\cdot|_1 + \|\cdot\|_2$) loss over different epochs for both MNIST and CIFAR10. Note that the loss only reports the error measure and is nowhere used for training the StarNet. For reference, we include the convergence rate of an auto-encoder trained with the Adam optimizer at its default learning rate. The Auto-Encoder is trained on the elastic objective. A high negative slope of 0.5 for the leaky ReLU was found to work best overall.
Figure 4: Best viewed zoomed in and in color. Progressive learning of StarNet with the feedforward architecture on the MNIST and CIFAR10 datasets. Each epoch has two steps: solving w.r.t. the latents (denoted SL) and solving w.r.t. the weights (SW). “Init” denotes the initial reconstruction given untrained latents and untrained weights. The images start showing high fidelity to the input after the initial epoch.
Figure 5: Best viewed zoomed in and in color. Progressive learning of StarNet with the conv-unpool architecture on the MNIST and CIFAR10 datasets. Each epoch has two steps: solving w.r.t. the latents (denoted SL) and solving w.r.t. the weights (SW). “Init” denotes the initial reconstruction given untrained latents and untrained weights. The images start showing high fidelity to the input after the initial epoch.
Learning Progress over Different Epochs:
We also study the progress of the generations over different epochs of training. We do this for single-layer feedforward and conv-unpool networks (for MNIST and CIFAR10). The layers are the second layers (256 neurons) from Section 4.2. The results are shown in Figure 4 for the feedforward case and Figure 5 for the convolutional case. Both architectures show that generations beyond the initial epochs are already very close to the ground truth.

Importance of Both Steps in Learning:
The learning progress in Figures 4 and 5 demonstrates that while convergence is fast, the initial solution (i.e. SL in the first epoch) is not completely successful due to the random weights. We also initiate the training w.r.t. the weights instead, and show the results for the MNIST dataset in Figure 7. Therefore, one cannot train a StarNet by only solving w.r.t. either the latents or the weights; both steps have to be present.
Sampling for Convolutional StarNet Equations:
The system of linear equations in a convolutional StarNet can have too many equations when solving for the parameters of the network. Naturally, this makes the calculation of the pseudoinverse (or any other approach for solving the system of equations) computationally exhaustive.
Figure 6: Convergence comparison between feedforward StarNet and an Auto-Encoder (AE) over MNIST and CIFAR10. Both decoders are feedforward architectures with two layers of 128 and 256 neurons (the AE uses the inverse architecture for its encoder). The AE uses default Adam parameters with a learning rate of e− . Convergence of StarNet is reached within a few epochs.
Figure 7: Best viewed zoomed in. MNIST training for feedforward StarNet. In this case, training starts with solving w.r.t. the weights first.

To mitigate this, we study the effect of random datapoint sampling on the learning process of the model parameters. For MNIST, Figure 8 shows that a relatively small number of randomly chosen datapoints provides enough statistics to solve for the weights. Going below that seems to make the convergence too slow and the coordinate descent process of learning StarNet unstable. Still, this is a significant reduction from the total number of equations, which runs into the millions for full convolutions over MNIST. Note that all models still use all the datapoints when solving w.r.t. the latents, since datapoints and latents are assumed i.i.d.

Final Remarks on Scalability:
There are various ways of maximizing computational-node utilization when scaling up the learning process of StarNet. First and foremost, when solving both the feedforward and the convolutional StarNet for the latents, the relation between the number of datapoints and the number of computational nodes remains linear; i.e. the i.i.d. datapoints can be mapped directly to computational nodes. Furthermore, for both the feedforward and convolutional cases of StarNet, the solution for the weights can be scaled in two ways: 1) independent weights (rows of the weight matrix in the feedforward case, and kernels in the convolutional case) can be mapped to different computational nodes; 2) in case there are still many more equations than unknowns when solving for the weights (i.e. due to the large size of the dataset), a sampling approach can reduce the number of equations to a representative subset, as studied in Figure 8.
Figure 8: Best viewed zoomed in. On the left side, we show the number of samples used for solving the system of linear equations in convolutional StarNet w.r.t. the convolution weights. We observe that a modest number of images randomly chosen in each epoch is sufficient for training the weights. This sampling serves to mitigate the heavy computations of an otherwise extremely computationally heavy pseudoinverse in convolutional StarNet.
In this paper we presented a new training framework for certain generative architectures. StarNet uses no gradients during training and inference; it relies solely on layer-wise solving of systems of linear equations. This method of training converges quickly and allows for scalability across computational nodes. We presented generative experiments over four publicly available and well-studied datasets. The generation results show the capabilities of StarNet.
References

[1] David Donoho, Hossein Kakavand, and James Mammen. The simplest solution to an underdetermined system of linear equations. In 2006 IEEE International Symposium on Information Theory, pages 1924–1928. IEEE, 2006.

[2] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

[3] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in Neural Information Processing Systems, 27:2672–2680, 2014.

[4] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

[5] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.

[6] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.

[7] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[8] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[9] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3):37–52, 1987.

[10] Amir Zadeh, Yao-Chong Lim, Paul Pu Liang, and Louis-Philippe Morency. Variational auto-decoder. arXiv preprint arXiv:1903.00840, 2019.