Biologically Motivated Algorithms for Propagating Local Target Representations
Alexander G. Ororbia ∗, Rochester Institute of Technology, 102 Lomb Memorial Drive, Rochester, NY, USA 14623, [email protected]
Ankur Mali ∗, Penn State University, Old Main, State College, PA 16801, [email protected]
∗ Both authors contributed equally.
Abstract
Finding biologically plausible alternatives to back-propagation of errors is a fundamentally important challenge in artificial neural network research. In this paper, we propose a learning algorithm called error-driven Local Representation Alignment (LRA-E), which has strong connections to predictive coding, a theory that offers a mechanistic way of describing neurocomputational machinery. In addition, we propose an improved variant of Difference Target Propagation, another procedure that comes from the same family of algorithms as LRA-E. We compare our procedures to several other biologically-motivated algorithms, including two feedback alignment algorithms and Equilibrium Propagation. In two benchmarks, we find that both of our proposed algorithms yield stable performance and strong generalization compared to other competing back-propagation alternatives when training deeper, highly nonlinear networks, with LRA-E performing the best overall.
Behind the modern achievements in artificial neural network research is back-propagation of errors (Rumelhart, Hinton, and Williams 1988) (or "backprop"), the key training algorithm used in computing updates for the parameters that define the computational architectures applied to problems ranging from computer vision to speech recognition. However, though neural architectures are inspired by our current understanding of the human brain, the connections to the actual mechanisms of systems of natural neurons are often very loose, at best. More importantly, backprop faces some of the strongest neuro-biological criticisms, argued to be a highly implausible way in which learning occurs in the human brain. Among the many problems with back-propagation, some of the most prominent are: 1) the "weight transport problem", where the feedback weights that carry back error signals must be the transposes of the feedforward weights, 2) forward and backward propagation utilize different computations, and 3) the error gradients are stored separately from the activations. These problems, as originally argued in (Ororbia II et al. 2017; Ororbia et al. 2018), largely center around one critical component of backprop: the global feedback pathway needed for transporting error derivatives across the system. This pathway is necessary given the design of modern supervised learning systems: a loss measures error between a model's output units and a target, e.g., a class label, and the global pathway relates how the internal processing elements affect this error. When considering modern theories of the brain (Grossberg 1982; Rao and Ballard 1999; Huang and Rao 2011), which posit that local computations occur at multiple levels of a somewhat hierarchical structure, this global pathway should not be necessary to learn effectively. Furthermore, this pathway makes training very deep networks difficult: due to the many multiplications that underlie traversing the global feedback pathway, error gradients explode or vanish (Pascanu, Mikolov, and Bengio 2013). To fix this, gradients are kept within reasonable magnitudes by requiring layers to behave sufficiently linearly. However, this remedy creates other highly undesirable side-effects, e.g., adversarial samples (Ororbia II, Kifer, and Giles 2017), and prevents usage of neuro-biological mechanisms such as lateral competition and discrete-valued/stochastic activations (since the pathway requires precise knowledge of function derivatives (Bengio et al. 2015)). If we remove this global feedback pathway, we create a new problem: what are the learning signals for the hidden processing elements? This problem is one of the main concerns of the recently introduced
Discrepancy Reduction family of learning algorithms (Ororbia II et al. 2017). In this paper, we will develop two learning algorithms within this family: error-driven Local Representation Alignment and adaptive noise Difference Target Propagation. In experiments on two classification benchmarks, we will show that these two algorithms generalize better than a variety of other biologically motivated learning approaches, all without employing the global feedback pathway required by back-propagation.
Coordinated Local Learning Algorithms
Algorithms within the Discrepancy Reduction (Ororbia II et al. 2017) family offer computational mechanisms for two key steps when learning from patterns. These steps include:
1. Search for latent representations that better explain the input/output, also known as target representations. This creates the need for local (higher-level) objectives that will guide current latent representations towards better ones.
2. Reduce, as much as possible, the mismatch between a model's currently "guessed" representations and target representations. The sum of the internal, local losses is also defined as the total discrepancy in a system, and can also be thought of as a sort of pseudo-energy function.
This general process forms the basis of what we call coordinated local learning rules. Computing targets with these kinds of rules should not require an actual pathway, as in back-propagation, and instead makes use of top-down and bottom-up signals to generate targets. This idea is particularly motivated by the theory of predictive coding (Panichello, Cheung, and Bar 2013) (which has started to impact modern machine learning applications (Li and Liu 2018)), which claims that the brain is in a continuous process of creating and updating hypotheses (using error information) to predict the sensory input. This paper will explore two ways in which this hypothesis updating (in the form of local target creation) might happen: 1) through error-correction in Local Representation Alignment (LRA-E), and 2) through repeated encoding and decoding as in Difference Target Propagation (DTP). It should also be noted that one is not restricted to only using neural building blocks: LRA-E could be used to train stacked Gradient Boosted Decision Trees (GBDTs), which would be faster than in (Feng, Yu, and Zhou 2018), which employed a form of target propagation to calculate local updates.
The idea of learning locally, in general, is slowly becoming prominent in the training of artificial neural networks, with recent proposals including decoupled neural interfaces (Jaderberg et al. 2016) and kickback (Balduzzi, Vanchinathan, and Buhmann 2015) (which was derived specifically for regression problems). Furthermore, (Whittington and Bogacz 2017) demonstrated that neural models using simple local Hebbian updates (within a predictive coding framework) could efficiently conduct supervised learning. Far earlier approaches that employed local learning included the layer-wise training procedures that were once used to build models for unsupervised learning (Bengio et al. 2007), supervised learning (Lee et al. 2014), and semi-supervised learning (Ororbia II et al. 2015; Ororbia II, Giles, and Reitter 2015). The key problem with these older algorithms is that they were greedy: a model was built from the bottom up, freezing lower-level parameters as higher-level feature detectors were learnt.
Another important idea underlying algorithms such as LRA and DTP is that learning is possible with asymmetry, which directly resolves the weight-transport problem (Grossberg 1987; Liao, Leibo, and Poggio 2016), another strong neuro-biological criticism of backprop. This is even possible, surprisingly, if the feedback weights are random and fixed, which is at the core of two algorithms we will also compare to: Random Feedback Alignment (RFA) (Lillicrap et al. 2016) and Direct Feedback Alignment (DFA) (Nøkland 2016).
RFA replaces the transpose of the feedforward weights in backprop with a similarly-shaped random matrix, while DFA directly connects the output layer's pre-activation derivative to each layer's post-activation. It was shown in (Ororbia II et al. 2017; Ororbia et al. 2018) that these feedback loops would be better suited to generating target representations.
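To make the distinction between the three feedback schemes concrete, the NumPy sketch below computes a hidden layer's teaching signal under backprop, RFA, and DFA for a small network; the layer sizes, the weight scales, and the dummy input/target are illustrative assumptions for this sketch, not the configuration used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def dtanh(z):
    """Derivative of tanh expressed through the post-activation z = tanh(h)."""
    return 1.0 - z ** 2

# Illustrative 784-256-256-10 network; all sizes and scales are assumptions.
W1 = rng.normal(0.0, 0.05, (256, 784))
W2 = rng.normal(0.0, 0.05, (256, 256))
W3 = rng.normal(0.0, 0.05, (10, 256))
B3 = rng.normal(0.0, 0.05, (256, 10))    # fixed random feedback replacing W3.T (RFA)
B2 = rng.normal(0.0, 0.05, (256, 256))   # fixed random feedback replacing W2.T (RFA)
D2 = rng.normal(0.0, 0.05, (256, 10))    # fixed random matrix wiring output error to layer 2 (DFA)
D1 = rng.normal(0.0, 0.05, (256, 10))    # fixed random matrix wiring output error to layer 1 (DFA)

x = rng.random(784)
y = np.zeros(10); y[3] = 1.0             # dummy one-hot target

z1 = np.tanh(W1 @ x)
z2 = np.tanh(W2 @ z1)
p = np.exp(W3 @ z2); p /= p.sum()        # softmax output
e3 = p - y                               # output pre-activation error (softmax + NLL)

# Backprop: errors travel back through the transposed forward weights.
d2_bp = (W3.T @ e3) * dtanh(z2)
d1_bp = (W2.T @ d2_bp) * dtanh(z1)
# RFA: each transpose is replaced by a fixed random matrix of the same shape.
d2_rfa = (B3 @ e3) * dtanh(z2)
d1_rfa = (B2 @ d2_rfa) * dtanh(z1)
# DFA: the output error is sent directly to every hidden layer.
d2_dfa = (D2 @ e3) * dtanh(z2)
d1_dfa = (D1 @ e3) * dtanh(z1)
# In all three cases the update for, e.g., W2 is the outer product of d2 and z1.
```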
Local Representation Alignment

To concretely describe how LRA is practically implemented, we will specify how LRA is applied to a 3-layer feedforward network, or multilayer perceptron (MLP). Note that LRA generalizes to models with more layers. The pre-activities of the MLP at layer $\ell$ are denoted as $h^\ell$ while the post-activities, or the values output by the nonlinearity $\phi_\ell(\cdot)$, are denoted as $z^\ell$. The target variable used to correct the output units ($z^L$) is denoted as $y^L_z$ ($y^L_z = y$, or $y^L_z = x$ if we are learning auto-associative functions). Connecting one layer of neurons $z^{\ell-1}$, with pre-activities $h^{\ell-1}$, to another layer $z^\ell$, with pre-activities $h^\ell$, are synaptic weights $W^\ell$. The propagation equations for computing pre-activation and post-activation values for layer $\ell$ are:

$$h^\ell = W^\ell z^{\ell-1}, \quad z^\ell = \phi_\ell(h^\ell) \quad (1)$$

Before computing targets or updates, we first must define the set of local losses, one per layer of neurons except for the input neurons, that constitute the measure of total discrepancy inside the MLP, $\{\mathcal{L}_1(y^1_z, z^1), \mathcal{L}_2(y^2_z, z^2), \mathcal{L}_3(y^3_z, z^3)\}$. With losses defined, we can then explicitly formulate the error units $e^\ell$ for each layer as well, since any given layer's error units correspond to the first derivative of that layer's loss with respect to that layer's post-activation values. For the MLP's output layer, we could assume a categorical distribution, which is appropriate for 1-of-$k$ classification tasks, and use the following negative log likelihood loss:

$$\mathcal{L}_\ell(y^\ell_z, z^\ell) = -\sum_{i=1}^{|z|} y^\ell_z[i] \log z^\ell[i], \quad e^\ell = e^\ell(y^\ell_z, z^\ell) = -\frac{y^\ell_z}{z^\ell}, \quad (2)$$

where the loss is computed over all dimensions $|z|$ of the vector $z$ (where a dimension is indexed/accessed by integer $i$). Note that for this loss function, we assume that $z$ is a vector of probabilities computed by using the softmax function as the output nonlinearity, $z[i] = \frac{\exp(h[i])}{\sum_j \exp(h[j])}$. For the hidden layers, we can choose between a wider variety of loss functions, and in this paper, we experimented with assuming either a Gaussian or Cauchy distribution over the hidden units. For the Gaussian distribution (or L2 norm), we have the following:

$$\mathcal{L}_\ell(z, y) = \frac{1}{2\sigma^2}\sum_{i=1}^{|z|}(y[i] - z[i])^2, \quad e^\ell = e^\ell(y^\ell_z, z^\ell) = -\frac{(y^\ell_z - z^\ell)}{\sigma^2} \quad (3)$$

where $\sigma^2$ represents a fixed scalar variance. For the Cauchy distribution (or log-penalty), we obtain:

$$\mathcal{L}_\ell(z, y) = \sum_{i=1}^{|z|} \log\big(1 + (y[i] - z[i])^2\big), \quad e^\ell = e^\ell(y^\ell_z, z^\ell) = -\frac{2(y^\ell_z - z^\ell)}{1 + (y^\ell_z - z^\ell)^2}. \quad (4)$$

For the activation function used in calculating the hidden post-activities, we use the hyperbolic tangent, or $\phi_\ell(v) = \frac{\exp(2v) - 1}{\exp(2v) + 1}$. Using the Cauchy distribution proved particularly useful in our experiments because it encourages sparse representations and aligns nicely with the biological considerations of sparse coding (Olshausen and Field 1997) and predictive sparse decomposition (Kavukcuoglu, Ranzato, and LeCun 2010), as well as the lateral competition (Rao and Ballard 1999) that naturally occurs in groups of neural processing elements. These are relatively simple local losses for measuring the agreement between representations and targets, and future work should entail developing even better metrics.
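As a concrete reference for Equations 2-4, the short NumPy sketch below computes the three kinds of error-unit vectors from a layer's post-activation and its target; it is a minimal illustration under the assumption that all quantities are dense vectors, and the example values are placeholders rather than anything from the paper.

```python
import numpy as np

def error_categorical(y_z, z):
    """Error units for the categorical/NLL output loss (Eq. 2): -y_z / z."""
    return -y_z / z

def error_gaussian(y_z, z, sigma_sq=1.0):
    """Error units for the Gaussian local loss (Eq. 3): -(y_z - z) / sigma^2."""
    return -(y_z - z) / sigma_sq

def error_cauchy(y_z, z):
    """Error units for the Cauchy/log-penalty local loss (Eq. 4)."""
    diff = y_z - z
    return -2.0 * diff / (1.0 + diff ** 2)

# Example: a 4-unit hidden layer and an illustrative target representation.
z = np.tanh(np.array([0.2, -1.0, 0.5, 0.0]))
y_z = np.array([0.1, -0.6, 0.4, 0.1])
print(error_gaussian(y_z, z), error_cauchy(y_z, z))
```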
Algorithm 1: LRA-E target and update computations.

// Procedure for computing error units & targets
Input: sample (y, x) and Θ = {W1, W2, W3, E2, E3}
function COMPUTETARGETS((y, x), Θ)
    // Run feedforward weights to get activities
    h1 = W1 x,  z1 = φ1(h1)
    h2 = W2 z1, z2 = φ2(h2)
    h3 = W3 z2, z3 = φ3(h3)
    y3_z ⇐ y
    e3 = −(y3_z / z3)
    y2_z ← φ2(h2 − β(E3 e3))
    e2 = −(y2_z − z2)
    y1_z ← φ1(h1 − β(E2 e2))
    e1 = −(y1_z − z1)
    Λ = (z1, z2, z3, h1, h2, h3, e1, e2, e3)
    Return Λ

// Procedure(s) for computing weight updates
Input: sample (y, x) and calculations Λ
function CALCUPDATES-V1((y, x), Θ, Λ)
    ΔW3 = (e3 ⊗ φ'3(h3))(z2)^T
    ΔW2 = (e2 ⊗ φ'2(h2))(z1)^T
    ΔW1 = (e1 ⊗ φ'1(h1))(x)^T
    ΔE3 = −γ(ΔW3)^T
    ΔE2 = −γ(ΔW2)^T
    Return (ΔW1, ΔW2, ΔW3, ΔE2, ΔE3)
function CALCUPDATES-V2((y, x), Θ, Λ)
    ΔW3 = e3 (z2)^T
    ΔW2 = e2 (z1)^T
    ΔW1 = e1 (x)^T
    ΔE3 = −γ(ΔW3)^T
    ΔE2 = −γ(ΔW2)^T
    Return (ΔW1, ΔW2, ΔW3, ΔE2, ΔE3)

With local losses specified and error units implemented, all that remains is to define how targets are computed and what the parameter updates will be. At any given layer $z^\ell$, starting at the output units (in our example, $z^3$), we calculate the target for the layer below, $y^{\ell-1}_z$, by multiplying the error unit values at $\ell$ by a set of synaptic error weights $E^\ell$. This projected displacement, weighted by the modulation factor $\beta$ (in this paper, $\beta$ was set to a fixed small value, found with only minor preliminary tuning), is then subtracted from the initially found pre-activation of the layer below, $h^{\ell-1}$. This updated pre-activity is then run through the appropriate nonlinearity to calculate the final target $y^{\ell-1}_z$. This computation amounts to:

$$e^\ell = -(y^\ell_z - z^\ell), \quad \Delta h^{\ell-1} = E^\ell e^\ell \quad (5)$$
$$y^{\ell-1}_z \leftarrow \phi_{\ell-1}\big(h^{\ell-1} - \beta(\Delta h^{\ell-1})\big). \quad (6)$$

Once the targets for each layer have been found, we can then use the local loss $\mathcal{L}_\ell(y^\ell_z, z^\ell)$ to compute updates to the weights $W^\ell$ and the corresponding error weights $E^\ell$ (except for $W^1$, which has no corresponding error weights $E^1$). The update calculation for parameters at layer $\ell$ would be:

$$\Delta W^\ell = (e^\ell \otimes \phi'_\ell(h^\ell))(z^{\ell-1})^T, \quad \Delta E^\ell = -\gamma(\Delta W^\ell)^T, \quad (7)$$
or,
$$\Delta W^\ell = e^\ell (z^{\ell-1})^T, \quad \Delta E^\ell = -\gamma(\Delta W^\ell)^T \quad (8)$$

where $\otimes$ indicates the Hadamard product and $\gamma$ is a decay factor (a small value) meant to ensure that the error weights change more slowly than the forward weights. An attractive property of LRA is that the derivatives of the pointwise activation functions can be dropped, yielding the second variation of the update rule, as long as the activation function is monotonically non-decreasing in its input (for stochastic activation functions, the output distribution for a larger input should stochastically dominate the output distribution for a smaller input). This is also satisfying from a biological perspective, since it is unlikely that neurons would utilize point-wise activation derivatives in computing updates (Hinton and McClelland 1988). The update for the error weights is simply proportional to the negative transpose of the update computed for the matching forward weights, which is a computationally fast and cheap rule we propose, inspired by (Rao and Ballard 1997). In Algorithm 1, the equations in this section are combined to create the full procedure for training a 3-layer MLP (using either CALCUPDATES-V1(·) or CALCUPDATES-V2(·) to compute weight updates), assuming Gaussian local losses and their respective error units. The model is defined by Θ = {W1, W2, W3, E2, E3} (biases $c^\ell$ omitted for clarity). We will refer to Algorithm 1 as LRA-E, which easily extends to deeper networks.
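The following NumPy sketch mirrors Algorithm 1 for a 3-layer MLP with tanh hidden layers, a softmax output, and Gaussian hidden-layer losses, using the derivative-free update rule of Equation 8. The layer sizes and the values chosen for β, γ, and the learning rate are illustrative assumptions, not the settings used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [784, 256, 256, 10]            # assumed layer sizes
beta, gamma, lr = 0.1, 0.9, 0.01       # assumed hyperparameter values

W = [rng.normal(0, 0.05, (sizes[i + 1], sizes[i])) for i in range(3)]      # W1..W3
E = [rng.normal(0, 0.05, (sizes[i], sizes[i + 1])) for i in range(1, 3)]   # E2, E3

def softmax(h):
    p = np.exp(h - h.max())
    return p / p.sum()

def compute_targets(x, y):
    # Feedforward pass (Eq. 1).
    h1 = W[0] @ x;  z1 = np.tanh(h1)
    h2 = W[1] @ z1; z2 = np.tanh(h2)
    h3 = W[2] @ z2; z3 = softmax(h3)
    # Output error units (categorical loss) and hidden targets (Eqs. 5-6).
    e3 = -(y / z3)
    y2 = np.tanh(h2 - beta * (E[1] @ e3))
    e2 = -(y2 - z2)                    # Gaussian error units
    y1 = np.tanh(h1 - beta * (E[0] @ e2))
    e1 = -(y1 - z1)
    return (z1, z2, e1, e2, e3)

def calc_updates_v2(x, lam):
    # Activation-derivative-free updates (Eq. 8) plus the error-weight rule.
    z1, z2, e1, e2, e3 = lam
    dW = [np.outer(e1, x), np.outer(e2, z1), np.outer(e3, z2)]
    dE = [-gamma * dW[1].T, -gamma * dW[2].T]
    return dW, dE

# One illustrative SGD step on a dummy sample.
x = rng.random(784)
y = np.zeros(10); y[3] = 1.0
dW, dE = calc_updates_v2(x, compute_targets(x, y))
for i in range(3):
    W[i] -= lr * dW[i]
for i in range(2):
    E[i] -= lr * dE[i]
```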
In Figure 1(a), we compare the updates calculated by LRA-E (as well as DFA and our proposed DTP-σ, described later) with those given by back-propagation, after each mini-batch, by plotting the angles over the first 20 epochs of learning for a 3-layer MLP (256 units per layer) trained with stochastic gradient descent (SGD) with mini-batches of 50 image samples, using a categorical output loss and Gaussian local losses. As long as the angle of the updates computed from LRA is within 90 degrees of the updates obtained by back-propagation, LRA will move parameters in the same general direction as back-propagation (which greedily points in the direction of steepest descent) and will still find good local optima. In Figure 1(a), this does indeed appear to be the case for the MLP example. The angle, fortunately, while certainly non-zero, never deviates too far from the direction pointed by back-propagation and remains relatively stable throughout the learning process. (Observe that DFA and DTP-σ have, interestingly enough, update angles that are quite similar to LRA-E.) Alongside Figure 1(a), in Figure 1(b), we plot our neural model's total internal discrepancy, D(y, x) (or V_DD), which is a simple linear combination of all of the internal local losses for a given data point.

Figure 1: (a) Gradient angle (degrees) compared to BP, per iteration: the updates calculated by LRA-E, DFA, and DTP-σ are compared against backprop (BP). (b) Total discrepancy (V_DD) & output NLL (V_L), per iteration: how total discrepancy for an LRA-trained MLP evolves during training on Fashion MNIST, alongside the output loss.

Observe that while the (validation) output loss (V_L) continually decreases, V_DD does not always appear to do so. We conjecture that this "bump", which appears at the start of learning, is the result of the evolution of LRA-E's error weights, which are used to directly control the target generation process. So even though backprop and LRA-E might start down the same path in error space (or on the loss surface), as indicated by the initially low angle between updates, this trajectory is not ideal for LRA's units/targets. This means that the error weights will change more rapidly at training's start, resulting in targets that vary quite a bit (raising internal loss values). However, once the error weights start to converge to an approximate transpose of the feedforward weights, the process of correction becomes easier and V_DD desirably declines.
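As a small illustration of how the total discrepancy tracked in Figure 1(b) can be computed, the snippet below sums the Gaussian hidden-layer losses and the categorical output loss for one sample; the equal (unit) weighting of the terms is an assumption made for this sketch.

```python
import numpy as np

def gaussian_loss(y_z, z, sigma_sq=1.0):
    """Local Gaussian loss between a layer's target y_z and its activity z (Eq. 3)."""
    return float(np.sum((y_z - z) ** 2) / (2.0 * sigma_sq))

def output_nll(y, z_out):
    """Categorical negative log likelihood at the output layer (Eq. 2)."""
    return float(-np.sum(y * np.log(z_out + 1e-12)))

def total_discrepancy(hidden_targets, hidden_activities, y, z_out):
    """D(y, x): here simply the sum of all hidden local losses plus the output loss."""
    hidden = sum(gaussian_loss(t, z) for t, z in zip(hidden_targets, hidden_activities))
    return hidden + output_nll(y, z_out)
```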
Improving Difference Target Propagation
As mentioned earlier, Difference Target Propagation (DTP) (and also, less directly, recirculation (Hinton and McClelland 1988; O'Reilly 1996)), like LRA-E, falls under the same family of algorithms concerned with minimizing internal discrepancy, as shown in (Ororbia II et al. 2017; Ororbia et al. 2018). However, DTP takes a very different approach to computing alignment targets: instead of transmitting messages through error units and error feedback weights as in LRA-E, DTP employs feedback weights to learn the inverse of the mapping created by the feedforward weights. However, (Ororbia et al. 2018) showed that DTP struggles to assign good local targets as the network becomes deeper, i.e., more highly nonlinear, facing an initially promising albeit brief phase in learning where generalization error decreases (within the first few epochs) before ultimately collapsing (unless very specific initializations are used). One potential cause of this failure could be the lack of a strong enough mechanism to globally coordinate the local learning problems created by the encoder-decoder pairs that underlie the system. In particular, we hypothesize that this problem might be coming from the noise injection scheme, which is local and fixed, offering no adaptation to each specific layer and making some of the layerwise optimization problems more difficult than necessary. Here, we will aim to remove this potential cause through an adaptive layerwise corruption scheme.

Assuming we have a target $y^\ell_z$ calculated from above, we consider the forward weights $W^\ell$ connecting the layer $z^{\ell-1}$ to layer $z^\ell$ and the decoding weights $E^\ell$ that define the inverse mapping between the two. The first forward propagation step is the same as in Equation 1. In contrast to LRA-E's error-driven way of computing targets, we consider each pair of neuronal layers, $(z^\ell, z^{\ell-1})$, as forming a particular type of encoding/decoding cycle that will be used in computing layerwise targets. To calculate the target $y^{\ell-1}_z$, we update the original post-activation $z^{\ell-1}$ using a linear combination of two applications of the decoding weights as follows:

$$y^{\ell-1}_z = z^{\ell-1} - \big(\phi_{\ell-1}(E^\ell z^\ell) - \phi_{\ell-1}(E^\ell y^\ell_z)\big) \quad (9)$$

where we see that we decode two times, once from the original post-activation calculated from the feedforward pass of the MLP and once from the target value generated by the encoding/decoding process of the layer pair above, e.g., $(z^{\ell+1}, z^\ell)$. This will serve as the target when training the forward weights for the layer below, $W^{\ell-1}$. To train the inverse-mapping weights $E^\ell$, as required by the original version of DTP, zero-mean Gaussian noise $\epsilon \sim \mathcal{N}(0, \sigma)$ with fixed standard deviation $\sigma$ is injected into $z^{\ell-1}$, followed by re-running the encoder and the decoder on this newly corrupted activation vector. Formally, this is defined as:

$$\widehat{y}^{\ell-1}_z = z^{\ell-1} + \epsilon, \quad \widehat{z}^{\ell-1} = \phi_{\ell-1}\big(E^\ell \phi_\ell(W^\ell \widehat{y}^{\ell-1}_z)\big) \quad (10)$$

We will refer to this process as DTP. In our proposed, improved variation of DTP, or DTP-σ, we take an "adaptive" approach to the noise injection process $\epsilon$. To develop our adaptive noise scheme, we have taken some insights from studies of biological neuron systems, which show there are varying levels of signal corruption in different neuronal layers/groups (Tomko and Crapper 1974; Tolhurst, Movshon, and Dean 1983; Shadlen and Newsome 1998). It has been argued that this noise variability enhances neurons' overall ability to detect and transmit signals across a system (Shu et al. 2003; Kruglikov and Dertinger 1994; Shadlen and Newsome 1998) and, furthermore, that the presence of this noise yields more robust representations (Cordo et al. 1996; Shadlen and Newsome 1998; Faisal, Selen, and Wolpert 2008). There is also biological evidence demonstrating that an increase in the noise level across successive groups of neurons is thought to help local neural computation (Shadlen and Newsome 1998; Sarpeshkar 1998; Laughlin, de Ruyter van Steveninck, and Anderson 1998). In light of this, the standard deviation $\sigma$ of the noise process should be a function of the noise across layers, and an interesting way in which we implemented this was to make $\sigma_\ell$ (the standard deviation of the noise injection at layer $\ell$) a function of the local loss measurements. At the top layer, we can set $\sigma_L = \alpha$ (a small, fixed value of $\alpha$ worked well in our experiments). The standard deviation for the layers below would be a function of where the noise process is within the network, indexed by $\ell$. This means that:

$$\sigma_\ell = \sigma_{\ell+1} - \mathcal{L}_\ell(y^{\ell-1}_z, z^{\ell-1}) \quad (11)$$

noting that the local loss chosen for DTP is a Gaussian loss (but with the input arguments flipped: the target value is now the corrupted initial encoding and the prediction is the clean, original encoding, or $\mathcal{L}_\ell(z = y^{\ell-1}_z, y = z^{\ell-1})$). The updates to the weights are calculated by differentiating each local loss with respect to the appropriate encoder weights, i.e., $\Delta W^{\ell-1} = \frac{\partial \mathcal{L}(z^{\ell-1}, y^{\ell-1}_z)}{\partial W^{\ell-1}}$, or with respect to the decoder synaptic weights, $\Delta E^\ell = \frac{\partial \mathcal{L}(\widehat{z}^\ell, \widehat{y}^\ell_z)}{\partial E^\ell}$. Note that the order of the input arguments to each loss function for these two partial derivatives is important for obtaining the correct sign to multiply the gradients by, and furthermore for staying aligned with the original formulation of DTP (Lee et al. 2015a). As we will see in our experimental results, DTP-σ is a much more stable learning algorithm than the original DTP, especially when training deeper/wider networks. DTP-σ benefits from a stronger form of overall coordination among its internal encoding/decoding sub-problems through the pair-wise comparison of local loss values that drive the hidden layer corruption.
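To make the adaptive corruption concrete, here is a minimal NumPy sketch of one DTP-σ target-and-noise pass for a single layer pair, following Equations 9-11. The layer sizes, the value of α, and the decision to clamp σ_ℓ at zero are assumptions of this sketch, not settings prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def gaussian_loss(target, pred, sigma_sq=1.0):
    return float(np.sum((target - pred) ** 2) / (2.0 * sigma_sq))

def dtp_sigma_layer(z_prev, z_cur, y_cur, W, E, sigma_above, phi=np.tanh):
    """One encode/decode cycle for the layer pair (z_cur, z_prev).

    Returns the target for the layer below (Eq. 9), the corrupted/reconstructed
    pair used to train the decoder E (Eq. 10), and the noise level for this
    layer (Eq. 11).
    """
    # Eq. 9: difference-corrected target for the layer below.
    y_prev = z_prev - (phi(E @ z_cur) - phi(E @ y_cur))
    # Eq. 11: adapt the noise level using the local Gaussian loss.
    sigma_here = max(sigma_above - gaussian_loss(y_prev, z_prev), 0.0)  # clamp is an assumption
    # Eq. 10: corrupt the layer below, then re-encode and decode it.
    y_hat_prev = z_prev + rng.normal(0.0, sigma_here, size=z_prev.shape)
    z_hat_prev = phi(E @ phi(W @ y_hat_prev))
    return y_prev, (y_hat_prev, z_hat_prev), sigma_here

# Illustrative usage for one layer pair (all sizes and values are assumptions).
W = rng.normal(0, 0.05, (128, 256))     # encoder: z_prev (256) -> z_cur (128)
E = rng.normal(0, 0.05, (256, 128))     # decoder: z_cur (128) -> z_prev (256)
z_prev = np.tanh(rng.normal(size=256))
z_cur = np.tanh(W @ z_prev)
y_cur = np.tanh(rng.normal(size=128))   # target handed down from the layer above
alpha = 0.1                             # assumed top-layer noise level
y_prev, (y_hat, z_hat), sigma = dtp_sigma_layer(z_prev, z_cur, y_cur, W, E, alpha)
```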
A Comment on the Efficiency of LRA-E and DTP

Note that LRA-E, while a bit slower than backprop per update (given its use of the error weights to generate hidden layer targets), is much faster than DTP and DTP-σ. Specifically, if we focus on the matrix multiplications used to find targets, which make up the bulk of the computation underlying both processes, LRA-E only requires L − 1 such multiplications (one application of the error weights per hidden-layer target), while DTP and DTP-σ require several multiplications per layer. In particular, DTP has a very expensive target generation phase, requiring 2 applications of the encoder parameters (1 of these is from the network's initial feedforward pass) and 3 applications of the decoder parameters to create the targets used to train the forward weights and the inverse-mapping weights.

Experimental Results
In this section, we present experimental results of training MLPs using a variety of learning algorithms.
MNIST:
This dataset (available at http://yann.lecun.com/exdb/mnist/) contains 28×28 images with gray-scale pixel feature values in the range of [0, 255]. The only preprocessing applied to this data is to normalize the pixel values to the range of [0, 1] by dividing them by 255.
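A minimal sketch of this preprocessing step, assuming the images are already loaded as an array of 8-bit pixel values (the dummy array below simply stands in for the real dataset):

```python
import numpy as np

def preprocess(images_uint8: np.ndarray) -> np.ndarray:
    """Flatten 28x28 gray-scale images and scale pixels from [0, 255] to [0, 1]."""
    x = images_uint8.reshape(images_uint8.shape[0], -1).astype(np.float32)
    return x / 255.0

# Example with dummy data standing in for the MNIST training images.
dummy_images = np.random.randint(0, 256, size=(60000, 28, 28), dtype=np.uint8)
x_train = preprocess(dummy_images)   # shape (60000, 784), values in [0, 1]
```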
Fashion MNIST:
This database (Xiao, Rasul, and Vollgraf 2017) contains 28x28 grey-scale images of clothing items, meant to serve as a much more difficult drop-in replacement for MNIST itself. Training contains 60000 samples and testing contains 10000; each image is associated with one of 10 classes. We create a validation set of 2000 samples from the training split. Preprocessing was the same as on MNIST.

For both datasets and all models, over 100 epochs, we calculate updates over mini-batches of 50 samples. Furthermore, we do not regularize parameters any further, e.g., with drop-out (Srivastava et al. 2014) or weight penalties. All feedforward architectures for all experiments had either 3, 5, or 8 hidden layers of processing elements. The post-activation function used was the hyperbolic tangent, and the top layer was chosen to be a maximum-entropy classifier (i.e., a softmax function). The output layer objective for all algorithms was to minimize the categorical negative log likelihood. Parameters were initialized using a scheme that gave the best performance on the validation split of each dataset on a per-algorithm basis. Though we wanted to use very simple initialization schemes for all algorithms, in preliminary experiments, we found that the feedback alignment algorithms as well as DTP (and DTP-σ) worked best when using a uniform fan-in-fan-out scheme (Glorot and Bengio 2010). (Ororbia et al. 2018) confirms this result, originally showing how these algorithms often are unstable or fail to perform well using initializations based on simple uniform or Gaussian distributions. For LRA-E, however, we initialized the parameters using a zero-mean Gaussian distribution with a small variance. The choice of parameter update rule was also somewhat dependent on the learning algorithm employed. Again, as shown in (Ororbia et al. 2018), it is difficult to get good, stable performance from algorithms such as the original DTP when using simple SGD. As done in (Lee et al. 2015b), we used the RMSprop (Tieleman and Hinton 2012) adaptive learning rate scheme with a fixed global step size λ. For Backprop, RFA, DFA, and LRA-E, we were able to use SGD with a fixed step size λ.
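For illustration, the helper below sketches the two initialization schemes mentioned above (fan-in-fan-out uniform and zero-mean Gaussian); the specific Gaussian variance and the per-algorithm choice of scheme shown in the example are assumptions, not the exact values used in the experiments.

```python
import numpy as np

def init_weights(n_out: int, n_in: int, scheme: str, rng=np.random.default_rng(0)):
    """Return an (n_out, n_in) weight matrix under the named initialization scheme."""
    if scheme == "glorot_uniform":          # fan-in-fan-out (Glorot & Bengio, 2010)
        limit = np.sqrt(6.0 / (n_in + n_out))
        return rng.uniform(-limit, limit, size=(n_out, n_in))
    if scheme == "gaussian":                # zero-mean Gaussian (variance assumed here)
        return rng.normal(0.0, 0.05, size=(n_out, n_in))
    raise ValueError(f"unknown scheme: {scheme}")

# E.g., feedback-alignment/DTP runs might use the Glorot scheme,
# while LRA-E uses the simple Gaussian one.
W_fa = init_weights(256, 784, "glorot_uniform")
W_lra = init_weights(256, 784, "gaussian")
```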
Classification Performance

In this experiment, we compare all of the algorithms discussed earlier. These include back-propagation (Backprop), Random Feedback Alignment (RFA) (Lillicrap et al. 2014), Direct Feedback Alignment (DFA) (Nøkland 2016), Equilibrium Propagation (Scellier and Bengio 2017) (Equil-Prop), and the original Difference Target Propagation (Lee et al. 2015a) (DTP). Our algorithms include our proposed, improved version of DTP (DTP-σ) and the proposed error-driven Local Representation Alignment (LRA-E). The results of our experiments are presented in Tables 1 and 2. Test and training scores are reported for the set of model parameters that had the lowest validation error.
Table 1: MNIST supervised classification results. Train and test scores are reported for 3-, 5-, and 8-layer MLPs trained with Backprop, Equil-Prop, RFA, DFA, DTP, DTP-σ (ours), and LRA-E (ours).
Table 2: Fashion MNIST supervised classification results. Train and test scores are reported for 3-, 5-, and 8-layer MLPs trained with Backprop, Equil-Prop, RFA, DFA, DTP, DTP-σ (ours), and LRA-E (ours).
Table 3: Effect of the update rule (SGD, Adam, RMSprop) on LRA when training a 3-layer MLP; train and test scores are reported on MNIST and Fashion MNIST.

Figure 2: Visualization of the topmost hidden layer of a 5-layer MLP trained by (a) DFA, (b) Equil-Prop, (c) DTP-σ, and (d) LRA-E.

Observe that LRA-E is the most stable and consistently well-performing algorithm compared to the other backprop alternatives, closely followed by our DTP-σ. More importantly, algorithms like Equil-Prop and DTP appear to break down when training deeper networks, i.e., the 8-layer MLP. Note that while DTP was used to successfully train a 7-layer network of 240 units (using RMSprop) (Lee et al. 2015a), we followed the same settings reported for deeper networks and in our experiments uncovered that the algorithm begins to struggle as the layers are made wider. However, this problem is rectified using DTP-σ, leading to much more stable performance and even to cases where the algorithm completely overfits the training set (as in the case of 3 and 5 layers for MNIST). Nonetheless, LRA-E still performs best with respect to generalization across both datasets, despite using a naïve initialization scheme. Table 3 shows the results of using update rules other than SGD for LRA-E, e.g., Adam (Kingma and Ba 2014) or RMSprop (Tieleman and Hinton 2012), for a 3-layer MLP (with the same global step size for both algorithms). We see that LRA-E is compatible with other learning rate schemes and reaches better generalization performance when using them.

Figure 2 displays a t-SNE (Maaten and Hinton 2008) visualization of the topmost hidden layer of a 5-layer MLP learned by either DFA, Equil-Prop, DTP-σ, or LRA-E on Fashion MNIST samples.
Qualitatively, we see that all learning algorithms extract representations that separate out the data points reasonably well, at least in the sense that points are clustered based on clothing type. However, it appears that the LRA-E representations yield more strongly separated clusters, as evidenced by somewhat wider gaps between them, especially around the pink, blue, and black colored clusters.

Finally, DTP, as also mentioned in (Ororbia et al. 2018), appears to be quite sensitive to its initialization scheme. For both MNIST and Fashion MNIST, we trained DTP and our proposed DTP-σ with three different settings, including random orthogonal (Ortho), fan-in-fan-out (Glorot), and simple zero-mean Gaussian (G) initializations. Figure 3 shows the validation accuracy curves of DTP and DTP-σ as a function of epoch for 5- and 8-layer networks with various weight initializations. As shown in Figure 3, DTP is highly unstable as the network gets deeper, while DTP-σ is not. Furthermore, DTP-σ's performance appears to be less dependent on the weight initialization scheme. Thus, our experiments show promising evidence of DTP-σ's generalization improvement over the original DTP. Moreover, as indicated by Tables 1 and 2, DTP-σ can, overall, perform nearly as well as LRA-E.

Figure 3: Validation accuracy of DTP vs our proposed DTP-σ, as a function of epoch. (a) 5-layer MLP. (b) 8-layer MLP.
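The kind of qualitative inspection shown in Figure 2 can be reproduced with a few lines of scikit-learn; the hidden-activation matrix and label vector below are random placeholders standing in for whatever a trained model actually produces.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder stand-ins for the topmost hidden activations of a trained MLP
# on a batch of Fashion MNIST samples, plus their class labels.
hidden_acts = np.random.randn(1000, 256)
labels = np.random.randint(0, 10, size=1000)

embedding = TSNE(n_components=2, perplexity=30).fit_transform(hidden_acts)
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=5)
plt.show()
```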
Conclusions

In this paper, we proposed two learning algorithms: error-driven Local Representation Alignment and adaptive noise Difference Target Propagation. On two classification benchmarks, we show strong positive results when training deep multilayer perceptrons. With respect to other types of neural structures, e.g., locally connected ones, we would expect our proposed algorithms to work well, especially LRA-E, since the target computation/error unit mechanism is agnostic to the underlying building blocks of the feedforward model (permitting extension to models such as residual networks). Future work will include adapting these algorithms to larger-scale tasks requiring more complex, exotic architectures.
References

[Balduzzi, Vanchinathan, and Buhmann 2015] Balduzzi, D.; Vanchinathan, H.; and Buhmann, J. M. 2015. Kickback cuts backprop's red-tape: Biologically plausible credit assignment in neural networks. In AAAI, 485–491.
[Bengio et al. 2007] Bengio, Y.; Lamblin, P.; Popovici, D.; Larochelle, H.; et al. 2007. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems.
[Bengio et al. 2015] Bengio, Y., et al. 2015. Towards biologically plausible deep learning. arXiv preprint arXiv:1502.04156.
[Cordo et al. 1996] Cordo, P.; Inglis, J. T.; Verschueren, S.; Collins, J. J.; Merfeld, D. M.; Rosenblum, S.; Buckley, S.; and Moss, F. 1996. Noise in human muscle spindles. Nature.
[Feng, Yu, and Zhou 2018] Feng, J.; Yu, Y.; and Zhou, Z.-H. 2018. Multi-layered gradient boosting decision trees. arXiv preprint arXiv:1806.00007.
[Glorot and Bengio 2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 249–256.
[Grossberg 1982] Grossberg, S. 1982. How does a brain build a cognitive code? In Studies of Mind and Brain. Springer. 1–52.
[Grossberg 1987] Grossberg, S. 1987. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science.
[Huang and Rao 2011] Huang, Y., and Rao, R. P. 2011. Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science.
[Jaderberg et al. 2016] Jaderberg, M., et al. 2016. Decoupled neural interfaces using synthetic gradients. arXiv preprint arXiv:1608.05343.
[Kavukcuoglu, Ranzato, and LeCun 2010] Kavukcuoglu, K.; Ranzato, M.; and LeCun, Y. 2010. Fast inference in sparse coding algorithms with applications to object recognition. arXiv preprint arXiv:1010.3467.
[Kingma and Ba 2014] Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Kruglikov and Dertinger 1994] Kruglikov, I. L., and Dertinger, H. 1994. Stochastic resonance as a possible mechanism of amplification of weak electric signals in living cells. Bioelectromagnetics.
[Lee et al. 2014] Lee, C.-Y., et al. 2014. Deeply-supervised nets. arXiv:1409.5185 [cs, stat].
[Lee et al. 2015a] Lee, D.-H.; Zhang, S.; Fischer, A.; and Bengio, Y. 2015a. Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 498–515. Springer.
[Lee et al. 2015b] Lee, D.-H.; Zhang, S.; Fischer, A.; and Bengio, Y. 2015b. Difference target propagation. In Proceedings of the 2015th European Conference on Machine Learning and Knowledge Discovery in Databases - Volume Part I, ECMLPKDD'15, 498–515. Switzerland: Springer.
[Li and Liu 2018] Li, J., and Liu, H. 2018. Predictive coding machine for compressed sensing and image denoising. In AAAI.
[Liao, Leibo, and Poggio 2016] Liao, Q.; Leibo, J. Z.; and Poggio, T. A. 2016. How important is weight symmetry in backpropagation? In AAAI, 1837–1844.
[Lillicrap et al. 2014] Lillicrap, T. P.; Cownden, D.; Tweed, D. B.; and Akerman, C. J. 2014. Random feedback weights support learning in deep neural networks. arXiv preprint arXiv:1411.0247.
[Lillicrap et al. 2016] Lillicrap, T. P.; Cownden, D.; Tweed, D. B.; and Akerman, C. J. 2016. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications.
[Nøkland 2016] Nøkland, A. 2016. Direct feedback alignment provides learning in deep neural networks. In Advances in Neural Information Processing Systems, 1037–1045.
[Olshausen and Field 1997] Olshausen, B. A., and Field, D. J. 1997. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research.
[Ororbia et al. 2018] Ororbia, A. G.; Mali, A.; Kifer, D.; and Giles, C. L. 2018. Conducting credit assignment by aligning local representations. arXiv preprint arXiv:1803.01834.
[Ororbia II et al. 2015] Ororbia II, A. G.; Reitter, D.; Wu, J.; and Giles, C. L. 2015. Online learning of deep hybrid architectures for semi-supervised categorization. In Machine Learning and Knowledge Discovery in Databases (Proceedings, ECML PKDD 2015), volume 9284 of Lecture Notes in Computer Science. Porto, Portugal: Springer. 516–532.
[Ororbia II et al. 2017] Ororbia II, A. G.; Haffner, P.; Reitter, D.; and Giles, C. L. 2017. Learning to adapt by minimizing discrepancy. arXiv preprint arXiv:1711.11542.
[Ororbia II, Giles, and Reitter 2015] Ororbia II, A. G.; Giles, C. L.; and Reitter, D. 2015. Online semi-supervised learning with deep hybrid boltzmann machines and denoising autoencoders. arXiv preprint arXiv:1511.06964.
[Ororbia II, Kifer, and Giles 2017] Ororbia II, A. G.; Kifer, D.; and Giles, C. L. 2017. Unifying adversarial training algorithms with data gradient regularization. Neural Computation.
[Rao and Ballard 1997] Rao, R. P., and Ballard, D. H. 1997. Dynamic model of visual recognition predicts neural response properties in the visual cortex. Neural Computation.
[Xiao, Rasul, and Vollgraf 2017] Xiao, H.; Rasul, K.; and Vollgraf, R. 2017. Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.