Feedback alignment in deep convolutional networks
Theodore H. Moskovitz, Ashok Litwin-Kumar, and L.F. Abbott
Mortimer B. Zuckerman Mind, Brain and Behavior Institute, Department of Neuroscience, Columbia University, New York, NY
Department of Computer Science, Columbia University, New York, NY
* [email protected]

Abstract
Several recent studies have identified similarities between neural representations in biological networks and in deep artificial neural networks. This has led to renewed interest in developing analogies between the backpropagation learning algorithm used to train artificial networks and the synaptic plasticity rules operative in the brain. These efforts are challenged by biologically implausible features of backpropagation, one of which is a reliance on symmetric forward and backward synaptic weights. A number of methods have been proposed that do not rely on weight symmetry but, thus far, these have failed to scale to deep convolutional networks and complex data. We identify principal obstacles to the scalability of such algorithms and introduce several techniques to mitigate them. We demonstrate that a modification of the feedback alignment method that enforces a weaker form of weight symmetry, one that requires agreement of weight sign but not magnitude, can achieve performance competitive with backpropagation. Our results complement those of Bartunov et al. (2018) and Xiao et al. (2018b) and suggest that mechanisms that promote alignment of feedforward and feedback weights are critical for learning in deep networks.
While the hierarchical processing performed by deep neural networks is inspired by the brain, there are a number of fundamental differences between these artificial networks and their biological counterparts. In particular, the use of the backpropagation (BP) algorithm (Rumelhart et al., 1986) to perform gradient descent, which is central to the optimization of artificial networks, requires several assumptions that are difficult to reconcile with biology. Objections include the separation of learning and inference into two distinct phases and a requirement of symmetric synaptic connectivity between forward and backward paths through the network, an issue known as the weight transport problem (Grossberg, 1987). While feedback connections are common, such symmetry has not yet been observed in the brain.
Feedback alignment (FA), a modification to BP in which this symmetry is broken by replacing the forward weights with randomized connections for the backward pass, avoids the weight transport problem (Lillicrap et al., 2016). While this method exhibits performance competitive with backpropagation in simple fully-connected networks (Lillicrap et al., 2016; Nøkland, 2016), it has performed poorly when applied to deeper convolutional architectures and complex datasets (Liao et al., 2015; Bartunov et al., 2018). In this work, we explore the obstacles that hinder the performance of FA in deeper networks and present modifications that allow these methods to remain competitive with BP. We also experiment with the enforcement of fixed excitatory (E) and inhibitory (I) connectivity, and discuss its implications for learning in both the brain and artificial networks.
The weight transport problem for artificial neural networks was identified early on (Grossberg, 1987; Crick, 1989; Zipser and Rumelhart, 1993). While a number of potential solutions had been proposed previously (Crick, 1989; Brandt and Lin, 1996; Hinton, 2003), the FA method generated substantial interest because it required no assumptions on the structure of the feedback weights used to convey error signals, instead taking them to be fixed and random (Lillicrap et al., 2016). Initial work demonstrated that FA was competitive with BP on the MNIST handwritten digit dataset and a random input-output task in multilayer fully-connected networks. In these networks, it was observed that as training progresses, the angle between the BP gradient and the FA error signal converges from approximately orthogonal to roughly 45°, meaning that the FA weight updates are correlated with, but not identical to, those in BP.

Liao et al. (2015) applied FA to convolutional neural networks (CNNs), testing performance on a variety of tasks including visual and auditory classification. Without additional modifications, the performance of FA was substantially worse than that of BP, in contrast to the earlier experiments on fully-connected networks. To achieve competitive performance, the authors made three modifications to the basic algorithm. The first of these is a technique termed uniform sign-concordant feedback (uSF), in which the feedback matrix B is set to B = sign(W), where W represents the forward weights. The second is the addition of Batch Normalization (Ioffe and Szegedy, 2015), and the third is a technique termed Batch Manhattan, in which the magnitude of the gradient is discarded. Batch Manhattan in particular was introduced to avoid vanishing/exploding gradients arising from the inherent weight asymmetry of FA.
In this paper we demonstrate several alternative mechanisms for avoiding vanishing/exploding gradients without discarding the magnitude of the error signal.

Nøkland (2016) introduced direct feedback alignment (DFA), in which the output layer error signal is propagated directly to all upstream layers instead of through adjacent layers as in standard BP and FA. This idea was extended by Baldi et al. (2016) to include residual connections from downstream layers other than the output layer. Here, we further develop these ideas into new gradient methods and demonstrate their success on deeper models than have been used previously.

Recently, Bartunov et al. (2018) presented results testing FA on CIFAR-10 and ImageNet architectures. It is important to note that, in the interest of biological plausibility, their models eschewed the weight sharing in convolutional layers that substantially improves generalization performance in most CNNs. Our models do take advantage of weight sharing to improve performance. A number of other approaches seek to further improve biological plausibility through the elimination of a distinct error signal altogether (Le Cun, 1986; Bengio, 2014; Lee et al., 2015) or the use of segregated dendrites and continuous learning (Guergiuev et al., 2016; Sacramento et al., 2018), but these approaches have also struggled when applied to deep networks and are outside the scope of our investigation.

During the final stages of the preparation of this manuscript, Xiao et al. (2018b) released results similar to ours on even deeper networks. Our results are consistent with theirs, and we extend the analysis by studying networks with fixed excitatory and inhibitory connections as well as several different modifications to FA. Together, these results provide strong evidence that sign-concordant feedback improves performance compared to random feedback.
We begin by describing the implementation of FA in convolutional networks, which is similar to its implementation in fully-connected networks. For a convolutional layer $l$, the affine pre-activation $u^l$ and the $i$th nonlinear feature map activation $I^l_i$ are calculated as

$$u^{l+1}_i = w^{l+1}_i * I^l + b^{l+1}, \qquad I^{l+1}_i = f(u^{l+1}_i), \tag{1}$$

where $w_i$ is the $i$th kernel, $b$ is the bias term, $*$ denotes convolution, and $f$ is the activation function. Assuming a loss function $J$, the gradient as used in backpropagation at layer $l$ is

$$\delta^l_i = \frac{\partial J}{\partial u^l_i} = \left(\tilde{w}^{l+1}_i * \delta^{l+1}_i\right) \odot f'(u^l_i), \tag{2}$$

where $\odot$ is component-wise multiplication and $\tilde{w}$ denotes a rotation of $w$ by 180°, i.e., a flipped kernel. FA replaces the feedforward kernel with a separate feedback matrix $B$ to produce the error signal

$$\delta^l_i = \frac{\partial J}{\partial u^l_i} = \left(B^{l+1}_i * \delta^{l+1}_i\right) \odot f'(u^l_i). \tag{3}$$

In both methods, the parameter updates are calculated similarly, with $\Delta W^l \propto \delta^l (x^{l-1})^T$ in fully-connected networks and $\Delta W^l \propto \delta^l * I^{l-1}$ in CNNs.
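As an illustrative sketch (not our training code), the difference between the BP and FA error signals can be written out for a small fully-connected layer in NumPy; all layer sizes and variable names here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(u):
    return np.maximum(u, 0.0)

def relu_grad(u):
    return (u > 0).astype(float)

# A hypothetical two-layer network: x -> u1 -> h -> u2 -> y
n_in, n_hid, n_out = 8, 16, 4
W1 = rng.normal(0, 0.1, (n_hid, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hid))
B2 = rng.normal(0, 0.1, (n_out, n_hid))  # fixed random feedback replacing W2

x = rng.normal(size=n_in)
u1 = W1 @ x
h = relu(u1)
y = W2 @ h
target = rng.normal(size=n_out)
delta2 = y - target  # output-layer error for a squared loss

# Backpropagation carries the error back through the transposed forward weights...
delta1_bp = (W2.T @ delta2) * relu_grad(u1)
# ...while feedback alignment uses the fixed random matrix B2 instead.
delta1_fa = (B2.T @ delta2) * relu_grad(u1)

# The parameter update has the same outer-product form in both cases.
dW1_fa = np.outer(delta1_fa, x)
```

The two error signals differ only in the matrix used for the backward pass, which is what makes FA a drop-in substitute for BP.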
Figure 1: Top-1 and Top-5 test errors on ImageNet for BP, FA without uSF or normalization (FA), FA with uSF and the initialization method (FA-uSF Init., with and without fixed E/I connectivity), and FA with uSF and strict normalization (FA-uSF SN). See text for method details.
We now describe modifications to FA that improve performance by reducing the possibility of vanishing or exploding gradients and encouraging angular alignment between the feedforward and feedback weights. The reasons for their effectiveness are explored in detail in Section 5. For uSF, the first feedback matrix setting we use is

$$B^l_t = |B^l_0| \odot \mathrm{sign}(W^l_t), \tag{4}$$

where $|\cdot|$ is the element-wise absolute value function and $t$ denotes the training iteration. We call this technique, which uses no information about the magnitude of the current forward weights, the initialization (Init.) method.

One modification of the initialization method that we also tested was to enforce fixed excitatory/inhibitory connectivity, freezing the sign of the forward weights after a certain number of epochs of training. This imposes a new constraint on the network's synaptic connections, keeping them either excitatory (positive) or inhibitory (negative). Because under uSF the feedback weights equal the sign of the forward weights, this also results in a constant feedback matrix. We call this the excitatory/inhibitory (E/I) method. We note that although this method prevents single synapses from changing sign during training, individual neurons may still form synapses of different signs, inconsistent with Dale's Law.

The next setting we use, which incorporates the norm of the forward weights, is

$$B^l_t = \|W^l_t\| \, \frac{\mathrm{sign}(W^l_t)}{\|\mathrm{sign}(W^l_t)\|}. \tag{5}$$

We call this the strict normalization (SN) method. Note that when uSF is used, some form of explicit normalization is required, as otherwise the magnitude of each feedback weight is ±1 regardless of the magnitude of the forward weights.

Each method described above was applied to deep CNNs tested on visual classification. Models were trained on the MNIST handwritten digits dataset, the CIFAR-10 image set, and the ImageNet dataset.
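The two feedback-matrix constructions in Eqs. (4) and (5) can be sketched in NumPy as follows; this is an illustration with arbitrary shapes, not our training implementation:

```python
import numpy as np

def feedback_init(B0, W_t):
    """Initialization (Init.) method, Eq. (4): keep the magnitudes of the
    initial feedback matrix B0 and copy the signs of the current forward
    weights W_t."""
    return np.abs(B0) * np.sign(W_t)

def feedback_strict_norm(W_t):
    """Strict normalization (SN), Eq. (5): sign-concordant feedback rescaled
    so its norm matches the norm of the current forward weights."""
    S = np.sign(W_t)
    return np.linalg.norm(W_t) * S / np.linalg.norm(S)

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))    # hypothetical current forward weights
B0 = rng.normal(size=(4, 4))   # hypothetical initial feedback matrix

B_init = feedback_init(B0, W)
B_sn = feedback_strict_norm(W)
```

Both constructions are sign-concordant with the forward weights; they differ only in where the feedback magnitudes come from.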
On all three datasets, we implemented relatively simple baseline architectures following the basic paradigm established by LeCun et al. (1998), namely several convolutional layers each followed by a pooling layer, with one or several fully-connected layers on top. To test the performance of our proposed methods on models with increased depth, we also applied them to more complicated architectures with nine convolutional layers (Springenberg et al., 2014) on CIFAR-10 and ImageNet. The exact architecture details are presented in Supplementary Tables 2 and 3. In addition to FA, we trained models with direct feedback alignment (DFA) and a new method we call dense feedback alignment (DenseFA), which adds residual feedback connections from every downstream layer. Unfortunately, the memory requirements of these methods precluded their use on the larger models we trained.

Method             MNIST  CIFAR-10 1  CIFAR-10 2  ImageNet 1  ImageNet 2
BP                 0.8    17.2        11.0        79.5        45.5
BP + Noise         0.8    17.4        11.0        79.2        46.0
BP + Alignment     0.9    17.4        11.2        79.4        45.9
FA                 1.1    26.6        35.6        95.2        94.5
FA-uSF Init. E/I   0.9    18.9        17.8        86.9        67.8
FA-uSF Init.       0.7    17.6        13.1        79.6        60.1
FA-uSF SN          0.7    17.7        12.6        78.9        54.4
DFA                1.0    28.6        -           -           -
DenseFA            0.7    16.9        -           -           -
BP Const.          -      -           -           -           46.9
FA-uSF Const.      -      -           -           -           51.2
FA-uSF Const. E/I  -      -           -           -           66.1

Table 1: Top-1 test error (%). The feedback alignment methods are competitive with, and in some cases exceed, backpropagation performance. 'uSF' denotes the use of sign-concordant feedback, 'Init.' denotes the initialization method, and 'SN' denotes the strict normalization method. 'E/I' denotes a model with frozen excitatory and inhibitory connections. These approaches are detailed in Section 3.1. 'Const.' denotes a model for which the L2-norms of the feedforward weights were fixed at initialization. This approach is described in Section 5.4. Model architectures are described in detail in Supplementary Section 7.2.

To further investigate the relationship between FA and BP, we also examined the performance of models that were trained with BP but with either added noise (BP + Noise) or with weight matrices that were forced to align with arbitrary random matrices (BP + Alignment). These techniques and their motivations are discussed in greater detail in Section 5.3. On ImageNet, as a means of circumventing the need for normalization altogether, we also experimented with BP (BP Const.) and FA (FA-uSF Const.) models with weight norms constrained to be constant over training (see Section 5.4 for details).

The MNIST model was trained for 25 epochs, the LeNet-style CIFAR-10 model was trained for 154 epochs, and the all-convolutional CIFAR-10 architecture was trained for 256 epochs. Both ImageNet models were trained for 100 epochs. In the E/I condition, we froze the signs of the weights (by clipping their values at just above or below zero) after 5% of the training time (i.e., 5 epochs on ImageNet). All models were trained with the Adam optimizer (Kingma and Ba, 2014). Further training details can be found in Supplementary Section 7.1. Test results are summarized in Table 1, with ImageNet test error curves plotted in Figure 1. Our results improve on those reported by Bartunov et al. (2018) and are consistent with those of Xiao et al. (2018b).

We now describe in more detail the relevant considerations when extending FA to deeper networks and the modifications we made to improve performance.
Consider a neural network with depth $L$ trained to minimize loss $J$. Assume for simplicity that the layers are sized equally so that each weight matrix is the same size and has a similar distribution of weights. The gradient with respect to the first-layer activation $x_0$ is

$$\frac{\partial J}{\partial x_0} = \frac{\partial J}{\partial x_L}\frac{\partial x_L}{\partial u_L}\frac{\partial u_L}{\partial x_{L-1}}\frac{\partial x_{L-1}}{\partial u_{L-1}}\frac{\partial u_{L-1}}{\partial x_{L-2}}\cdots\frac{\partial x_1}{\partial u_1}\frac{\partial u_1}{\partial x_0}. \tag{6}$$

Observe that at each layer

$$\frac{\partial u_l}{\partial x_{l-1}} = \begin{cases} (W^l)^T & \text{if BP} \\ B^l & \text{if FA.} \end{cases} \tag{7}$$

If the weights are assumed to be independent, then, approximately,

$$\frac{\|\nabla^{BP}_{x_0} J\|}{\|\nabla^{FA}_{x_0} J\|} \propto \prod_{l=1}^{L} \frac{\|(W^l)^T\|}{\|B^l\|}, \tag{8}$$

where $\|\cdot\|$ denotes the Euclidean norm. Because $\|W^l\|$ changes over the course of training while $\|B^l\|$ remains fixed, a network trained with feedback alignment runs a risk of experiencing vanishing or exploding gradients even if BP does not, depending on the ratio $\|B^l\|/\|W^l\|$. Moreover, this problem is exponential in the depth of the network (Figure 2a).
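A quick numerical sketch of Eq. (8), with hypothetical layer scales, shows how even a modest per-layer norm mismatch compounds with depth:

```python
import numpy as np

rng = np.random.default_rng(2)
L, n = 9, 64  # depth and layer width; nine layers, as in our deeper models

def norm_ratio(w_scale, b_scale):
    """Product over layers of ||W^l|| / ||B^l||: the factor by which the
    FA error signal is mis-scaled relative to BP (cf. Eq. 8)."""
    ratio = 1.0
    for _ in range(L):
        W = rng.normal(0, w_scale, (n, n))
        B = rng.normal(0, b_scale, (n, n))
        ratio *= np.linalg.norm(W) / np.linalg.norm(B)
    return ratio

# If training drifts the forward weights to ~1.5x the scale of the fixed
# feedback weights, the mismatch compounds roughly as 1.5**L...
drifted = norm_ratio(1.5 / np.sqrt(n), 1.0 / np.sqrt(n))
# ...whereas matched scales keep the overall ratio near 1.
matched = norm_ratio(1.0 / np.sqrt(n), 1.0 / np.sqrt(n))
```

The drift factor of 1.5 is an arbitrary stand-in for training dynamics; the point is only that the error-signal scale is exponential in depth.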
Figure 2: The layer-wise gradient ratios between FA and BP for a naïve initialization (a) and using a variance-preserving initialization such as the one devised by Glorot and Bengio (2010) (b). This demonstrates the importance of initialization in controlling for vanishing or exploding gradients as the depth of the network increases.

There are two possible families of approaches to solving this issue. First, careful initialization of the forward and backward weights can dramatically reduce fluctuations in the scale of activations and error signals between layers. Initialization methods designed to control variance from layer to layer as a means of effectively training deep networks are common (Glorot and Bengio, 2010; Xiao et al., 2018a). One example is the initialization strategy introduced by Glorot and Bengio (2010), in which weights are initialized with a variance of 2/(n_in + n_out), where n_in is the number of input connections from the previous layer to neurons of the subsequent layer and n_out is the number of output connections. In BP, this method controls the variance of both the forward activations and the backward gradients by averaging the number of incoming and outgoing connections. However, in FA the forward and backward passes are decoupled, and the most effective initialization is therefore to set the variance of the forward weights to 1/n_in and the variance of the fixed backward weights to 1/n_out. Note that n_out is the number of incoming connections that a layer receives during the backward pass. While we found this method to be effective (Figure 2b), depth remains a challenge, as the distribution of the forward weights can drift during training. With careful initialization, the scale of the drift is constant to within an order of magnitude, but there is still an adverse effect on performance.

The second family of approaches instead explicitly manages the sizes of the weights. For example, Liao et al. (2015) introduced a parameter update rule termed Batch Manhattan, which discards the magnitude of the error signal completely. That is, whereas the gradient descent weight update is proportional to ∂J/∂W, Batch Manhattan sets it proportional to sign(∂J/∂W). In discarding the magnitude of the gradient, this method ignores the impact that the size of this signal has on the effective learning rate of the network. The use of an adaptive, per-parameter optimization algorithm such as Adam (Kingma and Ba, 2014) can ameliorate this to a degree, but even in this case we found a reduction in performance. We found that a more effective method was either to constrain the norms of the feedback matrices to be equal to those of their feedforward counterparts, as in SN, or to simply scale by the initialized feedback weights, as in the initialization method.
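The contrast between Batch Manhattan and a standard gradient step can be illustrated directly; this is a toy sketch, and the learning rate and gradient values are arbitrary:

```python
import numpy as np

def batch_manhattan_update(grad, lr):
    """Batch Manhattan (Liao et al., 2015): keep only the sign of the
    gradient, discarding its magnitude."""
    return -lr * np.sign(grad)

def sgd_update(grad, lr):
    """Standard gradient descent: the update magnitude tracks the gradient."""
    return -lr * grad

grad = np.array([0.001, -2.0, 0.5])  # hypothetical gradient components
bm = batch_manhattan_update(grad, lr=0.01)
sgd = sgd_update(grad, lr=0.01)
# Batch Manhattan moves a tiny and a huge gradient component by the same
# amount, which is what discards the error-signal magnitude.
```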
When investigating FA in shallow feedforward networks, Lillicrap et al. (2016) observed that the signal calculated by FA and the gradient used by BP are roughly orthogonal at initialization but that the angle between them converges to approximately 45° by the end of training. While we also observed this convergence in shallow networks, deeper networks exhibited less alignment. More precisely, while we found that the angle converged to some degree for the topmost layers, this was not the case for the lower layers (Figure 3a). In order to maintain alignment, we used the uSF method introduced by Liao et al. (2015). When uSF is applied, the angle quickly drops below 45°, but then rises slightly and levels off (Figure 3b). This is likely because at the beginning of training, depending on the initialization method, forward weight values are more tightly distributed than at the end of training.
Figure 3: The layer-wise angular alignment between the gradient computed with BP and with (a) FA, (b) FA with uSF, and (c) BP with an alignment constraint. Although the angles are comparable, constrained BP still slightly outperforms FA (Table 1). These results demonstrate that viable solutions can be found without strictly following the gradient.
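The angular alignment of the kind plotted in Figure 3 can be computed from two flattened error signals; this is a generic sketch rather than our instrumentation code, and the signals below are synthetic:

```python
import numpy as np

def gradient_angle(g1, g2):
    """Angle in degrees between two flattened gradient vectors."""
    cos = np.dot(g1, g2) / (np.linalg.norm(g1) * np.linalg.norm(g2))
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

rng = np.random.default_rng(4)
g_bp = rng.normal(size=1000)           # stand-in for a BP gradient
g_random = rng.normal(size=1000)       # unrelated signal: ~90 degrees away
g_aligned = g_bp + 0.5 * g_random      # partially aligned signal
```

In high dimensions, two independent random vectors are nearly orthogonal, which is why FA starts near 90° and why any systematic drop below that indicates alignment.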
Because of the relationship between alignment and performance, we investigated models that were trained with BP but with modifications that mimic features of FA. These included either adding noise to the error signal or forcing the forward weights to align with an unrelated matrix (see Table 1 for results). In the first condition, we added noise drawn from a normal distribution centered at zero with variance approximately equal to that of the BP gradients. The angle between the noisy learning signal and the true gradient remained at roughly 45° throughout training, comparable to the alignment achieved by FA with uSF. This did not substantially reduce performance (Table 1), indicating that the reduction in performance for FA cannot be accounted for by gradient noise that is centered on the BP gradient.

In the second condition, we applied an L2 penalty to the difference between the model weights $\theta = \{w^1, \ldots, w^L\}$ and a random set of target matrices $\Lambda = \{v^1, \ldots, v^L\}$ (where $L$ is the depth of the network) in addition to the cross-entropy loss:

$$J(y, \hat{y}, \theta, \Lambda) = -\sum_{k=1}^{K} y_k \log \hat{y}_k + \lambda \sum_{l=1}^{L} \sum_{i,j} \left(w^l_{ij} - v^l_{ij}\right)^2, \tag{9}$$

where $y$ and $\hat{y}$ are the true labels and predicted class probabilities, respectively, $K$ is the number of label classes, and $\lambda$ is the regularization weight. This penalty forces the forward weights to align with a set of fixed random matrices, as in FA, but without using the separate matrix for error propagation (Figure 3). We found that a suitably chosen $\lambda$ was effective at producing an alignment comparable to that of FA. Performance did not noticeably suffer as a result of either of these changes. This indicates that the constraint of aligning with an arbitrary matrix does not in itself limit the performance of deep networks.
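The alignment penalty in Eq. (9) amounts to an L2 pull toward fixed random target matrices added to the task loss. A minimal sketch, with arbitrary layer shapes and λ:

```python
import numpy as np

def alignment_penalty(weights, targets, lam):
    """L2 penalty pulling each forward weight matrix toward a fixed random
    target matrix (the regularization term of Eq. 9, a sketch of the
    'BP + Alignment' condition)."""
    return lam * sum(np.sum((w - v) ** 2) for w, v in zip(weights, targets))

rng = np.random.default_rng(5)
weights = [rng.normal(size=(8, 8)) for _ in range(3)]   # hypothetical layers
targets = [rng.normal(size=(8, 8)) for _ in range(3)]   # fixed random targets
penalty = alignment_penalty(weights, targets, lam=0.1)
```

In practice this term would simply be added to the cross-entropy loss before differentiation, so the optimizer trades off task accuracy against alignment.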
While the various normalization methods we have introduced are effective at reducing the effect that the changing magnitude of the feedforward weights has on FA over time, to completely circumvent this issue, we trained several models on ImageNet in which the norm of the forward weights was adjusted after each training iteration. Specifically, the weights were initialized normally, and after each iteration $t$ the weights at each layer $l$ were rescaled as follows:

$$w^l_t \leftarrow \|w^l_0\| \frac{w^l_t}{\|w^l_t\|}, \tag{10}$$

where $\|w^l_0\|$ is the L2-norm of the initial weights. Using a variance-preserving initialization method (Figure 2), this condition allows us to evaluate the effectiveness of FA without the need for additional mechanisms to normalize the backward weights as the forward weights change. The results demonstrate that constraining the norm in this way does not substantially affect the performance of BP, while resulting in improved performance using FA, with 51.2% Top-1 error, further narrowing the gap between the two methods (Table 1, Figure 4).

Method       Top-1  Top-5
BP           46.9   22.6
FA-uSF       51.2   26.6
FA-uSF E/I   66.1   35.2
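The rescaling of Eq. (10) is a one-line operation per layer; a sketch with arbitrary shapes and a simulated update standing in for a training step:

```python
import numpy as np

def rescale_to_initial_norm(w_t, init_norm):
    """After each update, rescale a layer's weights so their L2 norm equals
    its value at initialization (the 'Const.' condition, Eq. 10)."""
    return init_norm * w_t / np.linalg.norm(w_t)

rng = np.random.default_rng(6)
w0 = rng.normal(size=(16, 16))        # hypothetical initial weights
init_norm = np.linalg.norm(w0)

w = w0 + 0.1 * rng.normal(size=(16, 16))  # a simulated training update
w = rescale_to_initial_norm(w, init_norm)
# The direction of w can still change freely; only its overall scale is pinned.
```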
Figure 4: Top-1 (solid) and Top-5 (dotted) ImageNet test results and error curves for models trained with the L2-norms of their forward weights fixed at the beginning of training. This condition eliminates the need for the training-time normalization methods and further narrows the gap between BP and FA.

Our results in Table 1 and Figure 4 demonstrate that freezing weight signs has a significant negative impact on learning. If weight signs stabilize over the course of training, constraining them to remain either excitatory or inhibitory after this stabilization should have a minimal effect on performance. However, if weights do not stabilize, enforcing sign constraints may negatively impact performance even if the constraints are enforced after a large number of training epochs. To assess the stability of weight signs over training, we tracked the cumulative fraction of weights that exhibited a change in sign as a function of training iterations (Figure 5). Even 300,000 mini-batches into training, many weight signs were still changing, indicating that excitatory or inhibitory identities are not fixed even late in training. Accordingly, constraining weight signs dramatically reduces the speed of learning, even if the constraints are applied late in training. It is unclear what impact the choice of initialization has on this phenomenon. However, these results appear to pose a substantial challenge for biologically plausible models in which not only the signs of single synapses, but of all synapses formed by an individual neuron, are fixed in accordance with Dale's Law.
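The cumulative sign-change statistic of Figure 5 can be sketched as follows, with a random walk standing in for actual training updates (all scales here are arbitrary):

```python
import numpy as np

def sign_change_fraction(w_init, w_t):
    """Fraction of weights whose sign differs from its value at initialization."""
    return np.mean(np.sign(w_init) != np.sign(w_t))

rng = np.random.default_rng(7)
w_init = rng.normal(size=10000)
w = w_init.copy()
fractions = []
for step in range(100):
    w += 0.05 * rng.normal(size=w.shape)  # stand-in for training updates
    fractions.append(sign_change_fraction(w_init, w))
# If updates keep perturbing weights near zero, the cumulative fraction of
# sign flips keeps growing rather than saturating early.
```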
Taken together, these results extend previous work and demonstrate that modified FA learning algorithms can perform with accuracy competitive with BP even for deep CNNs. Necessary modifications include controlling the magnitude of the error signal, which can be accomplished using the normalization methods we investigated, and encouraging alignment, which can be accomplished with sign-concordant feedback. These conclusions are consistent with those of Xiao et al. (2018b). In our simulations, we found that employing these methods along with fixed weight norms allows our model to reach an ImageNet Top-1 error of 51.2%, which is competitive with the same architecture trained with BP. A topic of experimental interest is identifying biological mechanisms with roles analogous to these methods.

Homeostatic mechanisms regulating feedback connections could play a role in normalizing forward and backward synaptic weights (Turrigiano and Nelson, 2004). We studied a variety of mechanisms to accomplish this normalization, from explicitly scaling the feedback weights to match the feedforward weights in norm to constraining the feedforward weights to have an unchanging norm. We found that these methods improved upon methods that ignore gradient norms (Liao et al., 2015). Fixing the feedforward weight norm (Figure 4) led to the best results, suggesting that regulating the forward pass and normalizing the backward weights appropriately may be sufficient to achieve high performance. We focused only on instantaneous normalization, but in the future, it would be valuable to experiment with scaling the weights with a delay that reflects the time constant of homeostatic regulation.
Figure 5: Allowing weights to change sign has a dramatic effect on network performance. The top panel examines the effect of enforcing fixed weight signs at various points in the initial stages of training (after 75,000, 150,000, or 300,000 iterations) on ImageNet test performance for models trained with FA-uSF and fixed weight norms. The bottom panel displays the proportion of weights that had changed sign since initialization for each condition. Somewhat surprisingly, weights continue to flip sign at a high rate, and allowing this mobility has a direct impact on network performance.

Sign-concordant feedback was also crucial to performance, consistent with the idea that "vanilla" feedback alignment does not scale to deep networks (Bartunov et al., 2018; Xiao et al., 2018b). An attractive possibility is that the segregation of biological neurons into excitatory and inhibitory subtypes permits cell-type-specific wiring that promotes sign-concordant feedback. However, the adverse effects of enforcing fixed E/I connections (Figure 5) represent a significant obstacle to biological realism. Our experiments show not only that unconstrained neurons continue to switch between excitatory and inhibitory modes throughout learning, but also that the freedom to do so is directly connected with improved task performance. Developing further understanding of this phenomenon is important for understanding not only the brain, but also the learning process in deep networks.

While our results and those of Xiao et al. (2018b) represent a step toward understanding the plausibility of such an algorithm, many questions remain, including how performance is affected in networks without weight sharing (Bartunov et al., 2018) and how to incorporate continuous learning without separate forward and backward passes (Guergiuev et al., 2016; Sacramento et al., 2018).
Further experimental and computational studies are needed to develop clearer analogies between biological and artificial neural network learning.

References
P. Baldi, P. J. Sadowski, and Z. Lu. Learning in the machine: Random backpropagation and the learning channel. CoRR, 2016. URL http://arxiv.org/abs/1612.02734.

S. Bartunov, A. Santoro, B. A. Richards, G. E. Hinton, and T. Lillicrap. Assessing the scalability of biologically-motivated deep learning algorithms and architectures. ArXiv e-prints, 2018. URL http://arxiv.org/abs/1807.04587.

Y. Bengio. How auto-encoders could provide credit assignment in deep networks via target propagation. CoRR, 2014. URL http://arxiv.org/abs/1407.7906.

R. Brandt and F. Lin. Can supervised learning be achieved without explicit error back-propagation? In Proceedings of International Conference on Neural Networks (ICNN), 1996.

F. Crick. The recent excitement about neural networks. Nature, 337:129–132, 1989.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh and M. Titterington, editors, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256. PMLR, 2010.

S. Grossberg. Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23–63, 1987.

J. Guergiuev, T. P. Lillicrap, and B. A. Richards. Towards deep learning with segregated dendrites. ArXiv e-prints, 2016. URL http://arxiv.org/abs/1610.00161.

G. Hinton. The ups and downs of Hebb synapses. Canadian Psychology/Psychologie canadienne, 44(1):10–13, 2003.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, 2015. URL http://arxiv.org/abs/1502.03167.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, 2014. URL http://arxiv.org/abs/1412.6980.

Y. Le Cun. Learning process in an asymmetric threshold network. In E. Bienenstock, F. F. Soulié, and G. Weisbuch, editors, Disordered Systems and Biological Organization, pages 233–240. Springer Berlin Heidelberg, 1986.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

D.-H. Lee, S. Zhang, A. Fischer, and Y. Bengio. Difference target propagation. In A. Appice, P. P. Rodrigues, V. Santos Costa, C. Soares, J. Gama, and A. Jorge, editors, Machine Learning and Knowledge Discovery in Databases, pages 498–515. Springer International Publishing, Cham, 2015.

Q. Liao, J. Z. Leibo, and T. A. Poggio. How important is weight symmetry in backpropagation? CoRR, 2015. URL http://arxiv.org/abs/1510.05067.

T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman. Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7, 2016.

A. Nøkland. Direct feedback alignment provides learning in deep neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1037–1045. Curran Associates, Inc., 2016.

D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

J. Sacramento, R. Ponte Costa, Y. Bengio, and W. Senn. Dendritic error backpropagation in deep cortical microcircuits. ArXiv e-prints, 2018. URL http://arxiv.org/abs/1801.00062.

J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. Striving for simplicity: The all convolutional net. CoRR, 2014. URL http://arxiv.org/abs/1412.6806.

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.

G. Turrigiano and S. B. Nelson. Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5(2):97, 2004.

L. Xiao, Y. Bahri, J. Sohl-Dickstein, S. S. Schoenholz, and J. Pennington. Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. ArXiv e-prints, 2018a. URL http://arxiv.org/abs/1806.05393.

W. Xiao, H. Chen, Q. Liao, and T. Poggio. Biologically-plausible learning algorithms can scale to large datasets. ArXiv e-prints, 2018b. URL http://arxiv.org/abs/1811.03567.

D. Zipser and D. Rumelhart. The neurobiological significance of the new learning models. Computational Neuroscience, pages 192–200, 1993.

Acknowledgments

Research was supported by a Burroughs-Wellcome Award (A.L.-K.) and by NSF NeuroNex Award DBI-1707398 and the Gatsby Charitable Foundation.
Supplemental information
In the MNIST network, the learning rate was reduced by a fixed factor partway through training. A stepped learning rate decay was also used in the CIFAR-10 models: in the LeNet-style model, the learning rate was decayed by a constant factor on a fixed epoch schedule, and in the all-convolutional model it was likewise multiplied by a constant factor on a fixed epoch schedule. Weight decay was also added to all layers. The batch size was set to n = 50 for the MNIST network and n = 128 for the CIFAR-10 models. Dropout (Srivastava et al., 2014) was applied after the densely-connected layer in the MNIST network (layer 5). In the all-convolutional architecture, dropout was applied after the downsampling layers (3 and 6), as well as after the input layer.

Layer  MNIST               CIFAR-10 Model 1    CIFAR-10 Model 2
1      conv. ReLU          conv. ReLU          conv. ReLU
2      max-pool stride 2   max-pool stride 2   conv. ReLU
3      conv. ReLU          conv. ReLU          conv. ReLU (strided)
4      max-pool stride 2   max-pool stride 2   conv. ReLU
5      1024 dense ReLU     384 dense ReLU      conv. ReLU
6      10-way softmax      192 dense ReLU      conv. ReLU (strided)
7      -                   10-way softmax      conv. ReLU
8      -                   -                   conv. ReLU
9      -                   -                   conv. ReLU
10     -                   -                   global average pooling
11     -                   -                   10-way softmax

Table 2: Model architectures. CIFAR-10 images were cropped as part of data augmentation to increase the size of the training set.

Layer  ImageNet Model 1      ImageNet Model 2
1      conv. ReLU stride 2   conv. ReLU stride 4
2      max-pool stride 2     conv. ReLU stride 2
3      conv. ReLU stride 2   conv. ReLU stride 3
4      max-pool stride 2     conv. ReLU stride 2
5      512 dense ReLU        conv. ReLU stride 1
6      512 dense ReLU        conv. ReLU stride 2
7      1000-way softmax      conv. ReLU stride 1
8      -                     conv. ReLU stride 1
9      -                     conv. ReLU stride 1
10     -                     global average pooling

Table 3: ImageNet model architectures.