Biological credit assignment through dynamic inversion of feedforward networks
William F. Podlaski∗, Champalimaud Research, Champalimaud Centre for the Unknown, 1400-038 Lisbon, Portugal
Christian K. Machens∗, Champalimaud Research, Champalimaud Centre for the Unknown, 1400-038 Lisbon, Portugal
Abstract
Learning depends on changes in synaptic connections deep inside the brain. In multilayer networks, these changes are triggered by error signals fed back from the output, generally through a stepwise inversion of the feedforward processing steps. The gold standard for this process — backpropagation — works well in artificial neural networks, but is biologically implausible. Several recent proposals have emerged to address this problem, but many of these biologically-plausible schemes are based on learning an independent set of feedback connections. This complicates the assignment of errors to each synapse by making it dependent upon a second learning problem, and by fitting inversions rather than guaranteeing them. Here, we show that feedforward network transformations can be effectively inverted through dynamics. We derive this dynamic inversion from the perspective of feedback control, where the forward transformation is reused and dynamically interacts with fixed or random feedback to propagate error signals during the backward pass. Importantly, this scheme does not rely upon a second learning problem for feedback because accurate inversion is guaranteed through the network dynamics. We map these dynamics onto generic feedforward networks, and show that the resulting algorithm performs well on several supervised and unsupervised datasets. We also link this dynamic inversion to Gauss-Newton optimization, suggesting a biologically-plausible approximation to second-order learning. Overall, our work introduces an alternative perspective on credit assignment in the brain, and proposes a special role for temporal dynamics and feedback control during learning.
Synaptic credit assignment refers to the difficult task of relating a motor or behavioral output of the brain to the many neurons and synapses that produced it (Roelfsema and Holtmaat, 2018) — a problem which must be solved in order for effective learning to occur. While credit is assigned in artificial neural networks through the backpropagation of error gradients (Rumelhart et al., 1986; LeCun et al., 2015), a direct mapping of this algorithm to biology leads to several characteristics that are either in conflict with what is currently known about neural circuits, or that violate harder physical constraints, such as the local nature of synaptic plasticity (Grossberg, 1987; Crick, 1989). Many biologically-plausible modifications to backpropagation have been proposed over the years (Whittington and Bogacz, 2019), with several recent studies focusing on one issue in particular, the fact that error is fed back using an exact copy of the forward weights (the "weight transport" or "weight symmetry" problem, Lillicrap et al. (2020)). Recently, it was discovered that random feedback weights are sufficient to train deep networks on modest supervised learning problems (Lillicrap et al., 2016). However, this method appears to have shortcomings in scaled-up tasks, as well as in convolutional and bottleneck architectures (Bartunov et al., 2018; Moskovitz et al., 2018). Several studies have therefore aimed to identify the necessary precision of feedback (Nøkland, 2016; Xiao et al., 2018), and others have proposed to learn separate feedback connections (Bengio, 2014; Lee et al., 2015; Akrout et al., 2019; Lansdell et al., 2019). While it is indeed plausible that feedback weights are updated alongside forward ones, these schemes complicate credit assignment by making error backpropagation dependent upon an additional learning problem (with uncertain accuracy), and by potentially introducing more learning phases.

∗ Correspondence: {william.podlaski, christian.machens}@research.fchampalimaud.org

Preprint. Under review.

One important characteristic of biological neural circuits is their dynamic nature, which has been harnessed in many previous learning models (Hinton et al., 1995; O'Reilly, 1996; Rao and Ballard, 1999). Here, we take inspiration from this dynamical perspective, and propose a model of error backpropagation as a feedback control problem — during the backward pass, feedback connections are used in concert with forward connections to dynamically invert the forward transformation, thereby enabling errors to flow backward. Importantly, this inversion works with arbitrary fixed feedback weights, and avoids introducing a second learning problem for the feedback. In the following, we derive this dynamic inversion, map it onto deep feedforward networks, and demonstrate its performance on several supervised tasks, as well as an autoencoder task. We also make a link between the algorithm and second-order learning. Finally, we discuss the biological implications of this perspective, and the relation to previous dynamic algorithms for credit assignment.

We consider nonlinear feedforward networks with $L$ layers. The forward pass (forward transformation; Fig.
1a) from one layer to the next is
$$h_l = g(a_l) = g(W_l h_{l-1}), \qquad (1)$$
where $h_l \in \mathbb{R}^{d_l}$ is the activity of layer $l$, $g(\cdot)$ is an arbitrary element-wise nonlinearity, $a_l \in \mathbb{R}^{d_l}$ is the "pre-activation" activity of layer $l$, and $W_l \in \mathbb{R}^{d_l \times d_{l-1}}$ denotes the weight matrix from layer $l-1$ to $l$ (including bias). The input data, network output, and supervised target are denoted $x = h_0$, $y = h_L$, and $t$, respectively. The error is denoted $e = y - t$, and the loss function is $\mathcal{L}(x, t)$. For such networks, each layer's weights are commonly optimized using gradient descent:
$$\Delta W_l \propto -\frac{\partial \mathcal{L}}{\partial W_l} = -\frac{\partial \mathcal{L}}{\partial a_l}\frac{\partial a_l}{\partial W_l} = -\delta_l h_{l-1}^T, \qquad (2)$$
with the backpropagated error $\delta_l = \partial \mathcal{L}/\partial a_l \in \mathbb{R}^{d_l}$. We write $\delta_l$ in a generalized recursive form
$$\delta_{l-1} = \frac{\partial a_l}{\partial a_{l-1}}\,\delta_l = M_l \delta_l \circ g'(a_{l-1}) = D_{l-1} M_l \delta_l, \qquad (3)$$
where $\circ$ is the Hadamard (element-wise) product, $D_l = \mathrm{diag}(g'(a_l))$, and $M_l = W_l^T \in \mathbb{R}^{d_{l-1} \times d_l}$ is the feedback weight matrix (the source of the weight transport problem).

As mentioned above, learning can sometimes be achieved with a fixed random feedback matrix, a strategy termed feedback alignment (FA), in part due to the observed alignment between the forward weights and the pseudoinverse of the feedback weights during training (Lillicrap et al., 2016). The authors of this study also describe a biologically-implausible idealization of this algorithm, pseudobackprop (PBP), which propagates errors through the pseudoinverse of the current feedforward weights. These results, as well as other studies proposing to learn feedback as an inverted forward transformation (e.g., target prop, Bengio (2014); Lee et al. (2015)), motivate the perspective that the goal of credit assignment is to invert the feedforward transformation of the network. We summarize these variants of backpropagation as different choices for $M_l$ in Eq.
(3):
$$M_l = \begin{cases} W_l^T & \text{for backpropagation (BP)} \\ B_l & \text{for feedback alignment (FA)} \\ W_l^+ & \text{for pseudobackprop (PBP)} \end{cases} \qquad (4)$$
where $B_l$ is a fixed random matrix and $W_l^+$ is the Moore-Penrose pseudoinverse of $W_l$.

Figure 1: Schematic of forward and backward passes. a: Standard forward pass from Eq. (1). b: Backpropagation formulated as an integral control problem — the error between the forward transformation ($\tilde\delta_l$) and the target error value ($\delta_l$) is integrated and fed back to produce a new target error $\delta_{l-1}$. c: Dynamic inversion during the backward pass implements this control problem.

We now introduce a simple recurrent architecture (Fig. 1b,c) which dynamically and implicitly performs inversions similar to those explicitly performed by pseudobackprop and target prop as outlined above. Considering the backward pass of a linear feedforward network (Eq. (1), with $g(x) = x$), the error from the $l$-th layer, $\delta_l$, should be transformed into an error for the $(l-1)$-th layer, $\delta_{l-1}$. From a linear feedback control perspective, we can let the $l$-th layer feed a control signal, $u_l(t) \in \mathbb{R}^{d_l}$, into the $(l-1)$-th layer, such that the state of this layer, $\delta_{l-1}$, when propagated through the feedforward transformation of the network, reproduces, as closely as possible, the target error vector, $\delta_l$, of layer $l$. We define this as a linear control problem of the following form:
$$\dot\delta_{l-1}(t) = -\delta_{l-1}(t) + B_l u_l(t), \qquad (5)$$
$$\tilde\delta_l(t) = W_l \delta_{l-1}(t), \qquad (6)$$
where $\delta_{l-1}(t) \in \mathbb{R}^{d_{l-1}}$ is the system state of layer $l-1$, $\tilde\delta_l(t) \in \mathbb{R}^{d_l}$ is the readout or forward transformation of this system, $u_l(t) \in \mathbb{R}^{d_l}$ is the control signal fed back from layer $l$, and $B_l \in \mathbb{R}^{d_{l-1} \times d_l}$ is a matrix of arbitrary feedback weights. We define a fixed target error value for the readout, $\delta_l$, and a separate controller error, $e_l(t) = \tilde\delta_l(t) - \delta_l$.
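Before turning to the controller itself, the three feedback choices in Eq. (4) can be made concrete in a few lines of NumPy. This is an illustrative sketch (not the authors' code); the dimensions and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_l, d_lm1 = 10, 20
W = rng.standard_normal((d_l, d_lm1)) / np.sqrt(d_lm1)  # forward weights W_l
B = rng.standard_normal((d_lm1, d_l)) / np.sqrt(d_l)    # fixed random feedback B_l
delta_l = rng.standard_normal(d_l)                      # error at layer l

# Eq. (4): the three choices of feedback matrix M_l
M_bp = W.T                 # backpropagation (requires weight transport)
M_fa = B                   # feedback alignment (fixed random matrix)
M_pbp = np.linalg.pinv(W)  # pseudobackprop (explicit pseudoinverse)

# Linear part of the recursion in Eq. (3): delta_{l-1} = M_l delta_l
delta_bp = M_bp @ delta_l
delta_fa = M_fa @ delta_l
delta_pbp = M_pbp @ delta_l
```

Note that only the PBP error satisfies the inversion property `W @ delta_pbp == delta_l` (here `W` has full row rank, so `W` composed with its pseudoinverse is the identity); dynamic inversion achieves this same property through dynamics, without ever computing a pseudoinverse explicitly.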
A standard approach in designing a controller is to use a proportional-integral-derivative (PID) formulation (Åström and Murray, 2010) that acts on the controller error $e_l(t)$, with dynamics
$$\dot u_l(t) = K_p \dot e_l(t) + K_i e_l(t) + K_d \ddot e_l(t) + K_u u_l(t), \qquad (7)$$
where $K_p$, $K_i$, and $K_d$ are coefficient matrices for the proportional, integral, and derivative components, respectively, along with an additional leak component with coefficients $K_u$. For mathematical simplicity and biological plausibility, we only consider the integral and leak components (see Discussion for interpretation of the other terms), setting their coefficients to $K_i = I_l$ and $K_u = -\alpha I_l$. These components have a direct interpretation in rate networks (Dayan and Abbott, 2001), and have been used in other neuroscience and biological contexts (Miller and Wang, 2006; Somvanshi et al., 2015). We thus obtain the following leaky integral-only controller
$$\dot u_l(t) = -\alpha u_l(t) + e_l(t) = -\alpha u_l(t) + W_l \delta_{l-1}(t) - \delta_l, \qquad (8)$$
which acts on Eq. (5). For a fixed target $\delta_l$, this controller has the steady-state equality
$$W_l \delta_{l-1} = \delta_l + \alpha u_l, \qquad (9)$$
which suggests that the steady state of $\delta_{l-1}$ approximates the target $\delta_l$ through the forward transformation (for small $\alpha$). For $\alpha > 0$, $\delta_{l-1}$ can be written as (using Eq. (5) in the steady state)
$$\delta_{l-1} = B_l (W_l B_l - \alpha I)^{-1} \delta_l = (B_l W_l - \alpha I)^{-1} B_l \delta_l. \qquad (10)$$
When $\alpha = 0$, only one of these equalities will hold, depending on the dimensionalities $d_l$ and $d_{l-1}$. For expository purposes, we also write the solution as a function of the control signal $u_l$:
$$\delta_{l-1} = M_l (\delta_l + \alpha u_l), \qquad (11)$$
where
$$M_l = \begin{cases} B_l (W_l B_l)^{-1} & \text{for } d_l < d_{l-1} \\ W_l^+ & \text{for } d_l > d_{l-1} \\ W_l^{-1} & \text{for } d_l = d_{l-1}, \end{cases} \qquad (12)$$
and $W_l^+$ is the Moore-Penrose pseudoinverse of the forward matrix $W_l$. We thus see that this system dynamically inverts the forward transformation of the network (for small $\alpha$), implicitly solving the linear system $W_l \delta_{l-1} = \delta_l$.
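To make the closed form concrete, the coupled dynamics of Eqs. (5) and (8) can be integrated numerically and checked against Eq. (10). The following is a minimal sketch under assumed dimensions and parameters (not the paper's simulation code), using the stabilizing initialization $B_l = -W_l^T$ discussed below.

```python
import numpy as np

rng = np.random.default_rng(1)
d_l, d_lm1, alpha, dt = 10, 20, 1e-2, 1e-2
W = rng.standard_normal((d_l, d_lm1)) / np.sqrt(d_lm1)  # forward weights W_l
B = -W.T                                                # stabilizing feedback init
delta_target = rng.standard_normal(d_l)                 # fixed target error delta_l

# Euler-integrate the system (Eq. (5)) and leaky integral controller (Eq. (8))
delta_prev = np.zeros(d_lm1)   # system state delta_{l-1}(t)
u = np.zeros(d_l)              # control signal u_l(t)
for _ in range(100_000):
    delta_prev += dt * (-delta_prev + B @ u)                # Eq. (5)
    u += dt * (-alpha * u + W @ delta_prev - delta_target)  # Eq. (8)

# The steady state should match the closed-form solution of Eq. (10)
closed_form = B @ np.linalg.solve(W @ B - alpha * np.eye(d_l), delta_target)
print(np.allclose(delta_prev, closed_form, atol=1e-6))
```

At the fixed point, the steady-state equality of Eq. (9), $W_l \delta_{l-1} = \delta_l + \alpha u_l$, also holds, so the inverted error reproduces the target up to the small leak term.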
For $d_l \geq d_{l-1}$ (expanding layer), $W_l$ has a well-defined left pseudoinverse (or inverse, for $d_l = d_{l-1}$), and so the inversion follows directly from Eq. (9). In contrast, for $d_l < d_{l-1}$ (contracting layer), the system may have infinitely many solutions. The dynamics instead solves the fully-determined system $(W_l B_l - \alpha I) u_l = \delta_l$, which is then projected through $B_l$ to obtain $\delta_{l-1}$ (i.e., one solution to the desired linear system, constrained by $B_l$).

The dynamic inversion will only be useful if it is stable and fast. Integral-only control may exhibit substantial transient oscillations, which can be mitigated if the system dynamics are fast compared to the controller. Assuming this separation of timescales, we can study the controller dynamics from Eq. (8) when the system is at its steady state ($\delta_{l-1} = B_l u_l$ from Eq. (5)):
$$\dot u_l(t) = (W_l B_l - \alpha I) u_l(t) - \delta_l. \qquad (13)$$
Linear stability thus depends on the eigenvalues of $(W_l B_l - \alpha I)$. Generally, the stability of interacting neural populations (and the eigenvalues of arbitrary matrix products) is an open question, and we do not aim to solve it here. We instead propose that clever initialization of $B_l$ will provide stability throughout training (in addition to a non-zero leak, $\alpha$). One easy way to ensure this is to initialize $B_l = -W_l^T(0)$, which makes the matrix product negative semi-definite (the zero index indicates the start of training). From Eq. (12), this also means that for $d_l < d_{l-1}$, dynamic inversion will use the Moore-Penrose pseudoinverse at the start of training. Note that this initialization does not suggest a correspondence between the forward and backward weights throughout training, as they may become unaligned when the forward weights are updated. In the case where $d_l > d_{l-1}$, the matrix product is singular and requires $\alpha > 0$ (but see Supplementary Materials for an alternative architecture).
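The stability condition on Eq. (13) is straightforward to check numerically. The sketch below (illustrative dimensions, not the paper's code) verifies that the initialization $B_l = -W_l^T(0)$ keeps all eigenvalues of $(W_l B_l - \alpha I)$ in the left half-plane, whereas a generic random feedback matrix carries no such guarantee and must be checked (or optimized) explicitly.

```python
import numpy as np

rng = np.random.default_rng(2)
d_l, d_lm1, alpha = 10, 20, 1e-2
W = rng.standard_normal((d_l, d_lm1)) / np.sqrt(d_lm1)  # forward weights W_l

# B_l = -W_l(0)^T makes W_l B_l = -W_l W_l^T negative semi-definite, so every
# eigenvalue of (W_l B_l - alpha I) has a strictly negative real part (Eq. (13)).
B_init = -W.T
eigs_init = np.linalg.eigvals(W @ B_init - alpha * np.eye(d_l))
print(eigs_init.real.max() < 0)  # stable by construction

# A generic random B_l offers no such guarantee; stability must be checked.
B_rand = rng.standard_normal((d_lm1, d_l)) / np.sqrt(d_l)
eigs_rand = np.linalg.eigvals(W @ B_rand - alpha * np.eye(d_l))
print(eigs_rand.real.max())      # sign depends on the random draw
```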
Returning to the general case of nonlinear feedforward networks, the controller in this case is
$$\dot u_l(t) = -\alpha u_l(t) + W_l g(\delta_{l-1}(t)) - \delta_l, \qquad (14)$$
where $g(\cdot)$ is an arbitrary nonlinearity such as tanh or ReLU (we keep the backward pass linear for simplicity). Note that we have opted to derive the network from the perspective of the pre-activation variables for consistency with backpropagation. Now, the steady state for the system is
$$W_l g(\delta_{l-1}) = \delta_l + \alpha u_l. \qquad (15)$$
Deriving explicit relationships between $\delta_{l-1}$ and $\delta_l$ is trickier, especially with common transfer functions like tanh and ReLU, which do not have well-defined inverses (at least numerically). To gain an intuition, we use somewhat sloppy notation and write an implicit, non-unique inverse $g^{-1}(\cdot)$, for which $g^{-1}(g(\delta_{l-1})) \approx \delta_{l-1}$ and $g(g^{-1}(\delta_l)) \approx \delta_l$. We can then write $\delta_{l-1}$ recursively as
$$\delta_{l-1} = g^{-1}(M_l \delta_l + \alpha u_l), \qquad (16)$$
with $M_l$ from Eq. (12), suggesting that the network is able to approximately invert nonlinear transformations as well. We note that stability is no longer guaranteed, but in practice we find that linear stability still provides a decent indication of stability in the general case.

Backpropagation in feedforward networks is a recursive, layer-wise process. However, when chaining together multiple dynamic inversions, each hidden layer must simultaneously serve as the recipient of control from the layer above, as well as the controller for the layer below. We propose three architectures which solve this problem in different ways, illustrated in Fig. 2.

Figure 2: Schematic of forward (left) and backward (right) passes for chained dynamic inversion. a: Sequential backward pass (right), in which the error is inverted through one layer at a time, with each layer first receiving control signals from the layer above, and then acting as the controller for the layer below.
b: Repeat layer backward pass, enabling each layer to both give and receive control, so that the full backward pass converges at once. c: Single loop backward pass features feedback from the output layer to the first hidden layer, which effectively inverts each hidden layer.
The most direct way of mapping multiple dynamic inversions onto a feedforward network is to prescribe that each inversion happens sequentially — from the output to the first hidden layer — with only one pair of layers dynamically interacting at a time (sequential backward pass, Fig. 2a). Such a scheme begins by feeding the output error, $\delta_L$, into the output layer, which provides control to the last hidden layer until convergence to the target $\delta_{L-1}$. Next, this target is held fixed and is re-passed as input back into layer $L-1$, which now acts as a controller for layer $L-2$, to obtain the target $\delta_{L-2}$. This is repeated until the first hidden layer converges to its target, $\delta_1$. This scheme requires a backward pass with multiple steps for networks with more than one hidden layer ($L-1$ steps).

The fact that each hidden layer functions as both a recipient of control, and a controller itself, motivates the second architecture, in which the hidden layers have two separate populations, each serving one of these roles (repeat layer backward pass, Fig. 2b). For the forward pass to remain unchanged, these populations ($h_l^A$ and $h_l^B$) have an identity transformation between them. During the backward pass, each controller receives the target value $\delta_l$ as it settles, speeding up convergence. The steady-state errors will be the same as in the sequential case, but with a single backward pass.

An alternative approach to chaining multiple dynamic inversion control problems together is to turn them into a single problem (single loop backward pass, Fig. 2c). In this scheme, the output layer acts as the controller for the activity of the first hidden layer. This architecture thus uses a single dynamic inversion, with the forward transformation encompassing nearly the entire forward pass. Interestingly, for a linear network, such control leads each subsequent hidden layer to converge to the same error value as the previous two architectures.
However, in general, the forward pass may be highly nonlinear, requiring a similarly complex control signal for successful inversion, and so this architecture could converge to a different solution compared to the other two architectures.

We define the backpropagated error signal $\delta_l$ for dynamic inversion (DI) as the steady state of the feedback control dynamics from Eq. (16), and weight updates as in Eq. (2). Biases are not included in the dynamics of the backward pass, but are updated with the converged error signals similar to standard backprop. Unlike backprop, however, the derivative of the nonlinear transfer function does not need to be applied, as it is implicitly incorporated in the inversion. As a point of comparison, we also implement a non-dynamic inversion (NDI), in which the explicit linear inversion from Eq. (10) is combined with nonlinearity gradients following Eqs. (2) and (3).
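As a concrete sketch of how the converged errors drive learning, the following combines the closed-form (NDI) inversion of Eq. (10), applied sequentially as in the sequential backward pass, with the weight update of Eq. (2) for a small linear network. Dimensions, learning rate, and initialization are illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)
dims = [30, 20, 10]            # linear 30-20-10 network, as in the regression task
alpha, lr = 1e-2, 1e-3

Ws = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
      for i in range(len(dims) - 1)]
Bs = [-W.T for W in Ws]        # fixed feedback, initialized as -W_l^T

def forward(Ws, x):
    """Forward pass, Eq. (1) with g(x) = x; returns all layer activities."""
    h = [x]
    for W in Ws:
        h.append(W @ h[-1])
    return h

def ndi_backward(Ws, Bs, error, alpha):
    """Sequential backward pass: each layer's error is the steady state of
    Eqs. (5)+(8), computed in closed form via Eq. (10)."""
    deltas = [error]
    for W, B in zip(Ws[:0:-1], Bs[:0:-1]):   # layers L, L-1, ..., 2
        d = W.shape[0]
        delta = B @ np.linalg.solve(W @ B - alpha * np.eye(d), deltas[0])
        deltas.insert(0, delta)
    return deltas                            # [delta_1, ..., delta_L]

# One training step on a random sample
x, t = rng.standard_normal(30), rng.standard_normal(10)
h = forward(Ws, x)
deltas = ndi_backward(Ws, Bs, h[-1] - t, alpha)
for l in range(len(Ws)):
    Ws[l] = Ws[l] - lr * np.outer(deltas[l], h[l])   # Eq. (2)
```

Note that the feedback matrices `Bs` are never updated — repeating this step still reduces the squared error, which is the point of the scheme: no second learning problem for the feedback.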
Figure 3: Results for linear (30-20-10) and nonlinear (30-20-10-10) regression tasks for BP, FA, PBP, and three realizations of DI, NDI, and SDI (only for nonlinear) with different weight initializations and leak values. Legend below panel g. Learning rate = − for all algorithms. a: Normalized mean squared error (NMSE) for linear regression. b: NMSE for linear regression with fixed-norm weight updates. c: Stability of DI as measured by the maximum real eigenvalues of the system dynamics. d: Angle between the backpropagated error vectors to the hidden layer ($\delta_1$) for NDI vs. DI. e: Angle between feedback weights and the negative transpose of the forward weights (shown for DI and NDI). f: Angle between backpropagated error vectors for DI and PBP. g-j: Same as a, c, d, e but for nonlinear regression. k: Angle between the backpropagated error vectors to the hidden layer for NDI vs. SDI. Dashed lines refer to the output layer, and solid lines to the middle layer in h-k.

The inversion of the forward weights in DI suggests a resemblance to second-order learning, which we illustrate here with a specific example. We consider the update rules for the penultimate weight matrix, $W_{L-1}$, of a network performing regression with squared-error loss. For simplicity in the comparison, we also assume that the input to each layer is whitened. Following the formulation of Gauss-Newton (GN) optimization for deep networks in Botev et al. (2017) (Supplementary Materials), we can derive the block-diagonal GN update for $W_{L-1}$ as
$$\Delta W_{L-1} \propto -(W_L D_{L-1})^+ e\, h_{L-2}^T, \qquad (17)$$
where $D_{L-1} = \mathrm{diag}(g'(a_{L-1}))$. As can be seen, such a scheme features an inversion of the forward pass similar to Eq. (16), but with an explicit first-order approximation of the nonlinearity. Dynamic inversion instead handles nonlinearities implicitly in the dynamics. However, these should be similar for small error values, suggesting that dynamic inversion approximates Gauss-Newton optimization.
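This correspondence can be checked numerically in the simplest setting — a linear penultimate layer ($D_{L-1} = I$) with a whitened input: the gradient corrected by the (pseudo)inverse of the block-diagonal GN curvature coincides with the update obtained by backpropagating the error through $W_L^+$ (PBP, or NDI with $\alpha \to 0$). Dimensions and names below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
d2, d1, d0 = 10, 20, 30         # output, penultimate, and input dimensions
W2 = rng.standard_normal((d2, d1)) / np.sqrt(d1)  # top weights W_L
h0 = rng.standard_normal(d0)    # (whitened) input to layer L-1
e = rng.standard_normal(d2)     # output error e = y - t

# First-order gradient for W_{L-1} in a linear network: W_L^T e h0^T
grad = np.outer(W2.T @ e, h0)

# Applying the pseudoinverse of the block-diagonal GN curvature W_L^T W_L ...
dW_gn = -np.linalg.pinv(W2.T @ W2) @ grad

# ... gives the same update as backpropagating through W_L^+ (Eq. (17), D = I)
dW_pbp = -np.outer(np.linalg.pinv(W2) @ e, h0)

print(np.allclose(dW_gn, dW_pbp))  # True: A^+ = (A^T A)^+ A^T
```

The underlying identity, $A^+ = (A^T A)^+ A^T$, is what lets the pseudoinverse-based error propagation absorb the curvature correction without the Hessian ever being formed.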
We tested dynamic inversion (DI) and non-dynamic inversion (NDI) against backpropagation (BP), feedback alignment (FA), and pseudobackprop (PBP) on four modest supervised and unsupervised learning tasks — linear regression, nonlinear regression, MNIST classification, and MNIST autoencoding. We varied the leak values ($\alpha$) and feedback initializations ("Tr-Init", $B_l = -W_l^T$; "R-Init", random stable $B_l$) for DI and NDI. To impose stability for random initialization, we optimized the feedback matrix $B_l$ using smoothed spectral abscissa (SSA) optimization (Vanbiervliet et al., 2009; Hennequin et al., 2014) at the start of training, and whenever it became unstable (Supplementary Materials). To ensure accurate convergence, DI was simulated numerically using Euler steps with dt = 0.…
Figure 4: Results for MNIST classification (784-1000-10 architecture) and autoencoding (784-500-250-30 architecture) trained with BP, FA, PBP, and two realizations of NDI. a: Test error of MNIST classification over 15 epochs of online training (learning rate = − for all algorithms). b: Training error of MNIST autoencoding over 25 epochs of mini-batch training (learning rate = − for all algorithms). c, d: Examples of input and output digits and a 2D t-SNE representation of the 30-dim. latent space for BP with orthogonal weight initialization and NDI with random weight initialization.

Following Lillicrap et al. (2016), we tested the algorithms on a simple linear regression task with a two-layer network (dim. 30-20-10). Training was done with a fixed learning rate (Fig. 3a), or with fixed-norm weight updates (Fig. 3b) in order to probe the update directions that each algorithm finds. All algorithms were able to solve this simple task with ease, with DI, NDI, and PBP converging faster than BP and FA in the fixed-norm case, providing further evidence for the relation to second-order learning. With the exception of substantial variability in the first ~100 iterations, the dynamic inversion remained stable throughout training for all examples shown, with updates well-aligned to the non-dynamic version (Fig. 3c,d). Furthermore, the alignment between the feedback and the negative transpose of the forward weights settled to around 45 degrees for all DI and NDI models, which also produced alignment with the PBP updates (Fig. 3e,f).

Next, we tested performance on a small nonlinear regression task with a three-layer network (dim. 30-20-10-10, tanh nonlinearities), also following Lillicrap et al. (2016). All inversion algorithms finished with better performance compared to BP and FA (Fig. 3g), but often with slower convergence, which was unexpected considering the approximation to second-order optimization.
We speculate that this may be due to inaccuracies in approximating the loss curvature (or overly small, "conservative" updates), which is common for many such algorithms (Martens, 2014). DI dynamics remained stable throughout training, and closely followed NDI updates (except for $\alpha = 0$, likely due to slow convergence; Fig. 3h,i). Alignment of $B$ varied with the layer and algorithm (Fig. 3j) — some layers settled to ~45 degrees, but others remained close to zero; this is intriguing, but may be due to the simplicity of the problem. Finally, we also simulated single-loop dynamic inversion (SDI), which behaved very differently from DI, suggesting that it converges to different steady states (Fig. 3k).

We next tested dynamic inversion on the MNIST handwritten digit dataset, using the standard training and test sets (LeCun et al., 1998) with a two-layer architecture (dim. 784-1000-10, as in Lillicrap et al. (2016)). Due to the close alignment of the updates for DI and NDI (with the exception of $\alpha = 0$), and the computational complexity of simulating DI for large-scale problems, we chose to simulate only NDI and use it as a proxy for DI performance. All algorithms performed similarly (Fig. 4a), with PBP and NDI having a slightly worse test error (BP, 1.9%; FA, 1.7%; PBP and all NDI, 2.7%), possibly again due to slower convergence.

Finally, we trained an autoencoder network on the MNIST dataset (dim. 784-500-250-30-250-500-784; nonlinearities tanh-tanh-linear-tanh-tanh-linear, a reduced version of Hinton and Salakhutdinov (2006)) with mini-batch training (100 examples per batch). Notably, first-order optimization algorithms have trouble dealing with the "pathological curvature" of such problems and often learn very slowly (Martens, 2010) (especially FA, Lansdell et al. (2019)). We trained BP, FA, PBP, and NDI with two weight initializations — random Gaussian, and random orthogonal, which has been shown to speed up learning (Saxe et al., 2013).
We found that BP only learns successfully with orthogonal weight initialization, whereas PBP and NDI perform decently in either case, further suggesting that they use second-order information. Notably, PBP and NDI performance is slower with orthogonal initialization, where second-order information is not useful (but this might be mitigated by having non-orthogonal feedback weights). Furthermore, FA performed poorly in both cases, providing evidence that dynamic inversion is superior to random feedback. BP with orthogonal initialization and NDI with random initialization result in similar performance (Fig. 4c,d).

Several previous studies have proposed biologically-plausible solutions to credit assignment involving dynamics of some kind, but most contain conceptual differences with our framework — e.g., learning using differences in activity over time or phase (contrastive Hebbian learning, O'Reilly (1996); Scellier and Bengio (2017)), using explicit error-encoding neurons (Whittington and Bogacz, 2017), or incorporating dendritic compartments (Guerguiev et al., 2017; Sacramento et al., 2018; Payeur et al., 2020). Furthermore, the aim of most of these models is to approximate error gradients, whereas dynamic inversion is unique in its link to second-order optimization. Most relevant to our work is a recent paper which also proposes to re-use forward weights in order to propagate errors through a feedback loop (Kohan et al., 2018). However, the authors do not formulate this as an inversion of the forward pass, and they use a contrastive learning scheme that differs substantially from ours. Finally, target propagation also aims to learn (non-dynamic) inversions (Bengio, 2014; Lee et al., 2015), making it conceptually similar — in principle, dynamic inversion can also be used to propagate targets, which may afford some biological plausibility or other benefits.
Unlike many other biologically-plausible algorithms for credit assignment, dynamic inversion does not require precise feedback weights. This is a crucial distinction, as it not only relaxes the assumptions on feedback wiring, but also allows feedback to be used concurrently for other roles, such as attention and prediction (Gilbert and Li, 2013). The architectures we proposed for chained dynamic inversion (Fig. 2) suggest different ways of using feedback for learning, and even leave open the possibility of direct feedback to much lower areas, which is known to exist in the brain (Felleman and Van Essen, 1991). Additionally, our work depends upon the stability and control of recurrent dynamics between interacting populations (or brain areas), which has received recent interest in neuroscience (Joglekar et al., 2018). Stability and fast convergence of dynamic inversion require slow control (Eq. (13)), suggesting that higher-order areas should be slower than the lower areas they control. Indeed, activity is known to slow down as it moves up the processing hierarchy (Murray et al., 2014).
We see two main limitations of dynamic inversion as a model for credit assignment in the brain. First, its dynamic nature means that the backward pass requires time to converge. Fast convergence would be easier to achieve with additional proportional and derivative control, as in Eq. (7) (Åström and Murray, 2010). Spike-based representations could help here, since they effectively add a derivative component to the signal (Eliasmith and Anderson, 2004; Boerlin et al., 2013; Abbott et al., 2016). Second, due to the relationship of our scheme with second-order learning and natural gradient methods, dynamic inversion could share some of the same problems (Martens, 2014; Kunstner et al., 2019). While our method avoids the cost of explicitly calculating and inverting a Hessian or Gauss-Newton matrix, in common with standard second-order methods (Pearlmutter, 1994; Schraudolph, 2002; Martens, 2010), its stability will be enhanced if the recurrent dynamics can be designed to better condition the inverse computations.

The true test of dynamic inversion will be whether or not it can be successfully scaled up to larger tasks (Bartunov et al., 2018; Xiao et al., 2018). Even so, it may be useful in other contexts where it is necessary to invert a computation, such as motor control (Wolpert and Kawato, 1998) and sensory perception (Pizlo, 2001). As an example, it was pointed out in a recent paper (Vértes and Sahani, 2019) that the successor representation — used in reinforcement learning and requiring an inverse to calculate explicitly — can be computed dynamically in a similar way to what we propose here.

Acknowledgments and Disclosure of Funding
This work was supported by the Simons Collaboration on the Global Brain (543009). We thank members of the Machens lab for helpful comments and feedback.
References
Abbott, L. F., DePasquale, B., and Memmesheimer, R.-M. (2016). Building functional networks of spiking model neurons. Nature Neuroscience, 19(3):350.

Akrout, M., Wilson, C., Humphreys, P., Lillicrap, T., and Tweed, D. B. (2019). Deep learning without weight transport. In Advances in Neural Information Processing Systems, pages 974–982.

Åström, K. J. and Murray, R. M. (2010). Feedback Systems: An Introduction for Scientists and Engineers. Princeton University Press.

Bartunov, S., Santoro, A., Richards, B., Marris, L., Hinton, G. E., and Lillicrap, T. (2018). Assessing the scalability of biologically-motivated deep learning algorithms and architectures. In Advances in Neural Information Processing Systems, pages 9368–9378.

Bengio, Y. (2014). How auto-encoders could provide credit assignment in deep networks via target propagation. arXiv preprint arXiv:1407.7906.

Boerlin, M., Machens, C. K., and Denève, S. (2013). Predictive coding of dynamical variables in balanced spiking networks. PLoS Computational Biology, 9(11).

Botev, A., Ritter, H., and Barber, D. (2017). Practical Gauss-Newton optimisation for deep learning. In Proceedings of the 34th International Conference on Machine Learning, pages 557–565. JMLR.org.

Crick, F. (1989). The recent excitement about neural networks. Nature, 337(6203):129–132.

Dayan, P. and Abbott, L. F. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press.

Desjardins, G., Simonyan, K., Pascanu, R., et al. (2015). Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079.

Eliasmith, C. and Anderson, C. H. (2004). Neural Engineering: Computation, Representation, and Dynamics in Neurobiological Systems. MIT Press.

Felleman, D. J. and Van Essen, D. (1991). Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47.

Gilbert, C. D. and Li, W. (2013). Top-down influences on visual processing. Nature Reviews Neuroscience, 14(5):350–363.

Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11(1):23–63.

Guerguiev, J., Lillicrap, T. P., and Richards, B. A. (2017). Towards deep learning with segregated dendrites. eLife, 6:e22901.

Hennequin, G., Vogels, T. P., and Gerstner, W. (2014). Optimal control of transient dynamics in balanced networks supports generation of complex movements. Neuron, 82(6):1394–1406.

Hinton, G. E., Dayan, P., Frey, B. J., and Neal, R. M. (1995). The "wake-sleep" algorithm for unsupervised neural networks. Science, 268(5214):1158–1161.

Hinton, G. E. and Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507.

Joglekar, M. R., Mejias, J. F., Yang, G. R., and Wang, X.-J. (2018). Inter-areal balanced amplification enhances signal propagation in a large-scale circuit model of the primate cortex. Neuron, 98(1):222–234.

Kohan, A. A., Rietman, E. A., and Siegelmann, H. T. (2018). Error forward-propagation: Reusing feedforward connections to propagate errors in deep learning. arXiv preprint arXiv:1808.03357.

Kunstner, F., Balles, L., and Hennig, P. (2019). Limitations of the empirical Fisher approximation. arXiv preprint arXiv:1905.12558.

Lansdell, B. J., Prakash, P., and Kording, K. P. (2019). Learning to solve the credit assignment problem. arXiv preprint arXiv:1906.00889.

LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lee, D.-H., Zhang, S., Fischer, A., and Bengio, Y. (2015). Difference target propagation. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 498–515. Springer.

Lillicrap, T. P., Cownden, D., Tweed, D. B., and Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7(1):1–10.

Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., and Hinton, G. (2020). Backpropagation and the brain.
Nature Reviews Neuroscience , pages 1–12.Martens, J. (2010). Deep learning via hessian-free optimization. In
ICML , volume 27, pages 735–742.Martens, J. (2014). New insights and perspectives on the natural gradient method. arXiv preprintarXiv:1412.1193 .Miller, P. and Wang, X.-J. (2006). Inhibitory control by an integral feedback signal in prefrontal cortex: amodel of discrimination between sequential stimuli.
Proceedings of the National Academy of Sciences ,103(1):201–206.Moskovitz, T. H., Litwin-Kumar, A., and Abbott, L. (2018). Feedback alignment in deep convolutional networks. arXiv preprint arXiv:1812.06488 .Murray, J. D., Bernacchia, A., Freedman, D. J., Romo, R., Wallis, J. D., Cai, X., Padoa-Schioppa, C., Pasternak,T., Seo, H., Lee, D., et al. (2014). A hierarchy of intrinsic timescales across primate cortex.
Natureneuroscience , 17(12):1661.Nøkland, A. (2016). Direct feedback alignment provides learning in deep neural networks. In
Advances inneural information processing systems , pages 1037–1045.O’Reilly, R. C. (1996). Biologically plausible error-driven learning using local activation differences: Thegeneralized recirculation algorithm.
Neural computation , 8(5):895–938.Payeur, A., Guerguiev, J., Zenke, F., Richards, B., and Naud, R. (2020). Burst-dependent synaptic plasticity cancoordinate learning in hierarchical circuits. bioRxiv .Pearlmutter, B. A. (1994). Fast exact multiplication by the hessian.
Neural computation , 6(1):147–160.Pizlo, Z. (2001). Perception viewed as an inverse problem.
Vision research , 41(24):3145–3161.Rao, R. P. and Ballard, D. H. (1999). Predictive coding in the visual cortex: a functional interpretation of someextra-classical receptive-field effects.
Nature neuroscience , 2(1):79–87.Roelfsema, P. R. and Holtmaat, A. (2018). Control of synaptic plasticity in deep cortical networks.
NatureReviews Neuroscience , 19(3):166.Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986). Learning representations by back-propagatingerrors. nature , 323(6088):533–536.Sacramento, J., Costa, R. P., Bengio, Y., and Senn, W. (2018). Dendritic cortical microcircuits approximate thebackpropagation algorithm. In
Advances in Neural Information Processing Systems , pages 8721–8732.Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning indeep linear neural networks. arXiv preprint arXiv:1312.6120 .Scellier, B. and Bengio, Y. (2017). Equilibrium propagation: Bridging the gap between energy-based modelsand backpropagation.
Frontiers in computational neuroscience , 11:24.Schraudolph, N. N. (2002). Fast curvature matrix-vector products for second-order gradient descent.
Neuralcomputation , 14(7):1723–1738. omvanshi, P. R., Patel, A. K., Bhartiya, S., and Venkatesh, K. (2015). Implementation of integral feedbackcontrol in biological systems. Wiley Interdisciplinary Reviews: Systems Biology and Medicine , 7(5):301–316.Vanbiervliet, J., Vandereycken, B., Michiels, W., Vandewalle, S., and Diehl, M. (2009). The smoothed spectralabscissa for robust stability optimization.
SIAM Journal on Optimization , 20(1):156–171.Vértes, E. and Sahani, M. (2019). A neurally plausible model learns successor representations in partiallyobservable environments. In
Advances in Neural Information Processing Systems , pages 13692–13702.Whittington, J. C. and Bogacz, R. (2017). An approximation of the error backpropagation algorithm in apredictive coding network with local hebbian synaptic plasticity.
Neural computation , 29(5):1229–1262.Whittington, J. C. and Bogacz, R. (2019). Theories of error back-propagation in the brain.
Trends in cognitivesciences .Wolpert, D. M. and Kawato, M. (1998). Multiple paired forward and inverse models for motor control.
Neuralnetworks , 11(7-8):1317–1329.Xiao, W., Chen, H., Liao, Q., and Poggio, T. (2018). Biologically-plausible learning algorithms can scale tolarge datasets. arXiv preprint arXiv:1811.03567 . Supplementary materials
As noted in Section 3.2, the eigenvalues of the matrix (W_l B_l − αI) provide a good measure of the linear stability of dynamic inversion (true linear stability is measured by the eigenvalues of the block matrix in Eq. (S9) below). This precludes stability for α = 0 and d_l > d_{l−1} (expanding layer), as the matrix product W_l B_l will then be singular. To address this, we propose an alternative control architecture:

d/dt δ_{l−1} = −α δ_{l−1} + B_l δ_l − B_l δ̃_l = −α δ_{l−1} + B_l e_l,   (S1)
d/dt δ̃_l = −δ̃_l + W_l δ_{l−1},   (S2)

where now the target error for layer l−1, δ_{l−1}, integrates the error between δ_l and δ̃_l directly, through the feedback matrix B_l. This can be interpreted as proportional feedback control with a fast controller. In this system, stability instead depends on the matrix B_l W_l − αI (assuming that the readout dynamics are fast, so that δ̃_l = W_l δ_{l−1}). Note that this scheme either requires identical feedback weights for the target error δ_l and the current estimate δ̃_l, or a separate population which calculates the error between these, propagated back as B_l (δ_l − δ̃_l) = B_l e_l.

Following the derivations in Botev et al. (2017), we can write the block-diagonal sample Gauss-Newton (GN) matrix for a particular layer l as

Ḡ_l = Q_{l−1} ⊗ G_l,   (S3)

where Q_{l−1} = h_{l−1} h_{l−1}^T is the sample input covariance to layer l, and G_l is the "pre-activation" GN matrix, defined recursively as

G_l = D_l W_{l+1}^T G_{l+1} W_{l+1} D_l = D_l W_{l+1}^T C_{l+1} C_{l+1}^T W_{l+1} D_l,   (S4)

with D_l = diag(g′(a_l)), and C_l a square-root representation of G_l. The GN update to the weight matrix of layer l can be written in vectorized form as Δvec(W_l) ∝ −Ḡ_l^{−1} g, where g is a vectorized version of the standard backprop gradient, as in Eq. (2). In order to avoid vectorization (and thus simplify the comparison with dynamic inversion), we assume that the input to this layer is whitened, making Q_{l−1} = I.
This allows us to write the GN update in non-vectorized form:

ΔW_l ∝ −G_l^{−1} δ_l h_{l−1}^T = (D_l W_{l+1}^T C_{l+1} C_{l+1}^T W_{l+1} D_l)^{−1} D_l W_{l+1}^T δ_{l+1} h_{l−1}^T   (S5)
     ≈ (C_{l+1}^T W_{l+1} D_l)^+ C_{l+1}^+ δ_{l+1} h_{l−1}^T.   (S6)

Note that the approximate equality rests on the assumption that (C_{l+1}^T W_{l+1} D_l) has a left pseudoinverse and that C_{l+1} has a right pseudoinverse, which depends on the relative dimensionality of layers l and l+1. Considering the simplest case, optimizing the penultimate set of weights W_{L−1} for a network solving regression with squared error loss, we have G_L = I (and thus C_L = I) and δ_L = e, so the update becomes

ΔW_{L−1} ∝ −(W_L D_{L−1})^+ e h_{L−2}^T,   (S7)

which is the same as Eq. (17). As mentioned in the main text, this bears a resemblance to dynamic inversion, which in this case would use the update

ΔW_{L−1} ∝ −g^{−1}(W_L^+ e) h_{L−2}^T.   (S8)

A more thorough analysis of the relationship between Eqs. (S7) and (S8) (as well as of the types of nonlinear inversion found in (S8), and a more general comparison of dynamic inversion and GN optimization) is merited, but is beyond the scope of this study. We note that layer-wise whitening is performed in a recent model proposing to map natural gradient learning onto feedforward networks (Desjardins et al., 2015), suggesting that the strategic placement of whitening transformations in a network with dynamic inversion may produce a more accurate approximation to Gauss-Newton or natural gradient optimization, while still being biologically plausible.

In general, the dynamic inversion system dynamics for a particular layer l are not stable when initialized with random matrices W_l and B_l (R-Init). We follow the procedures outlined in Vanbiervliet et al. (2009) and Hennequin et al. (2014) to optimize linear stability by minimizing the smoothed spectral abscissa (SSA), a smooth relaxation of the spectral abscissa (the maximum real part of the eigenvalues).
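As a numerical sanity check of the step from (S5) to (S6): when C_{l+1} is square and invertible and A = C_{l+1}^T W_{l+1} D_l has full column rank, the two expressions agree. The dimensions below are arbitrary illustrative choices, not values from the paper.

```python
import numpy as np

# Check that the GN direction (A^T A)^{-1} D W^T delta (Eq. S5) matches
# A^+ C^+ delta (Eq. S6), with A = C^T W D. All shapes are illustrative:
# layer l has 4 units, layer l+1 has 6.
rng = np.random.default_rng(2)
d_l, d_next = 4, 6
W = rng.standard_normal((d_next, d_l))     # W_{l+1}
D = np.diag(rng.uniform(0.5, 1.0, d_l))    # D_l = diag(g'(a_l))
C = rng.standard_normal((d_next, d_next))  # square-root factor of G_{l+1}
delta = rng.standard_normal(d_next)        # delta_{l+1}

A = C.T @ W @ D
lhs = np.linalg.solve(A.T @ A, D @ W.T @ delta)      # direction in Eq. (S5)
rhs = np.linalg.pinv(A) @ np.linalg.pinv(C) @ delta  # direction in Eq. (S6)
```

Since a generic square C is invertible, `pinv(C)` equals `inv(C)` here, and the approximate equality in (S6) becomes exact.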
The full system matrix can be written in block form as

M_l = [ −I    B_l ]
      [ W_l  −αI ] ,   (S9)

with the first and second rows corresponding to the dynamics of δ_{l−1} and u_l, respectively, of Eqs. (5) and (10). In brief, we calculate the gradient of the SSA with respect to the matrix B_l, and take small steps until the maximum eigenvalue is sufficiently negative. We refer the reader to the references above for details. SSA optimization can be done on both W_l and B_l, but we chose to optimize only B_l in order to retain full control over the initialization of W_l.

We provide pseudocode for recursively calculating the backpropagated error signals (δ_l) for dynamic inversion (DI) and non-dynamic inversion (NDI). In addition to inversion through a single layer (sequential scheme), we also provide pseudocode for the repeat-layer and single-loop DI schemes for the case of a network with two hidden layers. Following the calculation of the error signals, weights and biases are updated according to the standard backpropagation rules (Eq. (2)).
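In practice, the stability monitor amounts to assembling the block matrix of Eq. (S9) and checking its spectral abscissa. A minimal NumPy sketch follows; the shapes and the choice B = −Wᵀ (which is provably stabilizing in the linear case, unlike a generic random B) are our own illustrative assumptions.

```python
import numpy as np

def system_matrix(W, B, alpha):
    """Assemble the block matrix of Eq. (S9): [[-I, B], [W, -alpha*I]]."""
    d_l, d_prev = W.shape
    top = np.hstack([-np.eye(d_prev), B])
    bottom = np.hstack([W, -alpha * np.eye(d_l)])
    return np.vstack([top, bottom])

def spectral_abscissa(M):
    """Maximum real part of the eigenvalues (the quantity relaxed by the SSA)."""
    return np.max(np.linalg.eigvals(M).real)

rng = np.random.default_rng(3)
W = rng.standard_normal((3, 5))                     # layer l-1 -> l
print(spectral_abscissa(system_matrix(W, -W.T, alpha=0.1)))  # negative: stable
```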
Algorithm 1: Dynamic Inversion (Sequential)

function DYN-INV(W_l, B_l, g(·), α, δ_l, dt, tsteps):
    δ_{l−1}, u_l ← 0
    for t = 1 to tsteps do
        δ_{l−1} += dt · (−δ_{l−1} + B_l u_l)
        u_l += dt · (−α u_l + W_l g(δ_{l−1}) − δ_l)
    end for
    return δ_{l−1}
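A minimal NumPy translation of Algorithm 1. The function name and toy setup are our own; B = −Wᵀ is chosen because it makes the linear dynamics stable, whereas a random B generally requires the SSA procedure described above.

```python
import numpy as np

def dyn_inv(W, B, g, alpha, delta_l, dt, tsteps):
    """Sequential dynamic inversion (Algorithm 1): relax the coupled error
    dynamics until delta_prev approximately inverts W through feedback B."""
    delta_prev = np.zeros(W.shape[1])  # delta_{l-1}: target error for layer l-1
    u = np.zeros(W.shape[0])           # control variable u_l
    for _ in range(tsteps):
        delta_prev += dt * (-delta_prev + B @ u)
        u += dt * (-alpha * u + W @ g(delta_prev) - delta_l)
    return delta_prev

# Toy usage: linear g, contracting layer (3 < 5), B = -W^T, alpha = 0.
# The fixed point is the pseudoinverse solution delta_{l-1} = W^+ delta_l.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
delta_l = rng.standard_normal(3)
est = dyn_inv(W, -W.T, lambda x: x, 0.0, delta_l, dt=0.02, tsteps=50000)
# compare with np.linalg.pinv(W) @ delta_l
```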
Algorithm 2: Two-Layer Dynamic Inversion (Repeat hidden layers)

function REP-2L-DYN-INV(W_{l−1}, W_l, B_{l−1}, B_l, g_l(·), g_{l−1}(·), α_l, α_{l−1}, δ_l, dt, tsteps):
    δ_{l−1}, δ_{l−2}, u_l, u_{l−1} ← 0
    for t = 1 to tsteps do
        δ_{l−1} += dt · (−δ_{l−1} + B_l u_l)
        u_l += dt · (−α_l u_l + W_l g_l(δ_{l−1}) − δ_l)
        δ_{l−2} += dt · (−δ_{l−2} + B_{l−1} u_{l−1})
        u_{l−1} += dt · (−α_{l−1} u_{l−1} + W_{l−1} g_{l−1}(δ_{l−2}) − δ_{l−1})
    end for
    return (δ_{l−1}, δ_{l−2})
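A NumPy sketch of the repeat-layer scheme (Algorithm 2) for a linear network. Names and the stabilizing choice B = −Wᵀ are illustrative assumptions; with linear g, the fixed point cascades the layer-wise pseudoinverses.

```python
import numpy as np

def rep_2l_dyn_inv(W1, W2, B1, B2, g1, g2, a1, a2, delta_l, dt, tsteps):
    """Algorithm 2 sketch. W1 = W_{l-1}, W2 = W_l, g1 = g_{l-1}, g2 = g_l,
    a1 = alpha_{l-1}, a2 = alpha_l. Returns (delta_{l-1}, delta_{l-2})."""
    delta_prev1 = np.zeros(W2.shape[1])  # delta_{l-1}
    delta_prev2 = np.zeros(W1.shape[1])  # delta_{l-2}
    u2 = np.zeros(W2.shape[0])           # u_l
    u1 = np.zeros(W1.shape[0])           # u_{l-1}
    for _ in range(tsteps):
        delta_prev1 += dt * (-delta_prev1 + B2 @ u2)
        u2 += dt * (-a2 * u2 + W2 @ g2(delta_prev1) - delta_l)
        delta_prev2 += dt * (-delta_prev2 + B1 @ u1)
        u1 += dt * (-a1 * u1 + W1 @ g1(delta_prev2) - delta_prev1)
    return delta_prev1, delta_prev2

rng = np.random.default_rng(4)
W1 = rng.standard_normal((5, 6))   # layer l-2 -> l-1
W2 = rng.standard_normal((3, 5))   # layer l-1 -> l
delta_l = rng.standard_normal(3)
lin = lambda x: x
d1, d2 = rep_2l_dyn_inv(W1, W2, -W1.T, -W2.T, lin, lin, 0.0, 0.0,
                        delta_l, dt=0.02, tsteps=40000)
# d1 ~ pinv(W2) @ delta_l, and d2 ~ pinv(W1) @ d1
```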
Algorithm 3: Two-Layer Dynamic Inversion (Single loop)

function SL-2L-DYN-INV(W_{l−1}, W_l, B, g_l(·), g_{l−1}(·), α, δ_l, dt, tsteps):
    δ_{l−1}, δ_{l−2}, u_l ← 0
    for t = 1 to tsteps do
        δ_{l−2} += dt · (−δ_{l−2} + B u_l)
        δ_{l−1} += dt · (−δ_{l−1} + W_{l−1} g_{l−1}(δ_{l−2}))
        u_l += dt · (−α u_l + W_l g_l(δ_{l−1}) − δ_l)
    end for
    return (δ_{l−1}, δ_{l−2})
Algorithm 4: Error propagation for BP, FA, PBP, and NDI

function ERR-INV(algo, W_l, B_l, g(·), α, δ_l):
    switch (algo):
        case (BP):  M_l = W_l^T
        case (FA):  M_l = B_l
        case (PBP): M_l = W_l^+
        case (NDI): if d_l > d_{l−1} then
                        M_l = (B_l W_l − αI)^{−1} B_l
                    else
                        M_l = B_l (W_l B_l − αI)^{−1}
                    end if
    δ_{l−1} ← M_l δ_l ∘ g′(a_{l−1})
    return δ_{l−1}

Due to instabilities in the PBP and NDI algorithms during the first several iterations of training, we imposed a maximum norm ‖δ‖ on the backpropagated error signals for linear and nonlinear regression (with a separate bound for each task). This was not necessary for MNIST classification or MNIST autoencoding. The constraint did not affect the BP and FA algorithms, and if anything places a handicap on the dynamic inversion algorithms.

Stability was measured by computing the maximum real eigenvalue of the block matrix in Eq. (S9). This was monitored during training, and whenever the maximum eigenvalue became positive, SSA optimization was applied to the feedback matrix, as outlined above. In practice, this periodic stabilization was only necessary for linear regression.

The linear regression example used a network with a single hidden layer (dim. 30-20-10), following Lillicrap et al. (2016). Training data were generated as follows: each dimension of the input x was drawn independently from a standard normal distribution, and the target output t was generated by passing this input through a matrix T, with elements drawn from a uniform distribution, such that t = Tx. No bias units were used in the network (nor to generate the test data). The weight matrices (W_1, W_2) were initialized from a random uniform distribution, and all algorithms began with exact copies. The random feedback matrix (B, R-Init) was generated from the same distribution, but for feedback alignment this distribution had a larger spread (as in Lillicrap et al. (2016)).
Training used squared error loss, and we plot the training error as the normalized mean squared error (NMSE), in which the error for each algorithm is normalized by the maximum error across all algorithms and iterations, so that training begins with a normalized error of ~1. The learning rate was fixed, identical for all algorithms, and not optimized.

The nonlinear regression example was also adapted from Lillicrap et al. (2016) and used a network with two hidden layers (dim. 30-20-10-10) and tanh nonlinearities (with a linear output). Training data were generated from a network with the same architecture, but with randomly generated weights and biases (all drawn from a uniform distribution). Feedforward and feedback weight matrices were initialized in the same way as in the linear regression example, and bias weights were initialized to zero. Training loss was again squared error, normalized in the same way as for linear regression. The learning rate was again fixed, identical for all algorithms, and not optimized.

MNIST classification was done on a single-hidden-layer network (dim. 784-1000-10, as in Lillicrap et al. (2016)) with a tanh nonlinearity and a softmax output with cross-entropy loss. The standard training (60000 examples) and test (10000 examples) sets were used. Data were first preprocessed by subtracting the mean from each pixel dimension and normalizing the variance (across all pixels) to 1. Weight matrices and biases were initialized in the same way as for linear and nonlinear regression. Training was performed online (no batches). The learning rate was fixed, identical for all algorithms, and not optimized; an additional small weight decay was also used.
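The MNIST preprocessing described above (per-pixel centering, then a single variance normalization across all pixels) can be sketched as follows; random data stands in for the actual images.

```python
import numpy as np

# Sketch of the MNIST preprocessing step: subtract the mean of each pixel
# dimension, then rescale so the variance across all pixels equals 1.
rng = np.random.default_rng(7)
X = rng.uniform(0.0, 1.0, size=(1000, 784))     # stand-in for flattened images

X_centered = X - X.mean(axis=0, keepdims=True)  # per-pixel mean removal
X_norm = X_centered / X_centered.std()          # overall variance -> 1
```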
MNIST autoencoding was done on a network with architecture 784-500-250-30-250-500-784 and nonlinearities tanh-tanh-linear-tanh-tanh-linear, similar to Hinton and Salakhutdinov (2006) but with one hidden layer removed. The standard MNIST training set was used (60000 examples), and performance was measured on this dataset, without the use of the test set. Data were preprocessed so that each pixel dimension was between 0 and 1 (data were not centered). To speed up simulations, training was done on mini-batches of size 100. The learning rate was identical for all algorithms, together with a small weight decay; learning rate and mini-batch size were not optimized. However, we found that the PBP and NDI algorithms became unstable for larger learning rates.

Code for running all training examples in Python 3 can be found at https://github.com/wpodlaski/dynamic-inversion (dependencies: numpy, scipy, matplotlib, numba, and mlxtend). We note that simulations for nonlinear regression and the MNIST examples are slow (e.g., each algorithm takes 45-60 minutes per epoch on a MacBook laptop with a 3 GHz Intel Core i7 and 16 GB RAM). Code will be provided in PyTorch and/or TensorFlow upon publication.
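For reference, the closed-form error-propagation switch of Algorithm 4 can be sketched in NumPy as follows. The function name and toy setup are illustrative; with B = −Wᵀ and α = 0, the NDI expression reduces to the pseudoinverse used by PBP.

```python
import numpy as np

def err_inv(algo, W, B, alpha, delta_l):
    """Linear version of the error-propagation switch (Algorithm 4); the
    Hadamard product with g'(a_{l-1}) is left to the caller."""
    d_l, d_prev = W.shape
    if algo == "BP":
        M = W.T
    elif algo == "FA":
        M = B
    elif algo == "PBP":
        M = np.linalg.pinv(W)
    elif algo == "NDI":
        if d_l > d_prev:   # expanding layer
            M = np.linalg.solve(B @ W - alpha * np.eye(d_prev), B)
        else:              # contracting layer
            M = B @ np.linalg.inv(W @ B - alpha * np.eye(d_l))
    return M @ delta_l

rng = np.random.default_rng(5)
W = rng.standard_normal((3, 5))
delta_l = rng.standard_normal(3)
ndi = err_inv("NDI", W, -W.T, 0.0, delta_l)
pbp = err_inv("PBP", W, None, 0.0, delta_l)
```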