Event-Based Backpropagation can compute Exact Gradients for Spiking Neural Networks
EventProp: Backpropagation for Exact Gradients in Spiking Neural Networks
Timo C. Wunderlich∗
Kirchhoff-Institute for Physics, Heidelberg University, 69120 Heidelberg, Germany
Current address: Berlin Institute of Health, Charité–Universitätsmedizin, 10117 Berlin, Germany. [email protected]
Christian Pehle∗
Kirchhoff-Institute for Physics, Heidelberg University, 69120 Heidelberg, Germany. [email protected]
September 22, 2020

Abstract
We derive the backpropagation algorithm for spiking neural networks composed of leaky integrate-and-fire neurons operating in continuous time. This algorithm, EventProp, computes the exact gradient of an arbitrary loss function of spike times and membrane potentials by backpropagating errors in time. For the first time, by leveraging methods from optimal control theory, we are able to backpropagate errors through spike discontinuities without approximations or smoothing operations. As errors are backpropagated in an event-based manner (at spike times), EventProp requires storing state variables only at these times, providing favorable memory requirements. EventProp can be applied to spiking networks with arbitrary connectivity, including recurrent, convolutional and deep feed-forward architectures. While we consider the leaky integrate-and-fire neuron model in this work, our methodology to derive the gradient can be applied to other spiking neuron models. We demonstrate learning using gradients computed via EventProp in a deep spiking network using an event-based simulator and a non-linearly separable dataset encoded using spike time latencies. Our work supports the rigorous study of gradient-based methods to train spiking neural networks while providing insights toward the development of learning algorithms in neuromorphic hardware.

∗ The authors have contributed equally.

Introduction
How can we train spiking neural networks to achieve brain-like performance in machine learning tasks? The resounding success and pervasive use of the backpropagation algorithm in deep learning suggest an analogous approach. Spiking neural networks hold the promise of efficient and robust processing of event-based spatio-temporal data as found in biological systems (Maass, 1997). Although the brain is able to learn impressively well using spike-based communication, spiking models are not widely used in machine learning applications. At the same time, learning in spiking neural networks is an active research subject, with a wide variety of continuously proposed algorithms, and the development of spiking neuromorphic hardware receives increasing attention (Roy et al., 2019). A notorious issue in spiking neurons is the hard, non-differentiable spiking threshold that does not permit a straightforward application of differential calculus to compute gradients. Although exact gradients have been derived for special cases, this issue is commonly side-stepped by using smoothed or stochastic neuron models or by replacing the hard threshold function with a surrogate function, leading to the computation of surrogate gradients (Neftci et al., 2019).

In contrast, this work provides the exact gradient for an arbitrary loss function defined using the state variables (spike times, membrane potentials) of a general recurrent spiking neural network composed of leaky integrate-and-fire neurons with hard thresholds. Since feed-forward architectures correspond to recurrent neural networks with block-diagonal weight matrices and convolutions can be represented as sparse linear transformations, deep feed-forward networks and convolutional networks are included as special cases.

Our starting point is the observation that while the membrane potential of a leaky integrate-and-fire neuron is discontinuous at spike times, loss functions depending on spike times and membrane potential trajectories are differentiable almost everywhere in weight space. Specifically, both the spike times and an integral of a smooth loss function over the membrane potential are differentiable almost everywhere, up to the null set that contains the locally defined hypersurfaces in weight space where spikes are added or removed (the critical parameters). The relevant question is then how to compute the derivatives, preferably with the computational efficiency afforded by the backpropagation algorithm and retaining any potential advantages of event-based communication.

In order to solve this problem, it is helpful to appreciate that the backpropagation algorithm as used in artificial neural networks can be derived using the adjoint method (LeCun et al., 1988). Indeed, there is a direct correspondence between the intermediate variables computed in the backpropagation algorithm and the adjoint variables λ used in the adjoint method. The adjoint variables (Lagrange multipliers) of dynamical systems (Pontryagin, 1962) that operate in continuous time are also functions of time, λ(t), and their computation corresponds to the backpropagation of errors in time. The gradient of a given loss function is then a function of the λ(t). See Bradley (2019) for a concise derivation in the context of PDE-constrained optimization.
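For orientation, the following is the standard adjoint-method recipe for a smooth ODE without state jumps, stated in one common sign convention (our summary for context, not a formula taken from the references):

    ẋ(t) = f(x(t), p),   x(0) = x_0,
    L = ∫_0^T l(x(t)) dt,
    λ̇(t) = −(∂f/∂x)^T λ(t) − (∂l/∂x)^T,   λ(T) = 0,
    dL/dp = ∫_0^T λ(t)^T ∂f/∂p dt.

The adjoint λ(t) is integrated backward from t = T, which is the continuous-time analogue of backpropagating errors; the difficulty addressed in this work is that the spiking dynamics additionally contain parameter-dependent state jumps.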
It remains to treat the discontinuities occurring in spiking neural networks. The computation of partial derivatives (parameter sensitivities) for hybrid systems exhibiting both continuous and discrete dynamics (e.g., a set of differential equations combined with state variable jumps) is an established topic in optimal control theory (Barton and Lee, 2002; Rozenwasser and Yusupov, 2019). For these hybrid systems, the sensitivity ∂x/∂p (t) of a state variable x with respect to a parameter p generally experiences jumps at the points of discontinuity. The relation between the sensitivities before and after a given discontinuity was first studied in the 1960s (De Backer, 1964; Rozenvasser, 1967). A more general theoretical framework was developed thirty years later by Galán et al. (1999), who provide existence and uniqueness theorems for sensitivity trajectories of hybrid systems. These theorems rely on Gronwall's theorem (Gronwall, 1919) for the existence of sensitivity trajectories between transitions and on the implicit function theorem to relate sensitivities before and after a given transition. The adjoint method can be combined with these insights to derive the adjoint dynamics for hybrid systems (Serban and Recuero, 2019). Our work uses these methods to derive the gradient for a spiking neural network.

Our results show that the gradient can be computed using an adjoint spiking neural network which backpropagates errors at the spike times. This immediately suggests a simple algorithm, EventProp, to compute the gradient. Since the adjoint spiking network backpropagates errors at the spike times, the algorithm computes gradients using an event-based communication scheme and is amenable to neuromorphic implementation.

Previous Work
We provide a compact overview of gradient-based approaches to learning in spiking neural networks and refer the reader to the following review articles for a comprehensive survey of this topic: Pfeiffer and Pfeil (2018) and Tavanaei et al. (2019) discuss learning in deep spiking networks, Roy et al. (2019) discusses learning along with the history and future of neuromorphic computing, and Neftci et al. (2019) focuses on the surrogate gradient approach.

Surrogate gradients replace non-differentiable spiking threshold functions with smooth functions for the purposes of backpropagation and have been used to train spiking networks in a variety of settings (e.g., Esser et al., 2016; Bellec et al., 2018; Zenke and Ganguli, 2018; Shrestha and Orchard, 2018). Apart from surrogate gradients, several publications provide exact gradients for first-spike-time based loss functions and leaky integrate-and-fire neurons: Bohte et al. (2000) provides the gradient for at most one spike per layer and this result was subsequently generalized to an arbitrary number of spikes as well as recurrent connectivity (Booij and tat Nguyen, 2005; Xu et al., 2013). While these publications provide recursive relations for the gradient that can be implicitly computed using backpropagation, we explicitly provide the dynamical system that implements backpropagation and show that it represents an adjoint spiking network which transmits errors at spike times. In addition, we also consider voltage-dependent loss functions, and our methodology can be applied to neuron models without analytic expressions for the post-synaptic potential (PSP) kernels. The chronotron (Florian, 2012) uses a gradient-based learning rule based on the Victor-Purpura metric which enables single leaky integrate-and-fire neurons to learn a target spike train. Our work, as well as the works mentioned above which derive exact gradients, applies the implicit function theorem to the relation v(t, w) − ϑ = 0 to differentiate spike times with respect to synaptic weights. A different approach is to consider ratios of the neuronal time constants τ_mem, τ_syn for which analytic expressions for first spike times can be given and to derive the corresponding gradients, as done in Göltz et al. (2019); Comsa et al. (2020); Mostafa (2017); Kheradpisheh and Masquelier (2020). Our work encompasses these results as special cases.

The seminal Tempotron model uses gradient descent to adjust the sub-threshold voltage maximum in a single neuron (Gütig and Sompolinsky, 2006) and has recently been generalized to the spike threshold surface formalism (Gütig, 2016) that uses the exact gradient of the critical thresholds ϑ*_k at which a leaky integrate-and-fire neuron transitions from emitting k to k − 1 spikes; computing this gradient is not considered in this work. The adjoint method was recently used to optimize neural ordinary differential equations (Chen et al., 2018) and neural jump stochastic differential equations (Jia and Benson, 2019) as well as to derive the gradient for a smoothed spiking neuron model without reset (Huh and Sejnowski, 2018).

We define a network of N standard leaky integrate-and-fire neurons with arbitrary (up to self-connections) recurrent connectivity (table 1). We set the leak potential to zero and choose parameter-independent initial conditions. Note that the Spike Response Model (SRM) (Gerstner and Kistler, 2002) with double-exponential or α-shaped PSPs is generally an integral expression of the model given in table 1 with corresponding time constants.
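To make the last statement concrete (our sketch, not spelled out in the original text): integrating the free dynamics of table 1 with zero leak potential shows that, for τ_mem ≠ τ_syn, a pre-synaptic spike arriving at time t_s across a weight w contributes the double-exponential PSP

    ε(t − t_s) = w · τ_syn / (τ_mem − τ_syn) · ( e^{−(t−t_s)/τ_mem} − e^{−(t−t_s)/τ_syn} )   for t ≥ t_s

to the membrane potential, while each of the neuron's own spikes at t_k adds a reset kernel −ϑ e^{−(t−t_k)/τ_mem}; the α-shaped kernel is the limit τ_mem → τ_syn. Summing these kernels gives the SRM form referred to above.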
Free Dynamics:
    τ_mem dV/dt = −V + I
    τ_syn dI/dt = −I
Transition Condition:
    (V)_n − ϑ = 0 and (V̇)_n ≠ 0 for any n
Jumps at Transition:
    (V⁺)_n = (V⁻)_n − ϑ
    I⁺ = I⁻ + W e_n

Table 1: The leaky integrate-and-fire spiking neural network model. In between spikes, the vector of membrane potentials V and synaptic currents I evolve according to the free dynamics. When some neuron n crosses the threshold ϑ, the transition condition is fulfilled, causing a spike. This leads to a reset of the membrane potential as well as post-synaptic current jumps. W ∈ R^{N×N} is the weight matrix with zero diagonal and e_n ∈ R^N is the unit vector with a 1 at index n and a 0 at all other indices. We use − and + to denote quantities before and after a given spike.

Main Result

Consider smooth loss functions l_V(V, t), l_p(t^post) that depend on the membrane potentials V, time t and the set of post-synaptic spike times t^post. The total loss is given by

    L = l_p(t^post) + ∫_0^T l_V(V(t), t) dt.    (1)

As will be derived in the following section, the gradient of the total loss with respect to a specific weight w_ij = (W)_ij that connects pre-synaptic neuron i to post-synaptic neuron j is given by a sum over the spikes caused by i,

    dL/dw_ij = −τ_syn Σ_{spikes from i} (λ_I)_j    (2)

where λ_I is the adjoint variable (Lagrange multiplier) corresponding to I. Equation (2) therefore samples the post-synaptic neuron's adjoint variable (λ_I)_j at the spike times caused by neuron i.

After the neuron dynamics given by table 1 have been computed from t = 0 to t = T, λ_I is computed in reverse time (i.e., from t = T to t = 0) as the solution of the system defined in table 2. The dynamical system defined by table 2 is the adjoint spiking network to table 1 which backpropagates error signals at the spike times t^post.

Free Dynamics:
    τ_mem λ′_V = −λ_V − ∂l_V/∂V
    τ_syn λ′_I = −λ_I + λ_V
Transition Condition:
    t − t^post_k = 0 for any k
Jump at Transition:
    (λ⁻_V)_{n(k)} = (λ⁺_V)_{n(k)} + 1/(τ_mem (V̇⁻)_{n(k)}) [ ϑ (λ⁺_V)_{n(k)} + (W^T (λ⁺_V − λ_I))_{n(k)} + ∂l_p/∂t^post_k + l⁻_V − l⁺_V ]

Table 2: The adjoint spiking network to table 1 that computes the adjoint variable λ_I needed for the gradient (eq. (2)). The adjoint variables are computed in reverse time (i.e., from t = T to t = 0), with ′ = −d/dt denoting the reverse time derivative. (λ⁻_V)_{n(k)} experiences jumps at the spike times t^post_k, where n(k) is the index of the neuron that caused the kth spike. Computing this system amounts to the backpropagation of errors in time. The initial conditions are λ_V(T) = λ_I(T) = 0 and we provide λ⁻_V in terms of λ⁺_V because the computation happens in reverse time.

Equation (2) and table 2 suggest a simple algorithm, EventProp, to compute the gradient (algorithm 1). Notably, if the loss is voltage-independent (i.e., l_V = 0), the backward pass of the algorithm requires only the spike times t^post and the synaptic current of the firing neurons at their respective firing times. The memory requirement of the algorithm therefore scales as O(rTN), where r is the firing rate and T is the total trial duration. In case of a voltage-dependent loss l_V ≠ 0, the algorithm has to store the non-zero components of ∂l_V/∂V along the forward trajectory. Note that in many practical scenarios as found in deep learning, the loss l_V depends only on the state of a constant number of neurons, irrespective of network size.
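To make the dynamics of table 1 concrete, the following minimal Python sketch integrates the free dynamics with explicit Euler steps and applies the threshold, reset and current jumps, recording for every post-synaptic spike the quantities the backward pass needs. All names, the step size and the weight-matrix orientation (W[i, j] is the weight from pre-synaptic neuron i to post-synaptic neuron j) are illustrative assumptions, not the event-based simulator used in this work.

import numpy as np

def forward(W, input_spikes, T=0.1, dt=1e-4, tau_mem=20e-3, tau_syn=5e-3, theta=1.0):
    """Forward pass of the LIF network in table 1 (explicit Euler between spikes).

    input_spikes: list of (time, neuron index, weight) for externally arriving spikes.
    Returns a list of (spike time, firing neuron, synaptic current of that neuron),
    which is what EventProp stores for a voltage-independent loss.
    """
    N = W.shape[0]
    V, I = np.zeros(N), np.zeros(N)
    recorded = []
    pending = sorted(input_spikes)
    for step in range(int(round(T / dt))):
        t = step * dt
        while pending and pending[0][0] <= t:      # deliver external input spikes
            _, n, w_in = pending.pop(0)
            I[n] += w_in
        V += dt / tau_mem * (-V + I)               # free dynamics of table 1
        I += dt / tau_syn * (-I)
        for n in np.flatnonzero(V >= theta):       # transition condition
            recorded.append((t, n, I[n]))          # store t_post and I_n(t_post)
            V[n] -= theta                          # reset: (V+)_n = (V-)_n - theta
            I += W[n]                              # current jumps of post-synaptic partners
    return recorded

A more faithful implementation would locate the threshold crossings by root bracketing between events, as the event-based simulator used in this work does, rather than on the Euler grid.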
If l_V depends on the voltage of non-firing readout neurons, we have l⁺_V = l⁻_V and the corresponding term in the jump given in table 2 vanishes.

For voltage-independent losses (i.e., l_V = 0), all variables only need to be computed at spike times. In that case, EventProp can be computed in a purely event-based manner. Figure 1 illustrates how EventProp computes gradients for two leaky integrate-and-fire neurons where one neuron receives Poisson spike trains via synapses and is connected to the other neuron via a single feed-forward weight w.

Algorithm 1 EventProp: Algorithm to compute eq. (2).
Require:
Input spikes, losses l_p, l_V, parameters W, τ_mem, τ_syn, initial conditions V(0), I(0)
grad ← 0
Compute neuron state trajectory (table 1) from t = 0 to t = T:    ▷ Forward pass
    for all spikes k, store spike time t^post_k and the firing neuron's component of I(t^post_k)
    if l_V ≠ 0, also store ∂l_V/∂V
Compute adjoint state trajectory (table 2) from t = T to t = 0:    ▷ Backward pass
    accumulate grad_ij ← grad_ij − τ_syn (λ_I)_j for each spike from neuron i to j
return grad
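The following Python sketch spells out the backward pass of algorithm 1 for a voltage-independent, spike-time loss (l_V = 0): the adjoint dynamics of table 2 are integrated in reverse time with explicit Euler steps, the jump of λ_V is applied at each recorded spike, and eq. (2) is accumulated. It consumes the (time, neuron, current) records of the forward sketch above; names, step size and weight orientation are the same illustrative assumptions as there, not the authors' implementation.

import numpy as np

def backward(W, recorded, dl_dt, T=0.1, dt=1e-4, tau_mem=20e-3, tau_syn=5e-3, theta=1.0):
    """recorded: (spike time, firing neuron, synaptic current) tuples from the forward pass.
    dl_dt[k]: derivative dl_p/dt_post_k for the k-th recorded spike (e.g. 1.0 for every
    spike of a readout neuron if l_p is the sum of its spike times, 0.0 otherwise).
    Returns dL/dW; entries for synapses that do not exist should be masked afterwards."""
    N = W.shape[0]
    lam_V, lam_I = np.zeros(N), np.zeros(N)
    grad = np.zeros_like(W)
    # process spikes from the latest to the earliest while integrating in reverse time
    events = sorted(zip(recorded, dl_dt), key=lambda e: e[0][0], reverse=True)
    t = T
    while t > 0.0:
        while events and events[0][0][0] >= t - dt:
            (t_k, n, I_n), dl = events.pop(0)
            vdot_minus = (I_n - theta) / tau_mem          # membrane slope just before reset
            grad[n, :] -= tau_syn * lam_I                 # eq. (2): sample lambda_I of targets
            lam_V[n] += (theta * lam_V[n] + W[n] @ (lam_V - lam_I) + dl) / (tau_mem * vdot_minus)
            # lambda_I is continuous across spikes (eq. (30c))
        lam_V += dt / tau_mem * (-lam_V)                  # free adjoint dynamics (table 2, l_V = 0)
        lam_I += dt / tau_syn * (-lam_I + lam_V)
        t -= dt
    return grad

Combining the two sketches and comparing against central differences of the loss reproduces the kind of gradient check shown in fig. 1.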
Figure 1: Illustration of EventProp-based gradient calculation in two leaky integrate-and-fire neurons connected with weight w and a spike-time dependent loss L. The forward pass (B, C) computes the spike times for both neurons and the backward pass (D-G) backpropagates errors at spike times, yielding the gradient as given in eq. (2). A: The upper neuron receives independent Poisson spike trains with frequency
200 Hz across randomly initialized weights and is connected to the lower neuron via a single weight w. The loss L is a sum of the spike times of the lower neuron. B, C: Membrane potential of upper and lower neuron. Spike times of the upper neuron are indicated using arrows.
D, E: Adjoint variable λ_I of upper and lower neuron. The lower neuron backpropagates its error signal λ_V − λ_I at the upper neuron's spike times (indicated by arrows). F, G: Accumulated gradient for one of the input weights of the upper neuron and the weight w connecting the upper and lower neuron. EventProp computes the adjoint variables from t = T to t = 0 and accumulates the gradients by sampling −τ_syn λ_I when spikes are transmitted across the respective weight. The numerical gradients were calculated via central differences and match the final accumulated gradients up to a small relative deviation.

Derivation

The following derivation is specific to the model given in table 1. A fully general derivation of sensitivity analysis in hybrid systems can be found in Galán et al. (1999) or Serban and Recuero (2019).

The differential equations defining the free dynamics in implicit form are

    f_V ≡ τ_mem V̇ + V − I = 0,    (3a)
    f_I ≡ τ_syn İ + I = 0,    (3b)

where f_V, f_I are again vectors of size N. We now split up the integral in eq. (1) at the spike times t^post and add vectors of Lagrange multipliers λ_V, λ_I that fix the system dynamics f_V, f_I between transitions:

    dL/dw_ij = d/dw_ij l_p(t^post) + Σ_{k=0}^{N_post} ∫_{t^post_k}^{t^post_{k+1}} [ l_V(V, t) + λ_V · f_V + λ_I · f_I ] dt,    (4)

where we set t^post_0 = 0 and t^post_{N_post+1} = T and x · y is the dot product of two vectors x, y. Using eq. (3) and the short-hand notation s_X ≡ ∂X/∂w_ij (the parameter sensitivities in the language of control theory), we have, as per Gronwall's theorem (Gronwall, 1919),

    ∂f_V/∂w_ij = τ_mem ṡ_V + s_V − s_I,    (5a)
    ∂f_I/∂w_ij = τ_syn ṡ_I + s_I,    (5b)

where we have commuted the derivatives, ∂/∂w_ij d/dt = d/dt ∂/∂w_ij. The gradient then becomes, by application of the Leibniz integral rule,

    dL/dw_ij = Σ_{k=0}^{N_post} [ ∫_{t^post_k}^{t^post_{k+1}} [ ∂l_V/∂V · s_V + λ_V · (τ_mem ṡ_V + s_V − s_I) + λ_I · (τ_syn ṡ_I + s_I) ] dt
        + ∂l_p/∂t^post_k dt^post_k/dw_ij + l⁻_{V,k+1} dt^post_{k+1}/dw_ij − l⁺_{V,k} dt^post_k/dw_ij ],    (6)

where l^±_{V,k} is the voltage-dependent loss evaluated before (−) or after (+) the transition and we have used that f_V = f_I = 0 along all considered trajectories. Using partial integration, we have

    ∫_{t^post_k}^{t^post_{k+1}} λ_V · ṡ_V dt = − ∫_{t^post_k}^{t^post_{k+1}} λ̇_V · s_V dt + [ λ_V · s_V ]_{t^post_k}^{t^post_{k+1}},    (7)
    ∫_{t^post_k}^{t^post_{k+1}} λ_I · ṡ_I dt = − ∫_{t^post_k}^{t^post_{k+1}} λ̇_I · s_I dt + [ λ_I · s_I ]_{t^post_k}^{t^post_{k+1}}.    (8)

Using the short-hand notation dt^post_k/dw_ij ≡ τ^post_k and collecting terms in s_V, s_I, we have

    dL/dw_ij = Σ_{k=0}^{N_post} [ ∫_{t^post_k}^{t^post_{k+1}} [ (∂l_V/∂V − τ_mem λ̇_V + λ_V) · s_V + (−τ_syn λ̇_I + λ_I − λ_V) · s_I ] dt
        + ∂l_p/∂t^post_k τ^post_k + τ_mem [ λ_V · s_V ]_{t^post_k}^{t^post_{k+1}} + τ_syn [ λ_I · s_I ]_{t^post_k}^{t^post_{k+1}} + l⁻_{V,k+1} τ^post_{k+1} − l⁺_{V,k} τ^post_k ].    (9)

This form allows us to set the dynamics of the adjoint variables between transitions. Since the integration of the adjoint variables is done from t = T to t = 0 in practice (i.e., in reverse time), it is practical to transform the time derivative as d/dt → −d/dt. Denoting the new time derivative by ′, we have

    τ_mem λ′_V = −λ_V − ∂l_V/∂V,    (10a)
    τ_syn λ′_I = −λ_I + λ_V.    (10b)
The integrand therefore vanishes along the trajectory and we are left with a sum over the transitions. Since the initial conditions of V and I are assumed to be parameter independent, we have s_V = s_I = 0 at t = 0. We set the initial condition for the adjoint variables to be λ_V = λ_I = 0 at t = T. We are therefore left with a sum over transitions ξ_k evaluated at the transition times t^post_k,

    dL/dw_ij = Σ_{k=1}^{N_post} ξ_k    (11)

with the definition

    ξ_k ≡ ∂l_p/∂t^post_k τ^post_k + l⁻_{V,k} τ^post_k − l⁺_{V,k} τ^post_k + [ τ_mem (λ⁻_V · s⁻_V − λ⁺_V · s⁺_V) + τ_syn (λ⁻_I · s⁻_I − λ⁺_I · s⁺_I) ] |_{t^post_k}.    (12)

We proceed by deriving the relationship between the adjoint variables before and after each transition. Since the computation of the adjoint variables happens in reverse time in practice, we provide λ⁻ in terms of λ⁺. Consider a spike caused by the nth neuron, with all other neurons m ≠ n remaining silent. We start by first deriving the relationships between s⁺_V, s⁻_V and s⁺_I, s⁻_I.

Figure 2: In this sketch, the relation v(t, w) − ϑ = 0 defines an implicit function (black line along which dv = 0). The critical point where the gradient diverges is shown in red.

Membrane Potential Transition
By considering the relations between V⁺, V⁻ and V̇⁺, V̇⁻, we can derive the relation between s⁺_V and s⁻_V at each spike. Each spike at t^post is triggered by a neuron's membrane potential crossing the threshold. We therefore have, at t^post,

    v⁻_n − ϑ = 0,    (13)

where v⁻_n = (V⁻)_n is the membrane potential of the spiking neuron n before the transition. This relation defines a differentiable function of w_ij via the implicit function theorem (illustrated in fig. 2, see also Yang et al., 2014). Differentiation of this relation yields

    (s⁻_V)_n + v̇⁻_n τ^post = 0.    (14)

Note that corresponding relations were previously used to derive gradient-based learning rules for spiking neuron models (Bell and Parra, 2005; Bohte et al., 2000; Booij and tat Nguyen, 2005; Xu et al., 2013; Florian, 2012); in contrast to the suggestion in Bohte et al. (2000), eq. (14) is not an approximation but rather an exact relation at all non-critical parameters and invalid at all critical parameters. Since we only allow transitions for v̇_n ≠ 0, we have

    τ^post = − (1 / v̇⁻_n) (s⁻_V)_n.    (15)

Because the spiking neuron's membrane potential is reset to zero, we have

    v⁺_n = v⁻_n − ϑ.    (16)

This implies, again via the implicit function theorem, the relationship between the parameter sensitivities for the membrane potential before and after the spike,

    (s⁺_V)_n + v̇⁺_n τ^post = (s⁻_V)_n + v̇⁻_n τ^post    (17)
    ⇒ (s⁺_V)_n = (v̇⁺_n / v̇⁻_n) (s⁻_V)_n.    (18)

Since we have v⁺_m = v⁻_m for all other, non-spiking neurons m ≠ n, it holds that

    (s⁺_V)_m + v̇⁺_m τ^post = (s⁻_V)_m + v̇⁻_m τ^post.    (19)

Because the spiking neuron n causes the synaptic current of all neurons m ≠ n to jump by w_nm, we have

    τ_mem v̇⁺_m = τ_mem v̇⁻_m + w_nm    (20)

and therefore get with eq. (14)

    (s⁺_V)_m = (s⁻_V)_m − τ_mem^{-1} w_nm τ^post    (21)
             = (s⁻_V)_m + (1 / (τ_mem v̇⁻_n)) w_nm (s⁻_V)_n.    (22)
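As a concrete illustration of eqs. (14) and (15), the following Python snippet checks the implicit-function-theorem expression for the spike-time derivative against central differences for a single LIF neuron that receives one input spike of weight w at t = 0 (closed-form PSP; all parameter values are illustrative assumptions).

import numpy as np
from scipy.optimize import brentq

tau_mem, tau_syn, theta = 20e-3, 5e-3, 1.0

def v(t, w):
    """Membrane potential for a single input spike of weight w at t = 0."""
    a = w * tau_syn / (tau_mem - tau_syn)
    return a * (np.exp(-t / tau_mem) - np.exp(-t / tau_syn))

def vdot(t, w):
    a = w * tau_syn / (tau_mem - tau_syn)
    return a * (np.exp(-t / tau_syn) / tau_syn - np.exp(-t / tau_mem) / tau_mem)

def spike_time(w):
    """First threshold crossing, located by root bracketing as in eq. (13)."""
    return brentq(lambda t: v(t, w) - theta, 1e-6, 9e-3)

w = 8.0
t_post = spike_time(w)
# eqs. (14)/(15): dt_post/dw = -(dv/dw) / vdot at the crossing; here dv/dw = v/w = theta/w
dt_dw_ift = -(theta / w) / vdot(t_post, w)
dt_dw_num = (spike_time(w + 1e-4) - spike_time(w - 1e-4)) / 2e-4
print(dt_dw_ift, dt_dw_num)   # the two estimates agree up to discretization error

When the membrane potential becomes tangent to the threshold, vdot at the crossing approaches zero and both estimates diverge; this is the behavior at critical parameters discussed below.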
Synaptic Current Transition

The spiking neuron n causes the synaptic current of all neurons m ≠ n to jump by the corresponding weight w_nm. We therefore have

    (I⁺)_m = (I⁻)_m + w_nm.    (23)

By differentiation, this relation implies the consistency equations for the sensitivities s_I with respect to the considered weight w_ij,

    (s⁺_I)_m + (İ⁺)_m τ^post = (s⁻_I)_m + (İ⁻)_m τ^post + δ_in δ_jm,    (24)

where δ_ij is the Kronecker delta. Because

    τ_syn (İ⁺)_m = τ_syn (İ⁻)_m − w_nm,    (25)

we get with eq. (14)

    (s⁺_I)_m = (s⁻_I)_m + τ_syn^{-1} w_nm τ^post + δ_in δ_jm    (26)
             = (s⁻_I)_m − (1 / (τ_syn v̇⁻_n)) w_nm (s⁻_V)_n + δ_in δ_jm.    (27)

With (I⁺)_n = (I⁻)_n and (İ⁺)_n = (İ⁻)_n, we have

    (s⁺_I)_n = (s⁻_I)_n.    (28)

Using the parameter sensitivity relations from eqs. (15), (18), (22), (27) and (28) in the transition equation eq. (12), we now derive relations between the adjoint variables. Collecting terms in the sensitivities and writing the index of the spiking neuron for the kth spike as n(k), we have

    ξ_k = [ Σ_{m ≠ n(k)} ( τ_mem (λ⁻_V − λ⁺_V)_m (s⁻_V)_m + τ_syn (λ⁻_I − λ⁺_I)_m (s⁻_I)_m − τ_syn δ_{in(k)} δ_{jm} (λ⁺_I)_m )
        + (s⁻_V)_{n(k)} ( τ_mem ( λ⁻_V − (v̇⁺_{n(k)} / v̇⁻_{n(k)}) λ⁺_V )_{n(k)} + (1 / v̇⁻_{n(k)}) ( Σ_{m ≠ n(k)} w_{n(k)m} (λ⁺_I − λ⁺_V)_m − ∂l_p/∂t^post_k + l⁺_V − l⁻_V ) )
        + τ_syn (λ⁻_I − λ⁺_I)_{n(k)} (s⁻_I)_{n(k)} ] |_{t^post_k}.    (29)

This form dictates the jumps of the adjoint variables for the spiking neuron n and all other, silent neurons m,

    (λ⁻_V)_n = (v̇⁺_n / v̇⁻_n) (λ⁺_V)_n + (1 / (τ_mem v̇⁻_n)) [ Σ_{m ≠ n} w_nm (λ⁺_V − λ⁺_I)_m + ∂l_p/∂t^post_k + l⁻_V − l⁺_V ],    (30a)
    (λ⁻_V)_m = (λ⁺_V)_m,    (30b)
    λ⁻_I = λ⁺_I.    (30c)

With these jumps, the gradient reduces to

    dL/dw_ij = −τ_syn Σ_{k=1}^{N_post} δ_{in(k)} (λ_I)_j    (31)
             = −τ_syn Σ_{spikes from i} (λ_I)_j.    (32)

Summary

The free adjoint dynamics between spikes are given by eq. (10) while spikes cause jumps given by eq. (30). The gradient for a given weight samples the post-synaptic neuron's λ_I when spikes are transmitted across the corresponding synapse (eq. (31)). Since we can identify

    v̇⁺_n / v̇⁻_n = (v̇⁺_n − v̇⁻_n) / v̇⁻_n + 1 = ϑ / (τ_mem v̇⁻_n) + 1,    (33)

the derived solution is equivalent to eq. (2) and table 2.

Fixed Input Spikes
If a given neuron i is subjected to a fixed pre-synaptic spike train across a synapse with weight w_input, the transition times are fixed and the adjoint variables do not experience jumps. The gradient simply samples the neuron's λ_I at the times of spike arrival,

    dL/dw_input = −τ_syn Σ_{input spikes} (λ_I)_i.    (34)

Coincident Spikes
The derivation above assumes that only a single neuron of the recurrent network spikes at a given t^post_k. In general, coincident spikes may occur. If neurons a and b spike at the same time and the times of their respective threshold crossings vary independently as a function of w_ij, the derivation above still holds, with both neurons' λ_V experiencing a jump as in eq. (30a).
Figure 3: We used EventProp and a time-to-first-spike loss function to train a two-layer leaky integrate-and-fire network on the Yin-Yang dataset. A: Illustration of the two-dimensional dataset (adapted from Göltz et al. (2019)). The three different classes are shown in red, green and blue. This dataset is encoded using spike time latencies (see D). B, C: Training results in terms of validation error and loss averaged over different random seeds (individual traces shown as grey lines). D: Data points (x, y) are transformed into (x, 1 − x, y, 1 − y) and encoded using spike time latencies. We add a fixed spike at time t_bias. E: Spike time latencies Δt of the three output neurons (encoding the blue, red or green class) after training, for all samples in the validation set and a specific random seed. Latencies are relative to the first spike among the three neurons and given in units of t_max − t_min. A latency of zero (bright yellow dots) implies that the corresponding neuron has fired the first spike, determining the class assignment. Missing spikes are denoted using black crosses.

We demonstrate learning using EventProp using an event-based simulator. The simulator uses an event queue and root-bracketing to compute post-synaptic spike times in the forward pass and backpropagates errors by attaching error signals to events in the backward pass. The gradients computed in this way are used for stochastic gradient descent via the Adam optimizer (Kingma and Ba, 2014). We use a two-layer neural network and encode the Yin-Yang dataset (Kriener, 2020; fig. 3 A) using input spike latencies. This dataset is non-linearly separable, with a shallow classifier achieving only limited accuracy, and it therefore requires a hidden layer and backpropagation of errors for reasonable classification accuracy. Consider that, in contrast, the widely used MNIST dataset can be classified with high accuracy using a linear classifier (Lecun et al., 1998).

Each two-dimensional data point of the dataset (x, y) is transformed into (x, 1 − x, y, 1 − y) and encoded using spike latencies in the interval [t_min, t_max] (see fig. 3 D). We added a fixed bias spike at time t_bias for a total of five input spikes per data point. The resulting spike patterns are used as input to a two-layer network composed of leaky integrate-and-fire neurons. The output layer consists of three neurons that each encode one of the three classes, with each data point being assigned the class of the neuron that fired the earliest spike.

In analogy to Göltz et al. (2019), we use a cross-entropy loss defined using the first output spike times per neuron,

    L(t^post, l) = − log [ exp(−t^post_l / (ξ τ_syn)) / Σ_k exp(−t^post_k / (ξ τ_syn)) ] + α [ exp(t^post_l / (β τ_syn)) − 1 ],    (35)

where t^post_k is the first spike time of neuron k, l is the index of the correct label and the second term is a regularization term that encourages early spiking of the label neuron. If the activity of one of the two layers fell below a certain threshold for a given minibatch, we incremented all weights by a constant amount Δw_bump.

Training results are shown in fig. 3. All parameters used are listed in appendix A.
After training, the validation accuracy was approximately 95 % (± 0.5 %; mean and standard deviation over different random seeds). This is comparable to the results shown in Göltz et al. (2019), who report approximately 95 % (± 0.7 %) accuracy with a similar architecture.
Note that the results shown here can likely be improved using systematic hyperparameter tuning.
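To make the latency encoding and the loss in eq. (35) concrete, the following Python sketch maps a data point to the five input spike times and evaluates the first-spike cross-entropy with its regularizer. The exact mapping onto [t_min, t_max] and the values of ξ, α and β are illustrative assumptions, not the values used in the experiments.

import numpy as np

t_min, t_max, t_bias, tau_syn = 10e-3, 40e-3, 20e-3, 5e-3

def encode(x, y):
    """Map (x, y) in [0, 1]^2 to five input spike times: (x, 1 - x, y, 1 - y) as
    latencies in [t_min, t_max] plus a fixed bias spike at t_bias (fig. 3 D)."""
    features = np.array([x, 1.0 - x, y, 1.0 - y])
    return np.append(t_min + (t_max - t_min) * features, t_bias)

def loss(t_post, label, xi=1.0, alpha=1e-2, beta=2.0):
    """Eq. (35): cross-entropy over negated, scaled first spike times plus a term
    that pushes the label neuron towards early spiking."""
    logits = -t_post / (xi * tau_syn)
    cross_entropy = -(logits[label] - np.log(np.sum(np.exp(logits))))
    regularizer = alpha * (np.exp(t_post[label] / (beta * tau_syn)) - 1.0)
    return cross_entropy + regularizer

print(encode(0.3, 0.8))
print(loss(np.array([18e-3, 25e-3, 32e-3]), label=0))

Dividing the spike times by ξ τ_syn sets the temperature of the softmax over first-spike times, while the α-term penalizes late spiking of the label neuron.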
Discussion

We have derived, and provided an algorithm (EventProp) to compute, the gradient of an arbitrary loss function for a spiking neural network composed of leaky integrate-and-fire neurons. For the first time, the parameter-dependent spike discontinuities were treated in a well-defined manner using adjoint sensitivity analysis, without any approximations or smoothing operations. EventProp uses the resulting adjoint spiking network to backpropagate errors in order to compute the exact gradient. Its forward pass requires computing the spike times of pre-synaptic neurons that transmit spikes to post-synaptic neurons, while the backward pass backpropagates errors at these spike times using the reverse path (i.e., from post-synaptic to pre-synaptic neurons). The rigorous treatment of spike discontinuities as well as an event-based computation of the exact gradient represent a significant conceptual advance in the study of gradient-based learning methods for spiking neural networks.

An apparent issue with gradient-descent based learning in the context of spiking networks is that the gradient diverges at the critical points in parameter space (note the v̇⁻ term in the jump given in table 2; this term diverges as the membrane potential becomes tangent to the threshold and we have v̇ → 0). Indeed, this is a known issue in the broader context of optimal control of dynamical systems with parameter-dependent state transitions (Barton and Lee, 2002; Galán et al., 1999). While this divergence can be mitigated using gradient clipping in practice, commonly considered loss functions lead to learning dynamics that are ignorant with respect to these critical points and are therefore unable to selectively recruit additional spikes or dismiss existing spikes. It is therefore plausible that surrogate gradient methods which continuously transmit errors across spiking neurons represent a form of implicit regularization. Neftci et al. (2019) report that the surrogate gradient approximates the true gradient in a minimalistic binary classification task while at the same time remaining finite and continuous along an interpolation path in weight space. Hybrid algorithms that combine the exact gradient with explicit regularization techniques could be a direction for future research and provide more principled learning algorithms as compared to ad-hoc replacements of threshold functions.

This work is based on the widely used leaky integrate-and-fire neuron model. Adjoint equations for models with fixed refractory periods, adaptive thresholds, synaptic plasticity or for multi-compartment neuron models can be derived in an analogous way (Pehle, 2020). While the absence of explicit solutions to the resulting differential equations can prevent the use of typical event-based simulation approaches such as the one used in this work, such extensions can significantly enhance the computational capabilities of spiking networks. For example, Bellec et al. (2018) use adaptive thresholds to implement LSTM-like memory cells in a recurrent spiking neural network.

Neuromorphic hardware is an increasingly active research subject (e.g., Aamir et al., 2018; Davies et al., 2018; Furber et al., 2014; Neckar et al., 2019; Moradi et al., 2018; Merolla et al., 2014; Pei et al., 2019; Billaudelle et al., 2019; Feldmann et al., 2019; Boybat et al., 2017; Wunderlich et al., 2019) and implementing EventProp on such hardware is a natural consideration.
The adjoint dynamics as given in table 2 represent a type of spiking neural network which, instead of spiking dynamically, transmits errors at fixed times t^post that are scaled with factors involving v̇⁻ retained from the forward pass. Therefore, a neuromorphic implementation could store spike times and scaling factors locally at each neuron, where they could be combined with the dynamic error signal (λ_V − λ_I in table 2) in the backward pass. This requires the possibility to read out neuronal state variables (membrane potential, synaptic current) both in the forward and backward pass. Transmitting error signals across the network in a neuromorphic system requires similar considerations as in distributed event-based simulators. The error signals associated with each spike could be propagated using message passing with gather/scatter primitives. As mentioned above, EventProp can be extended to multi-compartment neuron models as used in a recent neuromorphic architecture (Schemmel et al., 2017).

We have demonstrated learning using EventProp using a two-layer feed-forward architecture and a simple non-linearly separable dataset. The algorithm can, however, compute the gradient for arbitrary recurrent or convolutional architectures (Pehle, 2020). Its computational and spatial complexity scales linearly with network size, analogous to backpropagation in non-spiking artificial neural networks. The performance in more complex tasks therefore hinges on the general efficacy of gradient descent in spiking networks. As mentioned above, naive gradient descent using loss functions defined in terms of spike times or membrane potentials ignores the presence of critical parameters where spikes appear or disappear. We suggest that studying regularization techniques which deal with this fundamental issue in a targeted manner could enable powerful learning algorithms for spiking networks. By providing a theoretical foundation for backpropagation in spiking networks, we have laid the groundwork for future research that combines such regularization techniques with the computation of exact gradients.

Contributions
CP conceived of the presented idea and outlined the application of sensitivity analysis to spiking neuron models. CP and TW derived the adjoint equations for LIF neurons and the resulting EventProp algorithm. TW implemented the event-based simulation code. TW conceived and implemented the example experiment and analysed its results. CP and TW wrote and edited the manuscript.
Acknowledgements
We would like to express our gratitude to Eric Müller and Johannes Schemmel for discussions, continued support and encouragement during the preparation of this work. We thank Laura Kriener and Julian Göltz for their support regarding time-to-first-spike experiments and helpful discussions. We thank Korbinian Schreiber and Mihai Petrovici for helpful discussions.
Funding
The research has received funding from the EC Horizon 2020 Framework Programme under Grant Agreements 785907 and 945539 (HBP).
References
S. A. Aamir, Y. Stradmann, P. Müller, C. Pehle, A. Hartel, A. Grübl, J. Schemmel, and K. Meier. An accelerated LIF neuronal network array for a large-scale mixed-signal neuromorphic architecture. IEEE Transactions on Circuits and Systems I: Regular Papers, 65(12):4299–4312, 2018.

P. I. Barton and C. K. Lee. Modeling, simulation, sensitivity analysis, and optimization of hybrid systems. ACM Trans. Model. Comput. Simul., 12(4):256–289, Oct. 2002. doi: 10.1145/643120.643122.

A. J. Bell and L. C. Parra. Maximising sensitivity in a spiking network. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 121–128. MIT Press, 2005. URL http://papers.nips.cc/paper/2674-maximising-sensitivity-in-a-spiking-network.pdf.

G. Bellec, D. Salaj, A. Subramoney, R. Legenstein, and W. Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems, pages 787–797, 2018.

S. Billaudelle, Y. Stradmann, K. Schreiber, B. Cramer, A. Baumbach, D. Dold, J. Göltz, A. F. Kungl, T. C. Wunderlich, A. Hartel, E. Müller, O. Breitwieser, C. Mauch, M. Kleider, A. Grübl, D. Stöckel, C. Pehle, A. Heimbrecht, P. Spilger, G. Kiene, V. Karasenko, W. Senn, M. A. Petrovici, J. Schemmel, and K. Meier. Versatile emulation of spiking neural networks on an accelerated neuromorphic substrate, 2019.

S. M. Bohte, J. N. Kok, and J. A. La Poutré. SpikeProp: backpropagation for networks of spiking neurons. In ESANN, volume 48, pages 17–37, 2000.

O. Booij and H. tat Nguyen. A gradient descent rule for spiking neurons emitting multiple spikes. Information Processing Letters, 95(6):552–558, 2005. doi: 10.1016/j.ipl.2005.05.023.

I. Boybat, M. Gallo, N. S.R., T. Moraitis, T. Parnell, T. Tuma, B. Rajendran, Y. Leblebici, A. Sebastian, and E. Eleftheriou. Neuromorphic computing with multi-memristive synapses. Nature Communications, 9, 2017. doi: 10.1038/s41467-018-04933-y.

A. M. Bradley. PDE-constrained optimization and the adjoint method, 2019. URL https://cs.stanford.edu/~ambrad/adjoint_tutorial.pdf.

T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, pages 6571–6583, 2018.

I. M. Comsa, K. Potempa, L. Versari, T. Fischbacher, A. Gesmundo, and J. Alakuijala. Temporal coding in spiking neural networks with alpha synaptic function. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8529–8533, 2020.

M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. Weng, A. Wild, Y. Yang, and H. Wang. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, 2018.

W. De Backer. Jump conditions for sensitivity coefficients. IFAC Proceedings Volumes, 1(3):168–175, 1964. doi: 10.1016/S1474-6670(17)69603-4. International Symposium on Sensitivity Methods in Control Theory, Dubrovnik, Yugoslavia, August 31-September 5, 1964.

S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner, and D. S. Modha. Convolutional networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of Sciences, 113(41):11441–11446, 2016. doi: 10.1073/pnas.1604850113.

J. Feldmann, N. Youngblood, C. Wright, H. Bhaskaran, and W. Pernice. All-optical spiking neurosynaptic networks with self-learning capabilities. Nature, 569:208–214, 2019. doi: 10.1038/s41586-019-1157-8.

R. V. Florian. The chronotron: A neuron that learns to fire temporally precise spike patterns. PLOS ONE, 7(8):1–27, 2012. doi: 10.1371/journal.pone.0040233.

S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana. The SpiNNaker project. Proceedings of the IEEE, 102(5):652–665, 2014.

S. Galán, W. F. Feehery, and P. I. Barton. Parametric sensitivity functions for hybrid discrete/continuous systems. Applied Numerical Mathematics, 31(1):17–47, 1999. doi: 10.1016/S0168-9274(98)00125-1.

W. Gerstner and W. Kistler. Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002. ISBN 9780521890793.

T. H. Gronwall. Note on the derivatives with respect to a parameter of the solutions of a system of differential equations. Annals of Mathematics, 20(4):292–296, 1919.

R. Gütig. Spiking neurons can discover predictive features by aggregate-label learning. Science, 351(6277), 2016. doi: 10.1126/science.aab4113.

R. Gütig and H. Sompolinsky. The tempotron: a neuron that learns spike timing–based decisions. Nature Neuroscience, 9(3):420–428, 2006. doi: 10.1038/nn1643.

J. Göltz, L. Kriener, A. Baumbach, S. Billaudelle, O. Breitwieser, B. Cramer, D. Dold, A. F. Kungl, W. Senn, J. Schemmel, K. Meier, and M. A. Petrovici. Fast and deep: energy-efficient neuromorphic learning with first-spike times, 2019.

D. Huh and T. J. Sejnowski. Gradient descent for spiking neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1433–1443. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/7417-gradient-descent-for-spiking-neural-networks.pdf.

J. Jia and A. R. Benson. Neural jump stochastic differential equations. In Advances in Neural Information Processing Systems, pages 9843–9854, 2019. URL http://papers.nips.cc/paper/9177-neural-jump-stochastic-differential-equations.pdf.

S. R. Kheradpisheh and T. Masquelier. Temporal backpropagation for spiking neural networks with one spike per neuron. International Journal of Neural Systems, 30(06):2050027, 2020. doi: 10.1142/S0129065720500276.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2014.

L. Kriener. Yin-Yang dataset. https://github.com/lkriener/yin_yang_data_set, 2020.

Y. LeCun, D. Touresky, G. Hinton, and T. Sejnowski. A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School, volume 1, pages 21–28, 1988.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

W. Maass. Networks of spiking neurons: The third generation of neural network models. Neural Networks, 10(9):1659–1671, 1997. doi: 10.1016/S0893-6080(97)00011-7.

P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura, et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 345(6197):668–673, 2014.

S. Moradi, N. Qiao, F. Stefanini, and G. Indiveri. A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs). IEEE Transactions on Biomedical Circuits and Systems, 12(1):106–122, 2018.

H. Mostafa. Supervised learning based on temporal coding in spiking neural networks. IEEE Transactions on Neural Networks and Learning Systems, pages 1–9, 2017. doi: 10.1109/TNNLS.2017.2726060.

A. Neckar, S. Fok, B. V. Benjamin, T. C. Stewart, N. N. Oza, A. R. Voelker, C. Eliasmith, R. Manohar, and K. Boahen. Braindrop: A mixed-signal neuromorphic architecture with a dynamical systems-based programming model. Proceedings of the IEEE, 107(1):144–164, 2019.

E. O. Neftci, H. Mostafa, and F. Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. IEEE Signal Processing Magazine, 36(6):51–63, 2019.

C. Pehle. Adjoint equations of spiking neural networks. In preparation, 2020.

J. Pei, L. Deng, S. Song, M. Zhao, Y. Zhang, S. Wu, G. Wang, Z. Zou, Z. Wu, W. He, F. Chen, N. Deng, S. Wu, Y. Wang, Y. Wu, Z. Yang, C. Ma, G. Li, W. Han, and L. Shi. Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature, 572:106, 2019. doi: 10.1038/s41586-019-1424-8.

M. Pfeiffer and T. Pfeil. Deep learning with spiking neurons: Opportunities and challenges. Frontiers in Neuroscience, 12:774, 2018. doi: 10.3389/fnins.2018.00774.

L. S. Pontryagin. Mathematical Theory of Optimal Processes. Routledge, 1962.

K. Roy, A. Jaiswal, and P. Panda. Towards spike-based machine intelligence with neuromorphic computing. Nature, 575:607–617, 2019. doi: 10.1038/s41586-019-1677-2.

E. Rozenvasser. General sensitivity equations of discontinuous systems. Automatika i telemekhanika, 3:52–56, 1967.

E. Rozenwasser and R. Yusupov. Sensitivity of Automatic Control Systems. Control Series. CRC Press, 2019. ISBN 9781420049749. doi: 10.1201/9781420049749.

J. Schemmel, L. Kriener, P. Müller, and K. Meier. An accelerated analog neuromorphic hardware system emulating NMDA- and calcium-based non-linear dendrites. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 2217–2226. IEEE, 2017.

R. Serban and A. Recuero. Sensitivity analysis for hybrid systems and systems with memory. Journal of Computational and Nonlinear Dynamics, 14(9), 2019. doi: 10.1115/1.4044028.

S. B. Shrestha and G. Orchard. SLAYER: Spike layer error reassignment in time. In Advances in Neural Information Processing Systems, pages 1412–1421, 2018.

A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier, and A. Maida. Deep learning in spiking neural networks. Neural Networks, 111:47–63, 2019. doi: 10.1016/j.neunet.2018.12.002.

T. Wunderlich, A. F. Kungl, E. Müller, A. Hartel, Y. Stradmann, S. A. Aamir, A. Grübl, A. Heimbrecht, K. Schreiber, D. Stöckel, et al. Demonstrating advantages of neuromorphic computation: A pilot study. Frontiers in Neuroscience, 13, 2019. doi: 10.3389/fnins.2019.00260.

Y. Xu, X. Zeng, L. Han, and J. Yang. A supervised multi-spike learning algorithm based on gradient descent for spiking neural networks. Neural Networks, 43:99–113, 2013. doi: 10.1016/j.neunet.2013.02.003.

W. Yang, D. Yang, and Y. Fan. A proof of a key formula in the error-backpropagation learning algorithm for multiple spiking neural networks. In Z. Zeng, Y. Li, and I. King, editors, Advances in Neural Networks – ISNN 2014, pages 19–26, Cham, 2014. Springer International Publishing. ISBN 978-3-319-12436-0.

F. Zenke and S. Ganguli. SuperSpike: Supervised learning in multilayer spiking neural networks. Neural Computation, 30(6):1514–1541, 2018.
A Simulation Parameters
Symbol    Description                                          Value
τ_mem     Membrane Time Constant                               20 ms
τ_syn     Synaptic Time Constant
ϑ         Threshold
          Input Size
          Hidden Size
          Output Size
t_bias    Bias Time                                            20 ms
t_min     Minimum Time                                         10 ms
t_max     Maximum Time                                         40 ms
          Hidden Weights Mean
          Hidden Weights Standard Deviation
          Output Weights Mean
          Output Weights Standard Deviation
          Minibatch Size
          Optimizer                                            Adam
β_1       Adam Parameter
β_2       Adam Parameter
ε         Adam Parameter
η         Learning Rate
          Allowed Ratio of Missing Spikes in Hidden Layer
          Allowed Ratio of Missing Spikes in Output Layer
w_bump    Weight Bump Value
α         Regularization Factor
ξ         First Time Constant Factor
β         Second Time Constant Factor                          2