Unifying Activation- and Timing-based Learning Rules for Spiking Neural Networks
Jinseok Kim Kyungsu Kim Jae-Joon Kim
Department of Creative IT Engineering, Pohang University of Science and Technology
{jinseok.kim, kyungsu.kim, jaejoon}@postech.ac.kr
Abstract
For the gradient computation across the time domain in Spiking Neural Network (SNN) training, two different approaches have been studied independently. The first is to compute the gradients with respect to the change in spike activation (activation-based methods), and the second is to compute the gradients with respect to the change in spike timing (timing-based methods). In this work, we present a comparative study of the two methods and propose a new supervised learning method that combines them. The proposed method utilizes each individual spike more effectively by shifting spike timings as in the timing-based methods as well as by generating and removing spikes as in the activation-based methods. Experimental results showed that the proposed method achieves higher performance in terms of both accuracy and efficiency than the previous approaches.
1 Introduction

Spiking neural networks (SNNs) have been studied not only for their biological plausibility but also for the computational efficiency that stems from information processing with binary spikes [1]. One of the unique characteristics of SNNs is that the states of the neurons at different time steps are closely related to each other. This may resemble the temporal dependency in recurrent neural networks (RNNs), but in SNNs direct influences between neurons occur only through the binary spikes. Since the true derivative of the binary activation function, or thresholding function, is zero almost everywhere, SNNs pose an additional challenge for precise gradient computation unless the binary activation function is replaced by an alternative as in [2].

Due to the difficulty of training SNNs, some recent studies employed parameters trained in non-spiking NNs in SNNs. However, this approach is only feasible by exploiting the similarity between rate-coded SNNs and non-spiking NNs [3, 4] or by abandoning several features of spiking neurons to maximize the similarity between SNNs and non-spiking NNs [5–7]. The unique characteristics of SNNs that enable efficient information processing can only be utilized with dedicated learning methods for SNNs. In this context, several studies have reported promising results with gradient-based supervised learning methods that take account of those characteristics [8–12].

Previous works on gradient-based supervised learning for SNNs can be classified into two categories. The methods in the first category work around the non-differentiability of the spiking function with a surrogate derivative [13] and compute the gradients with respect to the spike activation [10–12]. The methods in the second category focus on the timings of existing spikes and compute the gradients with respect to the spike timing [8, 9, 14]. We call these the activation-based methods and the timing-based methods, respectively.
Until now, the two approaches have been thought irrelevant to each other and studied independently.

The problem with previous works is that both approaches have limitations in computing accurate gradients, which become more problematic when the spike density is low. The computational cost of an SNN is known to be proportional to the number of spikes, or the firing rates [6, 15, 16]. To make the best use of the computational power of SNNs and use them more efficiently than their non-spiking counterparts, it is important to reduce the number of spikes required for inference. If there are only a few spikes in the network, the network becomes more sensitive to the change in the state of each individual spike, such as the generation of a new spike, the removal of an existing spike, or the shift of an existing spike. Training SNNs with fewer spikes requires the learning method to be aware of those changes through gradient computation.

In this work, we investigated the relationship between the activation-based methods and the timing-based methods for supervised learning in SNNs. We observed that the two approaches are complementary when considering the change in the state of individual spikes. We then devised a new learning method, called the activation- and timing-based learning rule (ANTLR), that enables more precise gradient computation by combining the two methods. In experiments with a random spike-train matching task and widely used benchmarks (MNIST and N-MNIST), our method achieved higher accuracy than previous works when the networks are forced to use fewer spikes in training.

Preprint. Under review.

[Figure 1: Computational graphs representing (a) the RNN-like description and (b) the SRM-based description of our SNN model. Black solid arrows represent accumulation and decaying, black dashed arrows represent synaptic integration, red solid arrows represent the spiking function, and red dashed arrows represent reset paths.]

2 Background

2.1 Neuron model

We used a discrete-time version of a leaky integrate-and-fire (LIF) neuron with the current-based synapse model. The neuronal states of postsynaptic neuron j are formulated as

V_j[t] = α_V (1 − S_j[t−1]) V_j[t−1] + β_V I_j[t] + β_bias V_bias,j    (1)
I_j[t] = α_I (1 − S_j[t−1]) I_j[t−1] + β_I Σ_i w_{i,j} S_i[t]    (2)
S_j[t] = Θ(V_j[t]) = 1 if V_j[t] ≥ θ, and 0 otherwise    (3)

where V_j[t] is a membrane potential, I_j[t] is a synaptic current, and S_j[t] is a binary spike activation. w_{i,j} is a synaptic weight from presynaptic neuron i, and V_bias,j is a trainable bias parameter. Θ and θ are the spiking function and the threshold, respectively. α_V and α_I are the decay coefficients for the potential and the current, and β_V, β_I, and β_bias are scale coefficients. We call this type of description the RNN-like description, since the temporal dependency between variables resembles that in RNNs [13] (Figure 1a). The term (1 − S_j[t−1]) was introduced in V_j[t] and I_j[t] to reset both the potential and the synaptic current. Note that this model can express various types of commonly used neuron models by changing the decay coefficients (Figure A1 in Appendix A).

The same neuron model can also be formulated using the spike response kernel ε[τ] = β_I β_V Σ_{k=0}^{τ} α_I^k α_V^{τ−k} as

V_j[t] = Σ_i Σ_{t̂_i ∈ T_{i,j,t}} w_{i,j} ε[t − t̂_i] = Σ_{τ=t̂_j^last[t]+1}^{t} Σ_i w_{i,j} ε[t − τ] S_i[τ]    (4)
S_j[t] = Θ(V_j[t])    (5)

where t̂_i is a spike timing of neuron i, T_{i,j,t} = {τ | t̂_j^last[t] < τ ≤ t, S_i[τ] = 1}, and t̂_j^last[t] is the last spike timing of neuron j before t. We call this type of description the SRM-based description, as it is in the form of the Spike Response Model (SRM) [17] (Figure 1b). Detailed explanations of the equivalence of the two descriptions are given in Appendix B.

[Figure 2: Various types of back-propagation derived from different descriptions: (a) activation-based, RNN-like; (b) activation-based, SRM-based; (c) activation-based, RNN-like without reset paths; (d) timing-based, SRM-based.]
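As a concrete (hedged) illustration, the discrete-time update of Equations (1)–(3) can be simulated as below; the coefficient values, the function name, and the single-neuron scope are illustrative choices, not the paper's settings.

```python
def lif_step(V, I, in_spikes, w, S_prev,
             alpha_V=0.9, alpha_I=0.8, beta_V=1.0, beta_I=1.0,
             beta_bias=0.0, V_bias=0.0, theta=1.0):
    """One time step of Eqs. (1)-(3) for a single postsynaptic neuron j."""
    # Eq. (2): the current decays, is reset by the neuron's own previous spike
    # S_prev = S_j[t-1], and integrates the weighted presynaptic spikes.
    I = alpha_I * (1 - S_prev) * I + beta_I * sum(wi * si for wi, si in zip(w, in_spikes))
    # Eq. (1): the potential decays, is reset likewise, and adds the new current.
    V = alpha_V * (1 - S_prev) * V + beta_V * I + beta_bias * V_bias
    # Eq. (3): thresholding produces the binary spike S_j[t].
    S = 1 if V >= theta else 0
    return V, I, S
```

Setting alpha_I = 0, for instance, removes the current dynamics, in line with the remark that different decay coefficients recover different commonly used neuron models.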
2.2 Gradient-based learning methods

2.2.1 Activation-based methods

To back-propagate the gradients to the lower layers, the activation-based methods [2, 10–12] approximate the derivative of the spiking function, which is zero almost everywhere. This is similar to what non-spiking NNs do for quantized activation functions, such as the thresholding function in Binary Neural Networks [18]. The approximated derivative is called the surrogate derivative [13], and we denote it as σ(V[t]) ≈ ∂S[t]/∂V[t].

RNN-like method
Since the forward pass of the RNN-like description of the neuron model resembles that of non-spiking RNNs (Figure 1a), back-propagation can also be treated like Back-Propagation-Through-Time (BPTT) [19] (Figure 2a; the equations are in Appendix C) [2, 11].
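The surrogate derivative σ(V[t]) can take many functional forms. As one hedged example (the fast-sigmoid shape used by SuperSpike [12]; the paper does not commit to this particular choice, and theta and scale below are illustrative):

```python
def surrogate_derivative(v, theta=1.0, scale=10.0):
    # A smooth pseudo-derivative of the step function Theta: peaked at the
    # threshold theta and nonzero everywhere, so gradients can flow through
    # the otherwise zero-derivative spiking nonlinearity.
    return 1.0 / (1.0 + scale * abs(v - theta)) ** 2
```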
SRM-based method
However, from the SRM-based description of the same model (Figure 1b), back-propagation is derived in a slightly different way, using the kernel function ε between each layer (Figure 2b) [10]. From Equation 4, we can obtain the gradient of the membrane potential of the postsynaptic neuron j at an arbitrary time step t_a with respect to the spike activation of the presynaptic neuron i at time step t as

∂V_j[t_a]/∂S_i[t] = w_{i,j} ε[t_a − t] if t > t̂_j^last[t_a] and t_a ≥ t, and 0 otherwise.    (6)

Interestingly, we found that the SRM-based method (Figure 2b) is functionally equivalent to the RNN-like method except that the diagonal reset paths are removed (Figure 2c; see Appendix D for a detailed explanation). In fact, neglecting the reset paths in back-propagation can improve the learning result, as it avoids the accumulation of approximation errors. Via the reset paths (red dashed arrows in Figure 2a), the same gradient value recursively passes through the surrogate derivative (red solid arrows in Figure 2a) as many times as the number of time steps. Even though the approximation error from a single surrogate derivative is tolerable, the accumulated error can be orders of magnitude larger, because the number of time steps is usually in the hundreds. We experimentally observed that propagating gradients via the reset paths significantly degrades training results regardless of the task and network settings. In this regard, we use the SRM-based method rather than the RNN-like method to represent the activation-based methods throughout this paper.

2.2.2 Timing-based methods

The timing-based methods [8, 9, 14] exploit the differentiable relationship between a spike timing t̂ and the membrane potential at that timing, V(t̂). The local linearity assumption on the membrane potential around t̂ leads to ∂t̂_i/∂V_i(t̂_i) = −1/V′_i(t̂_i), where V′(t) is the time derivative of the membrane potential at time t.
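A minimal sketch of the per-spike quantities introduced above: the SRM kernel of Equation 4, the gated activation-based gradient of Equation 6, and the local timing relation ∂t̂/∂V = −1/V′. The decay coefficients and the backward-difference approximation of V′ are illustrative choices, not the paper's exact settings.

```python
def srm_kernel(tau, alpha_V=0.9, alpha_I=0.8, beta_V=1.0, beta_I=1.0):
    """eps[tau] = beta_I * beta_V * sum_{k=0}^{tau} alpha_I^k * alpha_V^(tau-k)."""
    if tau < 0:
        return 0.0
    return beta_I * beta_V * sum(alpha_I ** k * alpha_V ** (tau - k)
                                 for k in range(tau + 1))

def dV_dS(w_ij, t_a, t, t_last_j):
    """Eq. (6): dV_j[t_a]/dS_i[t], zero unless the presynaptic spike falls
    after the last postsynaptic spike and no later than t_a."""
    if t > t_last_j and t_a >= t:
        return w_ij * srm_kernel(t_a - t)
    return 0.0

def dthat_dV(V, t_hat):
    """dt_hat/dV(t_hat) = -1/V'(t_hat), with V' approximated here by a
    backward difference over the membrane-potential trace."""
    return -1.0 / (V[t_hat] - V[t_hat - 1])
```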
In this work, we use an approximated time derivative V*[t] in the discrete time domain, so that ∂t̂_i/∂V_i[t̂_i] ≈ −1/V*_i[t̂_i]. Note that computing the gradient of a spike timing does not require the derivative of the spiking function Θ.

From Equation 4 of the SRM-based description, we can obtain the gradient of the membrane potential of the postsynaptic neuron j at an arbitrary time step t_a with respect to the spike timing t̂_i of the presynaptic neuron i as

∂V_j[t_a]/∂t̂_i = w_{i,j} ∂ε[t_a − t̂_i]/∂t̂_i = w_{i,j} ε*[t_a − t̂_i] if t̂_i > t̂_j^last[t_a] and t_a ≥ t̂_i, and 0 otherwise,    (7)

where ε*[t] is the approximated time derivative of the SRM kernel ε in the discrete time domain. Figure 2d depicts how the timing-based method propagates the gradients. Only at the time steps with spikes is ∂L/∂t̂ propagated to ∂L/∂V, which is then propagated to the lower layer with Equation 7.

3 ANTLR: a unified learning rule

3.1 Complementarity of the two approaches

Calculating the gradients amounts to estimating how much the network output varies when the parameters or the variables are changed. One of the main findings of our study is that the activation-based and timing-based methods are complementary in the way they consider changes in the network.

A change in an SNN can be represented by the generation, the removal, and the shift of spikes. The generation or removal of a spike is expressed as a change of the spike activation S[t] (0 → 1 or 1 → 0). The activation-based methods, which calculate the gradient with respect to the spike activations ∂L/∂S[t], therefore naturally consider generations and removals. On the other hand, the shift of a spike is expressed as a change of the spike timing t̂. The timing-based methods, which calculate the gradient with respect to the spike timings ∂L/∂t̂, easily take account of spike shifts.

The problem with the activation-based methods is that they cannot deal with spike shifts accurately. In terms of spike activations, a spike shift is interpreted as a pair of opposite spike-activation changes with a causal relationship through the reset path (Figure 3).
Because of the major role of the reset path in the spike shift, gradient computation methods based on spike activations cannot consider the shift without precisely computing the gradients related to the reset paths. Unfortunately, as explained in Section 2.2.1, the SRM-based activation-based method does not have a reset path, so it cannot consider the spike shift at all. The RNN-like activation-based method has the reset paths, but it suffers from accuracy loss due to the accumulated errors in the reset paths. Although the shift of an individual spike does not make a huge difference to the whole network when many spikes are generated and removed, it becomes important when there are not many spikes in the network.

The problem with the timing-based methods is that the generation and removal of spikes cannot be described with spike timings. The timing-based methods therefore cannot anticipate a spike-number change in the network, which happens through the generation or removal of spikes. Even though generation and removal happen less often than spike shifts when the parameters are updated by small amounts, their influence on the network is usually more significant.
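The interpretation of a spike shift as a pair of opposite activation changes (Figure 3) can be made concrete on a binary spike train; the helper below is purely illustrative:

```python
def shift_spike_right(S, t):
    """Shift the spike at time t one step later: in activation terms, a
    removal at t (1 -> 0) paired with a generation at t + 1 (0 -> 1)."""
    assert S[t] == 1 and S[t + 1] == 0, "needs a spike at t and silence at t+1"
    S2 = list(S)
    S2[t], S2[t + 1] = 0, 1
    return S2
```

Note that the total spike count is unchanged by such a shift, which is why activation-based gradients that lack accurate reset-path information struggle to see this kind of change.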
[Figure 3: The spike timing shift (1 → 2) can be described using the change in (a) the spike timing or (b) the spike activation. The spike activation change in the earlier time step causes the activation change in the later time step via the reset path (red arrow).]

[Figure 4: Back-propagation in both the activation-based method and the timing-based method can be described using ∂L/∂V of neurons at different time steps and the way they are propagated (black arrows). ANTLR combines the two methods (red arrows and blue arrows) by weighted summation at each stage.]
3.2 Combined gradient computation

To overcome the limitations of previous works, we propose a new back-propagation method for SNNs, called the activation- and timing-based learning rule (ANTLR), that combines the activation-based gradients and the timing-based gradients. The activation-based methods and the timing-based methods back-propagate the gradient through different intermediate gradients, ∂L/∂S and ∂L/∂t̂, respectively. For this reason, the two approaches have been treated as completely different. However, there is another intermediate gradient, ∂L/∂V, calculated in both approaches. ∂L/∂V in the activation-based methods is propagated from ∂L/∂S and carries information about the generation and removal of spikes, whereas ∂L/∂V in the timing-based methods is propagated from ∂L/∂t̂ and carries information about spike shifts.

The main idea of ANTLR is to (1) combine the activation-based gradients ∂L/∂V|_act and the timing-based gradients ∂L/∂V|_tim by taking a weighted sum and (2) propagate the combined gradients ∂L/∂V|_ant (Figure 4). In ANTLR, the gradients are back-propagated to the lower layers as

∂L/∂V_j[t]|_ant = λ_act ∂L/∂V_j[t]|_act + λ_tim ∂L/∂V_j[t]|_tim    (8)
∂L/∂V_i[t]|_act = Σ_j Σ_{t_a} ∂L/∂V_j[t_a]|_ant · ∂V_j[t_a]/∂S_i[t] · ∂S_i[t]/∂V_i[t]    (9)
∂L/∂V_i[t̂_i]|_tim = Σ_j Σ_{t_a} ∂L/∂V_j[t_a]|_ant · ∂V_j[t_a]/∂t̂_i · ∂t̂_i/∂V_i[t̂_i]    (10)

where the last two terms in Equation 9 are calculated using the activation-based method as in Section 2.2.1, and the last two terms in Equation 10 are calculated using the timing-based method as in Section 2.2.2. To train SNNs using ANTLR and the other methods, we implemented CUDA-compatible gradient computation functions in PyTorch [20] (implementation details are
described in Appendix E; the source code will be released later). Note that ANTLR with the setting λ_act = 1, λ_tim = 0 is equivalent to the activation-based method, whereas ANTLR with λ_act = 0, λ_tim = 1 is equivalent to the timing-based method. Therefore, ANTLR can also be regarded as a unified framework that covers the two distinct approaches. In this work, we focused on showing the fundamental benefits of combining them and used the simplest setting λ_act = 1, λ_tim = 1. Proper values of λ_act and λ_tim may depend on the situation, but further studies are needed to precisely understand their influence.

3.3 Loss functions

We used three types of widely used loss functions: count loss, spike-train loss, and latency loss (Table 1). Count loss is defined as the sum of squared errors between the output and target numbers of spikes of each output neuron. Spike-train loss is the sum of squared errors between the filtered output spike-train and the filtered target spike-train. Latency loss is defined as the cross-entropy of the softmax of the negatively weighted first spike timings of the output neurons. Note that the count loss cannot provide a gradient with respect to the spike timing, whereas the latency loss cannot provide a gradient with respect to the spike activation; this makes those loss types inapplicable to certain learning methods. We emphasize that ANTLR can use all three loss types.
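The combination step of Equation 8 is, by itself, a simple per-(neuron, time-step) weighted sum; a minimal sketch (the dictionary layout keyed by (neuron, time step) is purely an illustrative choice):

```python
def antlr_grad_V(grad_act, grad_tim, lam_act=1.0, lam_tim=1.0):
    """Eq. (8): dL/dV|ant = lam_act * dL/dV|act + lam_tim * dL/dV|tim.
    lam_act, lam_tim = 1, 0 recovers the pure activation-based method,
    0, 1 the pure timing-based method; the paper's default is 1, 1."""
    keys = set(grad_act) | set(grad_tim)
    return {k: lam_act * grad_act.get(k, 0.0) + lam_tim * grad_tim.get(k, 0.0)
            for k in keys}
```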
Table 1: Three different types of loss functions and the corresponding activation-based gradient ∂L/∂S_o[t] and timing-based gradient ∂L/∂t̂_o.

| | Count | Spike-train | Latency |
| Loss (L) | Σ_o {(Σ_τ S_o[τ]) − n_o}² / 2T | (1/2) Σ_o Σ_τ d_o[τ]² | −Σ_o y_o log p_o |
| ∂L/∂S_o[t] | {(Σ_τ S_o[τ]) − n_o} / T | Σ_τ κ[τ − t] d_o[τ] | — |
| ∂L/∂t̂_o | — | −Σ_τ κ*[τ − t̂_o] d_o[τ] | −β (p_o − y_o) |
| Compatible with | Activation, ANTLR | Activation, Timing, ANTLR | Timing, ANTLR |

Here o is an index over the output neurons, d_o[τ] = (κ ∗ S_o)[τ] − (κ ∗ S_o^tar)[τ], p_o = e^{−β t̂_o^first} / Σ_x e^{−β t̂_x^first}, κ is an exponential kernel, β is a scaling factor, n_o is a target spike number, and y_o is a target probability.

[Figure 5: (a) True loss landscape; estimated loss landscapes using (b) the activation-based method, (c) the timing-based method, and (d) ANTLR with λ_act, λ_tim = 1, 1; (e) the color scheme used for highlighting. Dim 1 and Dim 2 represent the two dimensions along which we perturbed the network parameters.]

3.4 Gradient visualization

We conducted a simple experiment to visualize the gradients computed by each method. A fully-connected network with two hidden layers, of size 10-50-50-1, was trained to minimize the spike-train loss with three random input spikes for each input neuron and a single target spike for the target neuron. After reaching the global optimum of zero loss, we perturbed all trainable parameters (weights and biases) along the first two principal components of the gradient vectors used in training and measured the true loss (Figure 5a). The lowest point at the center (dark blue region) represents the global minimum, and the subtle loss increase around the center shows the effect of the spike timing shift.
The dramatic increase of the loss in the right corner shows the loss increase from the spike number change. To emphasize the subtle height difference due to the spike timing shift, we highlighted the area adjacent to the global optimum, where the number of spikes does not change, using the color scheme in Figure 5e.

Different learning methods provide different gradient values based on their distinct approaches. Using each method's gradient vector at each parameter point, we visualized the estimated loss landscape using a surface reconstruction method [21, 22] (Figures 5b to 5d). The activation-based method (Figure 5b) clearly captured the steep loss change due to the spike number change, whereas the timing-based method (Figure 5c) could not take account of it. On the other hand, the timing-based method captured the subtle loss change due to the spike timing shift, while the activation-based method showed an almost flat loss landscape in the region without a spike number change. By combining both methods, ANTLR was able to capture both features at the same time (Figure 5d).
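The spike-train loss used in the visualization above can be sketched as follows; the causal exponential kernel and its decay constant are illustrative stand-ins for κ:

```python
import math

def exp_kernel(tau, tau_decay=5.0):
    """Causal exponential filter kappa[tau] (illustrative decay constant)."""
    return math.exp(-tau / tau_decay) if tau >= 0 else 0.0

def spike_train_loss(S_out, S_tar, tau_decay=5.0):
    """Sum of squared differences between the filtered output and target
    spike trains, i.e. the spike-train loss of Table 1 (up to scaling)."""
    T = len(S_out)
    d = [sum(exp_kernel(t - u, tau_decay) * (S_out[u] - S_tar[u])
             for u in range(t + 1)) for t in range(T)]
    return 0.5 * sum(x * x for x in d)
```

Because the kernel smears each spike over time, a small timing shift produces a small, smooth loss change, which is exactly the subtle structure highlighted around the global optimum.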
4 Experiments

We evaluated the practical advantages of ANTLR over the other methods using three different tasks: (1) random spike-train matching, (2) latency-coded MNIST, and (3) N-MNIST. Hyper-parameters for training were grid-searched for each task (detailed experimental settings are in Appendix F). For the timing-based method, we added a no-spike penalty that increases the incoming synaptic weights of neurons without any spike, as in [8].
4.1 Random spike-train matching

Using the same experimental setup as in Section 3.4, except for a varying number of target spikes and a different network size of 10-50-50-5, we measured the training loss of networks trained by the different learning methods (Figure 6).

[Figure 6: Averaged training loss over 100 trials of the random spike-train matching task with three input spikes and (a) a single target spike and (b) three target spikes. Note that the y axis is in logarithmic scale.]

This task was used to see the basic performance of the
learning methods in a situation where each spike significantly affects the training results. During 50000 training iterations, both the activation-based method and ANTLR showed a noticeable decrease in loss, whereas the timing-based method failed to train the network, as it cannot handle the spike number change. ANTLR outperformed the other methods with much faster convergence and lower loss.

[Figure 7: Test accuracy and the required number of hidden and output spikes to classify a single sample on (a) the latency-coded MNIST task and (b) the latency-coded MNIST task with the single-spike restriction. The values in the legend represent the mean and standard deviation of 16 trials; the legend values were partially lost in extraction: (a) Activation (52 ± 4 spikes, 95.70 ± … %), Timing (… ± 22 spikes, 95.96 ± … %), ANTLR (… ± 2 spikes, 97.66 ± … %); (b) Activation (51 ± 2, 95.95 ± …), Timing (… ± 45, 95.17 ± …), ANTLR (… ± 2, 97.63 ± …).]

4.2 Latency-coded MNIST

In this experiment, we applied latency coding to the input data of the MNIST dataset [23], as in [8, 9]: the larger the intensity value of a pixel, the earlier the spike timing of the corresponding input neuron. We used this conversion to reduce the total number of spikes and create a situation where each learning method must take account of the precise spike timing for a better result.

The timing-based method and ANTLR used the latency loss, and the activation-based method used the count loss with a target spike number of 1/0 for correct/wrong labels. We also added a variant of the count loss to the total loss of ANTLR to prevent the target output neuron from being silent. Note that the target spike number for the activation-based method is much smaller than in previous works, since we applied latency coding to the input to reduce the number of input spikes. The output class can be determined either by the output neuron emitting the most spikes (most-spike decision scheme) or by the neuron emitting the earliest spike (earliest-spike decision scheme).
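The latency coding described above can be sketched as a simple linear intensity-to-time mapping; the exact mapping, the T = 100 horizon, and the no-spike convention for zero-intensity pixels are assumptions for illustration, not necessarily the paper's exact conversion:

```python
def latency_encode(pixels, T=100):
    """Map intensities in [0, 1] to spike times: brighter pixels spike
    earlier, and zero-intensity pixels emit no spike (None)."""
    return [None if p <= 0.0 else int(round((1.0 - p) * (T - 1)))
            for p in pixels]
```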
The timing-based method and ANTLR used the earliest-spike decision scheme, whereas the activation-based method used the most-spike decision scheme, considering the loss types they used.

We trained a network of size 784-800-10 over 100 time steps, using a mini-batch size of 16 and a split of 50000/10000 images for the training/validation datasets. The test accuracy and the number of spikes used for each sample are shown in Figure 7a. The number of spikes used to finish a task was usually not reported in previous works, but we include it to demonstrate the efficiency of the networks trained by the different methods. The results show that ANTLR achieved the highest accuracy of all methods. The number of spikes for the timing-based method was exceptionally high, because of the no-spike penalty and its inability to remove existing spikes during training. Figure 7b shows a different scenario, where each neuron is restricted to emit at most one spike, as in [8, 9, 14]. We tested this situation to further reduce the number of spikes. However, this modification did not change the trend of the results, as the number of spikes was already small in the first place.

Note that previous works reported higher accuracy results, but those results were achieved with a large number of spikes. In this study, we focus on the cases in which the networks are forced to use fewer
spikes for high energy efficiency. We believe that such cases represent more desirable environments for the application of SNNs.

[Figure 8: Test accuracy and the required number of hidden and output spikes to classify a single sample on (a) the N-MNIST task and (b) the N-MNIST task with the single-spike restriction. The values in the legend represent the mean and standard deviation of 16 trials; the legend values were partially lost in extraction: (a) Activation (931 ± 77 spikes, 97.32 ± … %), Timing (…), ANTLR (… ± 8 spikes, 96.58 ± … %); (b) Activation (102 ± 20, 92.30 ± …), Timing (… ± 15, 95.79 ± …), ANTLR (… ± 17, 96.02 ± …).]

4.3 N-MNIST

In contrast to the MNIST dataset, which is static, the spiking version of MNIST, called N-MNIST, is a dynamic dataset containing samples of input spikes in a 34x34 spatial domain with two channels along 300 time steps [24]. The same loss and classification settings as in Section 4.2 were used here, except that the target spike number for the activation-based method was increased to 10/0, considering the increased number of input spikes in the N-MNIST dataset. Note that the latency loss and the earliest-spike decision scheme had never been used for the N-MNIST dataset, but we intentionally used them to reduce the number of spikes. We trained a network of size 2x34x34-800-10 using a mini-batch size of 16, and the results are shown in Figure 8a.

Due to the large target spike number, the activation-based method required many more spikes than ANTLR. The timing-based method again used a large number of spikes because of its limitation in removing spikes. We also tested the scenario where the single-spike restriction is applied (Figure 8b). Since the activation-based method had to use a target spike number of 1/0 due to the restriction, its accuracy was degraded, whereas the timing-based method showed improvements in both accuracy and efficiency. This supports the observation that the activation-based method favors the multi-spike situation and the timing-based method favors the single-spike situation.
5 Conclusion

In this work, we presented and compared the characteristics of two existing approaches to gradient-based supervised learning for SNNs and proposed a new learning method, called ANTLR, that combines them. The experimental results on various tasks showed that the proposed method improves the accuracy of the network in situations where the number of spikes is constrained, by precisely considering the influence of individual spikes.

It is known that both temporal coding and rate coding play important roles in information processing in biological neurons [25]. Interestingly, the timing-based methods are closely related to temporal coding, since they explicitly consider spike timings in gradient computation. On the other hand, the activation-based methods are better suited to rate coding, in which changes in spike timing do not carry information. Even though we did not explicitly address the concepts of temporal coding and rate coding in this work, to the best of our knowledge, this is the first work that tries to unify the different learning methods suitable for the different coding schemes.

Some other works not mentioned in this paper have also shown notable results as supervised learning methods for SNNs [26–28], but those methods are classified as neither activation-based nor timing-based. In those methods, a scalar variable mediates the back-propagation from the whole spike-train of a postsynaptic neuron to the whole spike-train of a presynaptic neuron. This variable may be able to capture the current state of the spike-train and its influence on another neuron, but it cannot cope with changes in the spike-train, such as the generation, the removal, or the timing shift of spikes during training.
This limitation may not be problematic with rate coding, in which the change in the state of individual spikes does not make a huge difference, but it is a critical problem when training SNNs with fewer spikes for higher efficiency.

Broader Impact
We believe that a broader impact discussion is not applicable to our work, because our work improves the general supervised learning performance of spiking neural networks and is not tied to a specific application.
Acknowledgments and Disclosure of Funding
This research was supported by the Samsung Research Funding Center of Samsung Electronics under Project Number SRFC-TC1603-51, by the MSIT (Ministry of Science and ICT), Korea, under the ICT Consilience Creative program (IITP-2019-2011-1-00783) supervised by the IITP (Institute for Information & communications Technology Promotion), and by an NRF (National Research Foundation of Korea) Grant funded by the Korean Government (NRF-2016-Global Ph.D. Fellowship Program).
References

[1] W. Maass, "Networks of spiking neurons: the third generation of neural network models," Neural Networks, vol. 10, no. 9, pp. 1659–1671, 1997.
[2] D. Huh and T. J. Sejnowski, "Gradient descent for spiking neural networks," in Advances in Neural Information Processing Systems, pp. 1433–1443, 2018.
[3] P. U. Diehl, D. Neil, J. Binas, M. Cook, S.-C. Liu, and M. Pfeiffer, "Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing," in International Joint Conference on Neural Networks (IJCNN), pp. 1–8, IEEE, 2015.
[4] E. Hunsberger and C. Eliasmith, "Spiking deep networks with LIF neurons," arXiv preprint arXiv:1510.08829, 2015.
[5] S. Park, S. Kim, B. Na, and S. Yoon, "T2FSNN: Deep spiking neural networks with time-to-first-spike coding," arXiv preprint arXiv:2003.11741, 2020.
[6] B. Rueckauer and S.-C. Liu, "Conversion of analog to spiking neural networks using sparse temporal coding," in IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5, IEEE, 2018.
[7] L. Zhang, S. Zhou, T. Zhi, Z. Du, and Y. Chen, "TDSNN: From deep neural networks to deep spike neural networks with temporal-coding," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, pp. 1319–1326, 2019.
[8] I. M. Comsa, K. Potempa, L. Versari, T. Fischbacher, A. Gesmundo, and J. Alakuijala, "Temporal coding in spiking neural networks with alpha synaptic function," arXiv preprint arXiv:1907.13223, 2019.
[9] H. Mostafa, "Supervised learning based on temporal coding in spiking neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 7, pp. 3227–3235, 2017.
[10] S. B. Shrestha and G. Orchard, "SLAYER: Spike layer error reassignment in time," in Advances in Neural Information Processing Systems, pp. 1412–1421, 2018.
[11] Y. Wu, L. Deng, G. Li, J. Zhu, and L. Shi, "Spatio-temporal backpropagation for training high-performance spiking neural networks," Frontiers in Neuroscience, vol. 12, 2018.
[12] F. Zenke and S. Ganguli, "SuperSpike: Supervised learning in multilayer spiking neural networks," Neural Computation, vol. 30, no. 6, pp. 1514–1541, 2018.
[13] E. O. Neftci, H. Mostafa, and F. Zenke, "Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks," IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 51–63, 2019.
[14] S. M. Bohte, J. N. Kok, and H. La Poutre, "Error-backpropagation in temporally encoded networks of spiking neurons," Neurocomputing, vol. 48, no. 1-4, pp. 17–37, 2002.
[15] F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam, et al., "TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
[16] M. Davies, N. Srinivasa, T.-H. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82–99, 2018.
[17] W. Gerstner, "Time structure of the activity in neural network models," Physical Review E, vol. 51, no. 1, p. 738, 1995.
[18] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio, "Binarized neural networks," in Advances in Neural Information Processing Systems, pp. 4107–4115, 2016.
[19] P. J. Werbos, "Backpropagation through time: what it does and how to do it," Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.
[20] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, pp. 8024–8035, Curran Associates, Inc., 2019.
[21] M. Harker and P. O'Leary, "Least squares surface reconstruction from measured gradient fields," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1–7, IEEE, 2008.
[22] C. H. Jordan, "pygrad2surf." https://gitlab.com/chjordan/pyGrad2Surf/, 2017.
[23] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," URL http://yann.lecun.com/exdb/mnist, 1998.
[24] G. Orchard, A. Jayawant, G. K. Cohen, and N. Thakor, "Converting static image datasets to spiking neuromorphic datasets using saccades," Frontiers in Neuroscience, vol. 9, p. 437, 2015.
[25] W. Gerstner, W. M. Kistler, R. Naud, and L. Paninski, Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. Cambridge University Press, 2014.
[26] Y. Jin, W. Zhang, and P. Li, "Hybrid macro/micro level backpropagation for training deep spiking neural networks," in Advances in Neural Information Processing Systems, pp. 7005–7015, 2018.
[27] J. H. Lee, T. Delbruck, and M. Pfeiffer, "Training deep spiking neural networks using backpropagation," Frontiers in Neuroscience, vol. 10, p. 508, 2016.
[28] W. Zhang and P. Li, "Spike-train level backpropagation for training deep recurrent spiking neural networks," in Advances in Neural Information Processing Systems, pp. 7800–7811, 2019.

Appendix

A Versatility of the neuron model
In our neuron model, the shape of the post-synaptic potential induced by a single spike can be varied depending on the decay coefficients $\alpha_V$ and $\alpha_I$. Figure A1 shows example cases of commonly used neuron models that can be implemented with our neuron model.

Figure A1: Various types of neuron models can be expressed by the neuron model we used, including (a) simple IF neuron, (b) LIF neuron without decaying synaptic current, (c) biologically-plausible alpha synaptic function [8, 10], (d) non-leaky neuron with exponential PSP [9], and (e) non-leaky neuron with linear PSP [6].
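The kernel shapes in Figure A1 can be reproduced numerically from the closed-form kernel $\varepsilon[\tau] = \beta_I \beta_V \sum_{k=0}^{\tau} \alpha_I^k \alpha_V^{\tau-k}$ (derived in Appendix B). The sketch below is our own illustration; the function name and parameter values are not from the released code.

```python
import numpy as np

def psp_kernel(alpha_v, alpha_i, beta_v=1.0, beta_i=1.0, T=50):
    """Discrete PSP kernel eps[tau] = beta_I * beta_V * sum_k alpha_I^k * alpha_V^(tau-k)."""
    eps = np.zeros(T)
    for tau in range(T):
        eps[tau] = beta_i * beta_v * sum(
            alpha_i ** k * alpha_v ** (tau - k) for k in range(tau + 1))
    return eps

# (a) alpha_V = alpha_I = 1: no decay, so the PSP grows linearly (IF-like integration)
print(psp_kernel(1.0, 1.0, T=5))     # values 1, 2, 3, 4, 5
# (c) equal decays give an alpha-function-shaped PSP, (tau + 1) * alpha^tau
print(psp_kernel(0.9, 0.9, T=3))
```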
B Functional equivalence of the RNN-like description and the SRM-based description of the model
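The equivalence argued in this appendix can also be checked numerically. The sketch below (our own illustration with arbitrary parameter values) simulates the RNN-like recurrence for a single input spike and no post-synaptic spikes, and compares the resulting potential trace against $w \cdot \varepsilon[\tau]$:

```python
import numpy as np

# RNN-like recurrence (single input spike at t = 0, no post-synaptic spikes):
#   I[t] = alpha_I * I[t-1] + beta_I * w * S_in[t]
#   V[t] = alpha_V * V[t-1] + beta_V * I[t]
alpha_v, alpha_i, beta_v, beta_i, w, T = 0.95, 0.9, 1.0, 1.0, 0.7, 20

I = V = 0.0
V_trace = []
for t in range(T):
    I = alpha_i * I + beta_i * w * (1.0 if t == 0 else 0.0)
    V = alpha_v * V + beta_v * I
    V_trace.append(V)

# SRM kernel eps[t] = beta_I * beta_V * sum_k alpha_I^k * alpha_V^(t-k)
eps = [beta_i * beta_v * sum(alpha_i ** k * alpha_v ** (t - k) for k in range(t + 1))
       for t in range(T)]

assert np.allclose(V_trace, np.array(eps) * w)
print("RNN-like forward pass matches the SRM kernel")
```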
From the RNN-like description of the model (Equations 1 to 3), we can infer that the influence of $S_i[t]$, the spike activation of presynaptic neuron $i$ at time step $t$, on $V_j[t_a]$, the potential of a postsynaptic neuron $j$ at a later time step $t_a > t$, can be transmitted only via $I_j[t]$. $I_j[t]$ then forwards the influence to $I_j[t+1]$ and $V_j[t]$, and so on through the subsequent $I_j$ and $V_j$ values. If there is no spike activation $S_j[x] = 1$ between $t$ and $t_a$ ($t < x < t_a$), this influence can reach $V_j[t_a]$, and by the time it arrives, the contribution of $S_i[t]$ amounts to $w_{i,j}\, \beta_I \beta_V \sum_{k=0}^{t_a - t} \alpha_I^k \alpha_V^{t_a - t - k}$. If there is a spike activation $S_j[x] = 1$ between $t$ and $t_a$ ($t < x < t_a$), the influence cannot be transmitted to $V_j[t_a]$, since $S_j[x]$ cuts off the signals that $I_j[x+1]$ and $V_j[x+1]$ receive.

Expressing this relationship between $S_i[t]$ and $V_j[t_a]$ with the single kernel function $\varepsilon[\tau] = \beta_I \beta_V \sum_{k=0}^{\tau} \alpha_I^k \alpha_V^{\tau - k}$ and the causal set $T_{i,j,t} = \{\tau \mid \hat{t}^{\,\mathrm{last}}_j[t] < \tau \le t,\ S_i[\tau] = 1\}$ yields the SRM-based description (Equations 4 and 5).

C RNN-like activation-based method
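The BPTT-like recursion derived in this appendix (Equations 11 to 17) can be validated with a numerical gradient check. The sketch below is our own illustration for a single neuron with no output spikes, so the reset and surrogate terms vanish and the recursion becomes exact; it is not the authors' implementation.

```python
import numpy as np

# One neuron, no output spikes (S[t] = 0 for all t), loss L = V[T-1].
alpha_v, alpha_i, beta_v, beta_i = 0.9, 0.8, 1.0, 1.0
S_in = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
T = len(S_in)

def forward(w):
    I = V = 0.0
    for t in range(T):
        I = alpha_i * I + beta_i * w * S_in[t]
        V = alpha_v * V + beta_v * I
    return V  # loss L = V[T-1]

# Backward pass: Eqs. (11), (12), (17) with S[t] = 0 everywhere.
dV = np.zeros(T)
dI = np.zeros(T)
dV[T - 1] = 1.0
for t in range(T - 2, -1, -1):
    dV[t] = alpha_v * dV[t + 1]                 # Eq. (11), reset-free case
for t in range(T - 1, -1, -1):
    dI[t] = beta_v * dV[t] + (alpha_i * dI[t + 1] if t < T - 1 else 0.0)  # Eq. (12)
dw = np.sum(dI * beta_i * S_in)                 # Eq. (17)

# Central-difference check
w0, h = 0.5, 1e-6
num = (forward(w0 + h) - forward(w0 - h)) / (2 * h)
assert abs(dw - num) < 1e-6
print("manual BPTT gradient matches the numerical gradient")
```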
From the RNN-like description of the model (Equations 1 to 3), the following BPTT-like back-propagation can be derived:

$$\frac{\partial L}{\partial V_j[t]} = \frac{\partial L}{\partial S_j[t]} \frac{\partial S_j[t]}{\partial V_j[t]} + \frac{\partial L}{\partial V_j[t+1]} \frac{\partial V_j[t+1]}{\partial V_j[t]} \quad (11)$$

$$\frac{\partial L}{\partial I_j[t]} = \frac{\partial L}{\partial V_j[t]} \frac{\partial V_j[t]}{\partial I_j[t]} + \frac{\partial L}{\partial I_j[t+1]} \frac{\partial I_j[t+1]}{\partial I_j[t]} \quad (12)$$

$$\frac{\partial L}{\partial S_j[t]} = \sum_k \frac{\partial L}{\partial I_k[t]} \frac{\partial I_k[t]}{\partial S_j[t]} + \frac{\partial L}{\partial I_j[t+1]} \frac{\partial I_j[t+1]}{\partial S_j[t]} + \frac{\partial L}{\partial V_j[t+1]} \frac{\partial V_j[t+1]}{\partial S_j[t]} \quad (13)$$

$$\frac{\partial S_j[t]}{\partial V_j[t]} = \sigma(V_j[t]), \qquad \frac{\partial V_j[t+1]}{\partial V_j[t]} = \alpha_V (1 - S_j[t]) \quad (14)$$

$$\frac{\partial V_j[t]}{\partial I_j[t]} = \beta_V, \qquad \frac{\partial I_j[t+1]}{\partial I_j[t]} = \alpha_I (1 - S_j[t]) \quad (15)$$

$$\frac{\partial I_k[t]}{\partial S_j[t]} = \beta_I w_{k,j}, \qquad \frac{\partial I_j[t+1]}{\partial S_j[t]} = -\alpha_I I_j[t], \qquad \frac{\partial V_j[t+1]}{\partial S_j[t]} = -\alpha_V V_j[t] \quad (16)$$

which results in the gradients for the parameter update:

$$\frac{\partial L}{\partial w_{i,j}} = \sum_t \left\{ \frac{\partial L}{\partial I_j[t]} \beta_I S_i[t] \right\}, \qquad \frac{\partial L}{\partial V_{\mathrm{bias},j}} = \sum_t \left\{ \frac{\partial L}{\partial V_j[t]} \beta_{\mathrm{bias}} \right\} \quad (17)$$

D Interpreting SRM-based activation-based back-propagation with RNN-like description
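The distinction drawn in this appendix can be made concrete with scalar values: the SRM-based rule (Equation 22) propagates $\partial L / \partial S$ only through the synaptic path, while the RNN-like rule (Equation 13) additionally includes the two reset paths. All numbers below are arbitrary placeholders of our own choosing.

```python
# Arbitrary placeholder values for one neuron j with one downstream neuron k
beta_i, alpha_i, alpha_v = 1.0, 0.9, 0.95
w_kj = 0.5                                 # weight from neuron j to neuron k
dI_k, dI_next, dV_next = 0.2, 0.1, 0.3     # dL/dI_k[t], dL/dI_j[t+1], dL/dV_j[t+1]
I_t, V_t = 0.4, 0.8                        # I_j[t], V_j[t]

dS_srm = dI_k * beta_i * w_kj              # Eq. (22): synaptic path only
dS_rnn = (dS_srm                           # Eq. (13): adds the two reset paths
          - dI_next * alpha_i * I_t
          - dV_next * alpha_v * V_t)

print(dS_rnn - dS_srm)                     # contribution of the reset paths
```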
The forward passes of the RNN-like description and the SRM-based description are functionally equivalent, but the corresponding back-propagation methods derived from them are slightly different. The SRM-based back-propagation can be summarized using the relationship between the potentials as follows:

$$\frac{\partial V_j[t_a]}{\partial V_i[t]} = \begin{cases} w_{i,j}\, \sigma(V_i[t])\, \varepsilon[t_a - t] & \text{if } t > \hat{t}^{\,\mathrm{last}}_j[t_a] \\ 0 & \text{otherwise} \end{cases} \quad (18)$$

where the kernel function is given as $\varepsilon[\tau] = \beta_I \beta_V \sum_{k=0}^{\tau} \alpha_I^k \alpha_V^{\tau-k}$.

Similar to the derivation in Appendix B, the following back-propagation formulas provide the same functionality as the SRM-based back-propagation:

$$\frac{\partial L}{\partial V_j[t]} = \frac{\partial L}{\partial S_j[t]} \frac{\partial S_j[t]}{\partial V_j[t]} \quad (19)$$

$$\frac{\partial L}{\partial V^{\mathrm{dep}}_j[t]} = \frac{\partial L}{\partial V_j[t]} + \frac{\partial L}{\partial V^{\mathrm{dep}}_j[t+1]} \frac{\partial V^{\mathrm{dep}}_j[t+1]}{\partial V^{\mathrm{dep}}_j[t]} \quad (20)$$

$$\frac{\partial L}{\partial I_j[t]} = \frac{\partial L}{\partial V^{\mathrm{dep}}_j[t]} \frac{\partial V^{\mathrm{dep}}_j[t]}{\partial I_j[t]} + \frac{\partial L}{\partial I_j[t+1]} \frac{\partial I_j[t+1]}{\partial I_j[t]} \quad (21)$$

$$\frac{\partial L}{\partial S_j[t]} = \sum_k \frac{\partial L}{\partial I_k[t]} \frac{\partial I_k[t]}{\partial S_j[t]} \quad (22)$$

$$\frac{\partial S_j[t]}{\partial V_j[t]} = \sigma(V_j[t]), \qquad \frac{\partial V^{\mathrm{dep}}_j[t+1]}{\partial V^{\mathrm{dep}}_j[t]} = \alpha_V (1 - S_j[t]) \quad (23)$$

$$\frac{\partial V^{\mathrm{dep}}_j[t]}{\partial I_j[t]} = \beta_V, \qquad \frac{\partial I_j[t+1]}{\partial I_j[t]} = \alpha_I (1 - S_j[t]) \quad (24)$$

$$\frac{\partial I_k[t]}{\partial S_j[t]} = \beta_I w_{k,j} \quad (25)$$

where $V^{\mathrm{dep}}$ is introduced to account for the temporal dependency between the $V[t]$ values of the same neuron at different time steps.

These formulas are almost identical to the RNN-like back-propagation (Equations 11 to 16) except for how $\partial L / \partial S$ is propagated (Equations 13 and 22). The only difference is whether the reset paths (red dashed arrows in Figure 2a, represented as $\frac{\partial I_j[t+1]}{\partial S_j[t]}$ and $\frac{\partial V_j[t+1]}{\partial S_j[t]}$) are considered in back-propagation or not.

E Implementation details of the learning methods
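The surrogate derivative and the approximated time derivatives described below can be written in a few lines. This is an illustrative sketch with placeholder parameter values, not the tuned settings from the tables that follow.

```python
import numpy as np

# Surrogate derivative used by the activation-based method and ANTLR,
# sigma(v) = alpha_sigma * exp(-beta_sigma * |theta - v|) as in [10].
# theta, alpha_sigma, beta_sigma below are placeholder values.
def surrogate_derivative(v, theta=1.0, alpha_sigma=1.0, beta_sigma=2.0):
    return alpha_sigma * np.exp(-beta_sigma * np.abs(theta - v))

# Finite-difference approximations of the time derivatives used by the
# timing-based method and ANTLR.
def v_star(V, tau):                  # V*[tau] = V[tau] - V[tau-1]
    return V[tau] - V[tau - 1]

def eps_star(eps, tau):              # eps*[tau] = (eps[tau+1] - eps[tau-1]) / 2
    return (eps[tau + 1] - eps[tau - 1]) / 2

V = np.array([0.0, 0.4, 0.9, 1.1])
print(surrogate_derivative(1.0))     # peaks at the threshold: 1.0
print(v_star(V, 2))                  # backward difference of the potential
```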
For the activation-based method and ANTLR, we used the surrogate derivative based on the exponential function $\sigma(v) = \alpha_\sigma \exp(-\beta_\sigma |\theta - v|)$ as in [10]. For the timing-based method and ANTLR, the approximated time derivatives $V^*[\tau]$ and $\varepsilon^*[\tau]$ were calculated as $V[\tau] - V[\tau-1]$ and $(\varepsilon[\tau+1] - \varepsilon[\tau-1])/2$, respectively.

Algorithms 1, 2, and 3 show the detailed back-propagation procedures of the activation-based method, the timing-based method, and ANTLR, respectively; $\partial L / \partial X$ is written as $\delta X$ for better readability, and $W^l$ denotes the weight matrix between layer $l$ and layer $l+1$. Note that $\partial L / \partial S[t]$ and $\partial L / \partial \hat{t}[t]$ are calculated according to the loss function used (Table 1). $V^{\mathrm{dep}}$ from Appendix D was used in all methods to reduce the total number of computations by not using $\varepsilon$ explicitly. For the same reason, we did not implement the loop over $\varepsilon^*$ (Algorithms 2 and 3) in the actual implementation and instead used auxiliary variables similar to $V^{\mathrm{dep}}$.

Algorithm 1:
The activation-based back-propagation

for t = T − 1 down to 0 do
  for l = L − 1 down to 0 do
    if l = L − 1 then
      δS^l[t] ← ∂L/∂S_o[t]
    else
      δS^l[t] ← Σ W^l δI^{l+1}[t]
    end
    δV^l[t] ← σ(V^l[t]) δS^l[t]
    δV^l_dep[t] ← δV^l[t] + α_V (1 − S^l[t]) δV^l_dep[t+1]
    δI^l[t] ← β_V δV^l_dep[t] + α_I (1 − S^l[t]) δI^l[t+1]
  end
end

Algorithm 2:
The timing-based back-propagation

for t = T − 1 down to 0 do
  for l = L − 1 down to 0 do
    if l = L − 1 then
      δt̂^l[t] ← ∂L/∂t̂_o[t]
    else
      for τ = −1 to T − t + 1 do
        δt̂^l[t] ← δt̂^l[t] + Σ W^l ε*[τ] δV^{l+1}[t+τ]
      end
    end
    if S^l[t] = 1 then
      δV^l[t] ← −δt̂^l[t] / V*^l[t]
    else
      δV^l[t] ← 0
    end
  end
end

Algorithm 3:
ANTLR back-propagation

for t = T − 1 down to 0 do
  for l = L − 1 down to 0 do
    if l = L − 1 then
      δS^l[t] ← ∂L/∂S_o[t];  δt̂^l[t] ← ∂L/∂t̂_o[t]
    else
      δS^l[t] ← Σ W^l δI^{l+1}[t]
      for τ = −1 to T − t + 1 do
        δt̂^l[t] ← δt̂^l[t] + Σ W^l ε*[τ] δV^{l+1}[t+τ]
      end
    end
    δV^l[t] ← λ_act σ(V^l[t]) δS^l[t]
    if S^l[t] = 1 then
      δV^l[t] ← δV^l[t] − λ_tim δt̂^l[t] / V*^l[t]
    end
    δV^l_dep[t] ← δV^l[t] + α_V (1 − S^l[t]) δV^l_dep[t+1]
    δI^l[t] ← β_V δV^l_dep[t] + α_I (1 − S^l[t]) δI^l[t+1]
  end
end

F Experimental settings
Hyper-parameters used for the loss landscape estimation (Section 3.4) and the random spike-train matching task (Section 4.1) are listed in Table A1. For the latency-coded MNIST task and the N-MNIST task, we grid-searched several hyper-parameter options and report the results of the ones that provided the highest validation accuracy (averaged over 16 trials). Table A2 and Table A3 show the searched hyper-parameter options and the ones used for the final results.

Some of the hyper-parameters were not mentioned in the main text. grad_clip clips the parameter gradients before each update. init_bias_center is a binary option that initializes the bias to a large value to ease the generation of spikes in earlier training iterations. kappa_exp is the decay of the exponential filter used for the spike-train loss. ste_alpha and ste_beta are the coefficients of the surrogate derivative described in Appendix E.
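For concreteness, the exponential spike-train filter governed by kappa_exp can be sketched as follows. This is our own minimal illustration; the actual loss implementation may differ.

```python
import numpy as np

# Exponentially filter a binary spike train: each spike adds 1 to a running
# trace that decays by kappa at every time step. kappa corresponds to kappa_exp.
def exp_filter(spikes, kappa):
    filtered = np.zeros(len(spikes))
    acc = 0.0
    for t, s in enumerate(spikes):
        acc = kappa * acc + s
        filtered[t] = acc
    return filtered

out = exp_filter(np.array([1, 0, 0, 1]), kappa=0.5)
print(out)  # values 1, 0.5, 0.25, 1.125
```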
Name                Value
alpha_v, alpha_i
grad_clip
init_bias_center
kappa_exp
learning_rate
optimizer           'sgd'
ste_alpha
ste_beta

Table A1: Hyper-parameters used for the loss landscape estimation (Section 3.4) and the random spike-train matching task (Section 4.1).

Hyper-parameter     Searched options              Activation     Timing         ANTLR
alpha_v, alpha_i    (0.95, 0.95), (0.99, 0.99)    (0.99, 0.99)   (0.99, 0.99)   (0.99, 0.99)
beta_softmax
epoch               10                            10             10             10
grad_clip
init_bias_center    0, 1                          0              1              1
learning_rate
max_target_spikes
optimizer           'adam'                        'adam'         'adam'         'adam'
ste_alpha
ste_beta            1, 3                          3              -              3
weight_decay        0, 1e-3, 1e-4                 0              0              0

Table A2: Hyper-parameters searched and chosen for the latency-coded MNIST task (Section 4.2).

Hyper-parameter     Searched options              Activation     Timing         ANTLR
alpha_v, alpha_i    (0.95, 0.95), (0.99, 0.99)    (0.99, 0.99)   (0.99, 0.99)   (0.99, 0.99)
beta_softmax
epoch
grad_clip
init_bias_center
learning_rate
max_target_spikes   1, 3, 10 (1∗)                 10 (1∗)        -              -
optimizer           'adam'                        'adam'         'adam'         'adam'
ste_alpha
ste_beta            1, 3                          3              -              3
weight_decay        0, 1e-3, 1e-4                 0              0              0

Table A3: Hyper-parameters searched and chosen for the N-MNIST task (∗ hyper-parameters used in the case with the single-spike coding if they are different) (Section 4.3).