BS4NN: Binarized Spiking Neural Networks with Temporal Coding and Learning

Saeed Reza Kheradpisheh*, Maryam Mirsadeghi, and Timothée Masquelier

Department of Computer Science, Faculty of Mathematical Sciences, Shahid Beheshti University, Tehran, Iran
Department of Electrical Engineering, Amirkabir University of Technology, Tehran, Iran
CerCo UMR 5549, CNRS, Université Toulouse 3, France
Abstract
We recently proposed the S4NN algorithm, essentially an adaptation of backpropagation to multi-layer spiking neural networks that use simple non-leaky integrate-and-fire neurons and a form of temporal coding known as time-to-first-spike coding. With this coding scheme, neurons fire at most once per stimulus, but the firing order carries information. Here, we introduce BS4NN, a modification of S4NN in which the synaptic weights are constrained to be binary (+1 or -1), in order to decrease memory and computation footprints. This was done using two sets of weights: firstly, real-valued weights, updated by gradient descent and used in the backward pass of backpropagation, and secondly, their signs, used in the forward pass. Similar strategies have been used to train (non-spiking) binarized neural networks. The main difference is that BS4NN operates in the time domain: spikes are propagated sequentially, and different neurons may reach their threshold at different times, which increases computational power. We validated BS4NN on two popular benchmarks, MNIST and Fashion-MNIST, and obtained state-of-the-art accuracies for this sort of network (97.0% and 87.3%, respectively) with a negligible accuracy drop with respect to real-valued weights (0.4% and 0.7%, respectively). We also demonstrated that BS4NN outperforms a simple BNN with the same architectures on those two datasets (by 0.2% and 0.9%, respectively), presumably because it leverages the temporal dimension.

* Corresponding author. Email addresses: [email protected] (SRK), [email protected] (MM), [email protected] (TM)
Introduction

Spiking neural networks (SNNs), as the third generation of neural networks, are getting more and more attention due to their higher biological plausibility, hardware friendliness, lower energy demand, and temporal nature [1, 2, 3, 4]. Although SNNs have not yet reached the performance of state-of-the-art artificial neural networks (ANNs) with deep architectures, recent efforts on adapting the gradient descent and backpropagation algorithms to SNNs have led to great achievements [5]. Contrary to artificial neurons with floating-point outputs, spiking neurons communicate via sparse and asynchronous stereotyped spikes, which makes them suitable for event-based computation [1, 2]. That is why neuromorphic implementations of SNNs can be far less energy-hungry than ANN implementations [6], which makes them appealing for real-time embedded AI systems and edge computing solutions. However, as SNNs become larger, they require more storage and computational power. Binarizing the synaptic weights, similar to binarized artificial neural networks (BANNs) [7], could be a good solution to reduce the memory and computational requirements of SNNs.

Although the use of binary (+1 and -1) weights in ANNs is not a very recent idea [8, 9, 10], the early studies could not adapt backpropagation to BANNs. Since binary weights cannot be updated in small amounts, the backpropagation and stochastic gradient descent algorithms cannot be directly applied to BANNs. By proposing BinaryConnect [11, 12], Courbariaux et al. were the first to successfully train deep BANNs using the backpropagation algorithm. They used real-valued weights which are binarized before being used in the forward pass. During backpropagation, using the Straight-Through Estimator (STE), the gradients of the binary weights are simply passed on and applied to the real-valued weights. Soon after, Rastegari et al.
[13] proposed XNOR-Net, which is very similar to BinaryConnect, but it multiplies the binary weights by a per-layer scaling factor (the L1-norm of the real-valued weights) to better approximate the real-valued weights. In order to speed up the learning phase of BANNs, Tang et al. [14] controlled the rate of oscillation of binary weights between -1 and 1 by optimizing the learning rates. They also proposed to use learned scaling factors instead of the L1-norm of the real-valued weights in XNOR-Net. In DoReFa-Net [15], Zhou et al. proposed a model with variable bit-width (down to binary) weights, activations, and even gradients during backpropagation. A more detailed survey on BANNs is provided in [7].

A few recent studies have tried to convert supervised BANNs into equivalent binary SNNs (BSNNs); however, to the best of our knowledge, no other study has aimed at directly training multi-layer supervised SNNs with binary weights. Esser et al. [16] trained ANNs with constrained weights and activations and deployed them as SNNs with binary weights on TrueNorth. Later, in [17], they mapped convolutional ANNs with ternary weights and binary activations to SNNs on TrueNorth. Rueckauer et al. [18] converted BinaryConnect [11], with binary and full-precision activations, into equivalent rate-coded BSNNs. Although their converted BSNN had binary weights, they did not binarize the full-precision parameters of the batch-normalization layers. In [19], Wang et al. converted BinaryConnect networks to rate-coded BSNNs using a weights-thresholds balance conversion method which scales the high-precision batch-normalization parameters of BinaryConnect to -1 or 1. In another study, Lu et al. [20] converted a modified version of XNOR-Net, without batch normalization and bias inputs, into equivalent rate-coded BSNNs.

In this work, we propose a direct supervised learning algorithm to train multi-layer SNNs with binary synaptic weights.
The input layer uses a temporal time-to-first-spike coding [21, 22, 23] to convert the input image into a spike train with one spike per neuron. The non-leaky integrate-and-fire (IF) neurons in the subsequent hidden and output layers integrate incoming spikes through binary (+1 or -1) synapses and emit only one spike, right after the first crossing of their threshold. Inspired by BANNs, we also use a set of real-valued proxy weights, such that the binary weights are simply the signs of the real-valued weights. Hence, in the backward pass, we update the real-valued weights based on the errors made by the binary weights. Concretely, after completing the forward pass with the binary weights, the output layer computes the errors by comparing its actual and target firing times, and then the real-valued synaptic weights are updated using temporal error backpropagation. We evaluated the proposed network on the MNIST [24] and Fashion-MNIST [25] datasets, reaching 97.0% and 87.3% categorization accuracy, respectively.

SNNs can vary in terms of neuronal model, neural connectivity, information coding, and learning strategy, which deeply affect their accuracy, memory, and energy efficiency. The advantages of the proposed BSNN are 1) the use of non-leaky IF neurons with very simple neuronal dynamics, 2) binarized connectivity with low memory and computational cost, 3) the use of a sparse temporal code with at most one spike per neuron, and 4) learning with a direct supervised temporal learning rule which forces the network to make decisions as accurately and early as possible.
Methods
The input layer of the proposed binarized single-spike supervised spiking neural network (BS4NN) converts the input image into a spike train based on a time-to-first-spike coding. These spikes are then propagated through the network, where the binary IF neurons in the hidden and output layers are not allowed to fire more than once per image. Each output neuron is dedicated to a different category, and the first output neuron to fire determines the decision of the network.

The error of each output neuron is computed by comparing its actual firing time with a target firing time. Then, a modified version of the temporal backpropagation algorithm in S4NN [26] is used to update the synaptic weights. During the learning phase, we have two sets of weights: the real-valued weights, $W$, and the corresponding binary weights, $B$, where $B = \mathrm{sign}(W)$. The forward propagation is done with the binary weights, while the error backpropagation and weight updates are done with the real-valued weights. Finally, we put the real-valued weights aside and use the binary weights for inference on test images. Note that some of the following equations are adopted from S4NN [26] and are reproduced here for the sake of the reader.

The input layer converts the input image into a volley of spikes using a single-spike temporal coding scheme known as intensity-to-latency conversion. For images with pixel intensities in the range $[0, I_{max}]$, the firing time of the $i$th input neuron, $t_i$, corresponding to the $i$th pixel intensity, $I_i$, is computed as

$$t_i = \left\lfloor \frac{I_{max} - I_i}{I_{max}} t_{max} \right\rfloor, \quad (1)$$

where $t_{max}$ is the maximum firing time. In this way, input neurons with higher pixel intensities have shorter spike latencies. Here, we used discrete time. Therefore, the spike train of the $i$th input neuron is defined as

$$S_i(t) = \begin{cases} 1 & \text{if } t = t_i, \\ 0 & \text{otherwise}. \end{cases} \quad (2)$$

Subsequent hidden and output layers are comprised of non-leaky IF neurons.
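As a concrete illustration, the intensity-to-latency conversion of Eqs. 1 and 2 can be sketched in a few lines of NumPy (a minimal sketch; the function name and defaults are ours, with $I_{max} = 255$ for 8-bit images and $t_{max} = 256$ as in the experiments):

```python
import numpy as np

def ttfs_encode(image, i_max=255, t_max=256):
    """Intensity-to-latency conversion (Eq. 1): each pixel becomes one input
    neuron that fires exactly once; brighter pixels fire earlier."""
    image = np.asarray(image, dtype=float).ravel()
    return np.floor((i_max - image) / i_max * t_max).astype(int)

# A saturated pixel (255) fires at t = 0; a black pixel (0) fires at t = t_max.
print(ttfs_encode([255, 128, 0]))
```

Each returned value is the single spike time $t_i$ of the corresponding input neuron; the spike train $S_i(t)$ of Eq. 2 is just the indicator of that time.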
The $j$th IF neuron of the $l$th layer receives incoming spikes through binary synaptic weights of -1 or +1 and updates its membrane potential, $V_j^l$, as

$$V_j^l(t) = V_j^l(t-1) + \alpha^l \sum_i B_{ji}^l S_i^{l-1}(t), \quad (3)$$

where $S_i^{l-1}$ and $B_{ji}^l$ are, respectively, the input spike train and the binary synaptic weight connecting the $i$th presynaptic neuron to neuron $j$. Note that $\alpha^l$ is a scaling factor shared between all the neurons of the $l$th layer. The IF neuron fires only once, the first time its membrane potential crosses the threshold $\theta_j^l$:

$$S_j^l(t) = \begin{cases} 1 & \text{if } V_j^l(t) \geq \theta_j^l \text{ and } S_j^l(<t) \neq 1, \\ 0 & \text{otherwise}, \end{cases} \quad (4)$$

where $S_j^l(<t) \neq 1$ checks that the neuron has not fired at any previous time step. Equivalently, one can move the scaling factor $\alpha^l$ from Eq. 3 to Eq. 4 by replacing $\theta_j^l$ with $\theta_j^l / \alpha^l$.

For each input image, we first reset all the membrane voltages to zero and then run the simulation for at most $t_{max}$ time steps. Each output neuron is assigned to a different category, and the output neuron that fires earlier than the others determines the category of the input image. Hence, in the test phase, we do not need to continue the simulation after the first spike in the output layer. If none of the output neurons fires before $t_{max}$, the output neuron with the maximum membrane potential at $t_{max}$ makes the decision. However, during the learning phase, to compute the temporal error and gradients, we need all the neurons in the network to fire at some point; hence, we continue the simulation until $t_{max}$, and if a neuron never fires, we force it to emit a fake spike at time $t_{max}$.

Backward pass

For a categorization task with $C$ categories, we define the temporal error as a function of the actual and target firing times, $e = [e_1, ..., e_C]$, such that

$$e_j = (T_j^o - t_j^o) / t_{max}, \quad (5)$$

where $t_j^o$ and $T_j^o$ are the actual and the target firing times of the $j$th output neuron, respectively. Let's define $\tau$ as the minimum firing time in the output layer (i.e., $\tau = \min\{t_j^o \mid 1 \leq j \leq C\}$).
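To make the pipeline concrete, the forward pass of Eqs. 3 and 4, together with the target firing times and the straight-through weight update defined in the following paragraphs (Eqs. 6 and 8-10), can be sketched in NumPy. This is our own minimal sketch under the paper's stated rules; all names, the dense time loop, and the `grad_B` placeholder (standing for $\partial L/\partial B$ computed with the binary weights) are ours:

```python
import numpy as np

def if_forward(spike_times, B, alpha, theta, t_max=256):
    """One layer of non-leaky IF neurons with binary weights (Eqs. 3-4).
    spike_times: (n_in,) firing time of each presynaptic neuron (one spike each).
    B: (n_out, n_in) matrix of +1/-1 weights; alpha: shared per-layer scale.
    Returns each neuron's firing time (a 'fake spike' at t_max if it stays silent)."""
    V = np.zeros(B.shape[0])
    t_out = np.full(B.shape[0], t_max)
    fired = np.zeros(B.shape[0], dtype=bool)
    for t in range(t_max + 1):
        s = (np.asarray(spike_times) == t).astype(float)  # input spikes at t
        V += alpha * (B @ s)                               # Eq. 3
        crossing = (~fired) & (V >= theta)                 # Eq. 4: fire only once
        t_out[crossing] = t
        fired |= crossing
    return t_out

def output_targets(t_out, label, gamma=1, t_max=256):
    """Relative target firing times (Eq. 6): the correct neuron is pushed to
    fire first, the others not to fire earlier than tau + gamma."""
    t_out = np.asarray(t_out)
    T = t_out.astype(float)
    if np.all(t_out == t_max):      # all silent: only force the correct neuron
        T[label] = t_max - gamma
        return T
    tau = t_out.min()
    T[label] = tau - gamma
    others = (np.arange(len(t_out)) != label) & (t_out < tau + gamma)
    T[others] = tau + gamma
    return T

def ste_update(W, grad_B, eta=0.1):
    """Straight-through estimator (Eqs. 8-10): the gradient computed for
    B = sign(W) is applied directly to the real-valued proxy weights W."""
    return W - eta * grad_B
```

For instance, with `B = [[+1, +1], [-1, -1]]`, `alpha = 1`, `theta = 2`, and input spikes at times 0 and 1, the first neuron crosses the threshold at t = 1 and the second stays silent until the fake spike at `t_max`.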
For an input image belonging to the $i$th category, we have

$$T_j^o = \begin{cases} \tau - \gamma & \text{if } j = i, \\ \tau + \gamma & \text{if } j \neq i \text{ and } t_j^o < \tau + \gamma, \\ t_j^o & \text{if } j \neq i \text{ and } t_j^o \geq \tau + \gamma, \end{cases} \quad (6)$$

where $\gamma$ is a positive constant. This way, the correct neuron is encouraged to fire first, and the others are penalized if they fire earlier than $\tau + \gamma$. In the special case where all the output neurons remain silent during the forward pass (emitting fake spikes at $t_{max}$), we set $T_i^o = t_{max} - \gamma$ and $T_{j \neq i}^o = t_{max}$ to force the correct neuron to fire.

Let's define the squared error loss function as

$$L = \frac{1}{2} \|e\|^2 = \frac{1}{2} \sum_{j=1}^{C} e_j^2. \quad (7)$$

To apply the gradient descent algorithm, we should compute $\partial L / \partial B_{ji}^l$, the gradient of the loss function with respect to the binary weights. However, gradient descent makes small changes to the weights, which cannot be done with binary values. To solve this problem, during the learning phase, we use a set of real-valued weights, $W$, as a proxy, such that

$$B_{ji}^l = \mathrm{sign}(W_{ji}^l), \quad (8)$$

and, as the gradient of the sign function is 0 or undefined, using the straight-through estimator (STE) we approximate $\partial\,\mathrm{sign}(x) / \partial x = 1$; therefore, we have

$$\frac{\partial L}{\partial W_{ji}^l} = \frac{\partial L}{\partial B_{ji}^l}. \quad (9)$$

Now, we can update the real-valued weights as

$$W_{ji}^l = W_{ji}^l - \eta \frac{\partial L}{\partial B_{ji}^l}, \quad (10)$$

where $\eta$ is the learning rate parameter. Let's define

$$\delta_j^l = \frac{\partial L}{\partial t_j^l}, \quad (11)$$

where $t_j^l$ is the firing time of the $j$th neuron of the $l$th layer. Also, following [26], we approximate $\partial t_j^l / \partial B_{ji}^l$ to be $-\alpha^l$ if $t_i^{l-1} < t_j^l$ and 0 otherwise. Therefore, we have

$$\frac{\partial L}{\partial B_{ji}^l} = \frac{\partial L}{\partial t_j^l} \frac{\partial t_j^l}{\partial B_{ji}^l} = \begin{cases} -\alpha^l \delta_j^l & \text{if } t_i^{l-1} < t_j^l, \\ 0 & \text{otherwise}, \end{cases} \quad (12)$$

where for the output layer (i.e., $l = o$) we have

$$\delta_j^o = \frac{\partial L}{\partial e_j} \frac{\partial e_j}{\partial t_j^o} = -\frac{e_j}{t_{max}}, \quad (13)$$

and for the hidden layers (i.
e., $l \neq o$), according to the backpropagation algorithm, we compute the weighted sum of the delta values of the neurons in the following layer:

$$\delta_j^l = \sum_k \frac{\partial L}{\partial t_k^{l+1}} \frac{\partial t_k^{l+1}}{\partial V_k^{l+1}} \frac{\partial V_k^{l+1}}{\partial t_j^l} = \sum_k \delta_k^{l+1} \alpha^{l+1} B_{kj}^{l+1} [t_j^l < t_k^{l+1}], \quad (14)$$

where $k$ iterates over the neurons in layer $l+1$. Similar to [26], we approximate $\partial t_k^{l+1} / \partial V_k^{l+1} = -1$ and $\partial V_k^{l+1} / \partial t_j^l = -\alpha^{l+1} B_{kj}^{l+1}$ if and only if $t_j^l \leq t_k^{l+1}$. To have smooth gradients, we use the real-valued weights, $W_{kj}^{l+1}$, instead of the scaled binary weights, $\alpha^{l+1} B_{kj}^{l+1}$.

We also update the scaling factor $\alpha^l$ as

$$\alpha^l = \alpha^l - \mu \frac{\partial L}{\partial \alpha^l}, \quad (15)$$

where $\mu$ is the learning rate parameter. Therefore, we compute

$$\frac{\partial L}{\partial \alpha^l} = \sum_j \frac{\partial L}{\partial t_j^l} \frac{\partial t_j^l}{\partial V_j^l} \frac{\partial V_j^l}{\partial \alpha^l} = -\sum_j \delta_j^l \sum_i B_{ji}^l [t_i^{l-1} < t_j^l], \quad (16)$$

where $j$ and $i$ iterate over the neurons in layers $l$ and $l-1$, respectively. Here again, similar to [26], we approximate $\partial t_j^l / \partial V_j^l = -1$ and $\partial V_j^l / \partial \alpha^l = \sum_i B_{ji}^l [t_i^{l-1} < t_j^l]$.

Note that before updating the weights, we normalize the gradients as $\delta_j^l = \delta_j^l / \sum_i \delta_i^l$ to avoid exploding and vanishing gradients. Also, we added an L2-norm weight regularization term, $\lambda \sum_l \sum_{i,j} (W_{ji}^l)^2$, to the loss function to avoid overfitting. The parameter $\lambda$ is the regularization parameter, accounting for the degree of weight penalization.

Results

MNIST

In this section, we evaluate BS4NN on the MNIST dataset, which is the most popular benchmark for spiking neural networks [1]. The MNIST dataset contains 60,000 handwritten digits (0 to 9) in images of size 28 × 28 pixels as the train set. The test set contains 10,000 digit images. The initial input-hidden ($W^h$) and hidden-output ($W^o$) weights are drawn from uniform distributions in the ranges [0, 5] and [0, 50], respectively, and the scaling factors of the hidden ($\alpha^h$) and output ($\alpha^o$) layers are tuned through the learning phase. Adaptive parameters, including $\eta$ and $\mu$, are discounted by 30% every ten epochs. Other parameters remain intact in both the learning and testing phases.

Table 1: The structural, initialization, and model parameters used for the MNIST and Fashion-MNIST datasets.

Dataset | Hidden | Output | W^h | W^o | t_max | θ | α^h | α^o | η | µ | γ | λ
MNIST | 600 | 10 | [0, 5] | [0, 50] | 256 | 100 | 5 | 5 | 0.1 | 0.01 | 1 | 10−
Fashion-MNIST | 1000 | 10 | [0, 1] | [0, 1] | 256 | 700 | 5 | 10 | 0.1 | 0.01 | 1 | 10−

Table 2: The recognition accuracies of recent supervised fully connected SNNs with spike-time-based backpropagation on the MNIST dataset. The details of each model, including its input coding scheme, neuron model, synapses, post-synaptic potential (PSP), learning method, and the number of hidden neurons, are provided. (Columns: Model, Coding, Neuron / Synapse / PSP, Learning, Hidden.)

Table 2 presents the categorization accuracy of the proposed BS4NN along with some other SNNs with spike-time-based direct supervised learning algorithms and fully connected architectures. BS4NN is the only network in this table that uses binary weights, and it reaches 97.0% accuracy on MNIST. As mentioned in the Methods section, BS4NN uses a modified version of the temporal backpropagation algorithm in S4NN (Kheradpisheh et al. (2020) [26]) in order to have binary weights. Compared to S4NN, the categorization accuracy of BS4NN dropped by only 0.4%. Although BS4NN is outperformed by the other SNNs by at most 1.4%, its advantages are the use of binary weights, instead of real-valued full-precision weights, and an instantaneous post-synaptic potential (PSP) function. As seen, BS4NN could outperform Tavanaei et al. (2019) [27], which uses real-valued weights
Figure 1: (a) The firing times of the ten output neurons over the test images, ordered by category. (b) The mean firing time of each output neuron (rows) over the images of different digits (columns).

and instantaneous PSPs. Other SNNs use exponential and linear PSP functions, which complicate the neural processing and the learning procedure of the network and, consequently, increase their computational and energy costs.

We also compared BS4NN to a BNN with a similar architecture. To make a fair comparison, inspired by [12], we implemented a BNN with binary weights (-1 and +1) and binary sigmoid activations (0 and 1). The network has a single hidden layer of size 600 and is trained using the ADAM optimizer and the squared hinge loss function for 500 epochs. The learning rate initiates from 10− and exponentially decays, through the learning epochs, down to 10−. Following [32], the initial real-valued weights of each layer are randomly drawn from a uniform distribution in the range $[-1/\sqrt{n}, 1/\sqrt{n}]$, where $n$ is the number of synaptic weights of that layer. As provided in Table 2, the BNN could reach a best accuracy of 96.8% on MNIST, that is, a 0.2% drop with respect to BS4NN (we comment on these results in the Discussion).

The firing times of the ten output neurons over all test images are shown in Figure 1a. Images are ordered by digit category from '0' to '9'. For each test image, the firing time of each neuron is shown by a color-coded dot. As seen, for each category, its corresponding output neuron tends to fire earlier than the others. This is even more evident in Figure 1b, which shows the mean firing time of each output neuron for each digit category. Each output neuron has, by a clear margin, the shortest mean firing time for images of its corresponding digit. Interestingly, BS4NN needs a much longer time to detect digit '1' (188 time steps), which could be due to the use of binary weights. Other digits cover more pixels of the image and therefore produce more early spikes than digit '1'. Since the weights are binary, the few early spikes of digit '1' cannot activate the hidden IF neurons, and hence BS4NN needs to wait for later surrounding spikes to distinguish digit '1' from the other digits.

We further counted the mean number of spikes required by BS4NN to categorize images of each digit category. To this end, we counted the number of spikes in all the layers until the emission of the first spike in the output layer (when the network makes its decision). The mean required spikes of the input and hidden layers are depicted in Figure 2.

Figure 2: The mean required number of spikes in the input layer, the hidden layer, and in total.

All
Figure 3: The trajectory of the membrane potential of all ten output neurons for a sample '9' test image, along with the accumulated input spikes up to time steps 15, 58, 100, 190, and 250.

digit categories but '1', on average, require about 100 spikes in the input layer and 200 spikes in the hidden layer. Digit '1' requires about 300 input spikes, while, similar to the other digits, its hidden layer needs about 100 spikes. As explained above, digit '1' covers fewer pixels than the other digits, and its shape overlaps with the constituent parts of some other digits; hence, due to the use of binary weights, the network has to wait for later input spikes to distinguish digit '1' from the other digits.

Figure 3 shows the time course of the membrane potentials of the output neurons for a sample '9' test image. The membrane potential of the 9th output neuron overtakes the others at the 15th time step and quickly increases until it crosses the threshold at the 58th time step. The accumulated input spikes up to the 15th, 58th, 100th, 190th, and 250th time steps are depicted in this figure. As seen, up to the 15th time step, only a few input spikes have been propagated, and at the 58th time step, with the propagation of a few more input spikes, the 9th output neuron reaches its threshold and determines the category of the input image. Later input, hidden, and output spikes are no longer required by the network.

To evaluate the robustness of the trained BS4NN to input noise, during the test phase we added random jitter noise, drawn from a uniform distribution in the range $[-J, J]$, to the pixels of the input images. The noise level, $J$, varies from 5% to 100% of the maximum pixel intensity, $I_{max}$. Figure 4a shows a sample image contaminated with different levels of jitter noise. The recognition accuracy of the trained model over noisy test images, under different levels of noise, is plotted in Figure 4b. As shown, the recognition accuracy remains above 95% for most noise levels and drops to 79% at the 100% noise level. At higher noise levels, the order of the input spikes can change dramatically, and because BS4NN has only +1 and -1 synaptic weights, even for the insignificant parts of the input images, this affects the behavior of the IF neurons and consequently increases the categorization error rate.

Figure 4: (a) A sample image contaminated with different amounts of jitter noise. (b) The recognition accuracy of the trained BS4NN on test images under different levels of noise.

Figure 5: Reconstruction of the real-valued weights and their corresponding binary weights for sixteen randomly selected hidden neurons.

In a further experiment, we replaced the binary weights of the trained BS4NN with their corresponding real-valued weights and applied them to

Table 3: The recognition accuracies of recent supervised SNNs on the Fashion-MNIST dataset. The details of each model, including its architecture, input coding scheme, neuron model, and learning method, are provided.
Model | Architecture | Neuron | Coding | Synapses | Learning | Acc. (%)
Zhang et al. (2019) [33] | Recurrent SNN | Leaky IF | Rate | Real-value | Spike-train backpropagation | 90.1
Ranjan et al. (2019) [34] | Convolutional SNN | Leaky IF | Rate | Real-value | Spike-rate backpropagation | 89.0
Wu et al. (2020) [35] | Convolutional SNN | Leaky IF | Rate | Real-value | Global-local hybrid learning rule | 93.3
Zhang et al. (2020) [36] | Fully-connected SNN | Leaky IF | Rate | Real-value | Spike-sequence backpropagation | 89.5
Zhang et al. (2020) [36] | Fully-connected SNN | IOW | Rate | Real-value | Spike-sequence backpropagation | 90.2
Hao et al. (2020) [37] | Fully-connected SNN | Leaky IF | Rate | Real-value | Dopamine-modulated STDP | 85.3
S4NN | Fully-connected SNN | IF | Temporal | Real-value | Temporal backpropagation | 88.0
BNN | Fully-connected | Binary sigmoid | Binary | - | ADAM | 86.4
BS4NN (this paper) | Fully-connected SNN | IF | Temporal | Binary | Temporal backpropagation | 87.3

the test images. In other words, we replaced the $\alpha^l B_{ji}^l$ term in Eq. 3 with $W_{ji}^l$. The network reached 89.1% accuracy on the test images, which is far less than the 97.0% accuracy of the binary weights. This shows that, although we update the real-valued proxy weights during the learning phase, we are actually tuning the binary weights, because the loss and gradients are computed based on the binary weights. Figure 5 shows the pairs of real- and binary-valued weights for 16 randomly selected hidden neurons. Dark pixels correspond to negative weights and bright pixels to positive weights. It seems that the hidden neurons tend to detect different variants of the digits and their constituent parts.

To assess the speed-accuracy trade-off in BS4NN, we first trained the network with a threshold of 100 for all the neurons; we then varied the threshold from 0 to 200 for all the neurons and evaluated the network on the test images.

Figure 6: The speed-accuracy trade-off in the pre-trained BS4NN when the threshold varies from 0 to 200.

As shown in Figure 6, the
accuracy peaks around a threshold of 100 and drops as we move to higher or lower threshold values, while the response time (the time to the first spike in the output layer) increases with the threshold. Given this trade-off, by reducing the threshold of the pre-trained BS4NN, one can get faster responses but with lower accuracy. For instance, by setting the threshold to 80, the response time shortens from 112.9 to 44.9 time steps (∼3x faster responses), while the accuracy drops from 97.0% to 91.0%.

The $\alpha^l$ scaling factors are full-precision floating-point parameters, used in our neuronal layers to obtain a better approximation of the real-valued weights by the binary weights. We could round the $\alpha^l$ factors of the pre-trained network down to two decimal places without any change in the categorization accuracy.

Figure 7: Sample images from the Fashion-MNIST dataset.

Fashion-MNIST

Fashion-MNIST [25] is a fashion product image dataset with 10 classes (see Figure 7). Images are gathered from the thumbnails of clothing products on an online shopping website. Fashion-MNIST has the same image size and training/testing splits as MNIST, but it is a more challenging classification task. Here, we used a BS4NN with a single hidden layer of 1000 IF neurons. Details of the parameter values are presented in Table 1. The initial weights of all layers are randomly drawn from a uniform distribution in the range [0, 1]. The learning rate parameters $\eta$ and $\mu$ are discounted by 30% every 10 epochs, and the scaling factors $\alpha^h$ and $\alpha^o$ are trained during the learning phase.

Table 3 summarizes the characteristics and recognition accuracies of recent SNNs on the Fashion-MNIST dataset. BS4NN could reach 87.3% accuracy (a 0.7% drop with respect to S4NN). Apart from BS4NN, all the models use real-valued synaptic weights, spike-rate-based neural coding, and leaky neurons with exponential decay. The mean firing times of the output neurons of BS4NN for each of the ten categories of Fashion-MNIST are illustrated in Figure 8a. As seen, the correct output neuron has the minimum firing time for its corresponding category. However, compared to MNIST, there is a smaller difference between the mean firing times of the correct and some of the other neurons. This could be due to the similarities between instances of different categories. For instance, as shown in Figure 8b, BS4NN confuses ankle boots, sandals, and sneakers.
There is a similar situation for shirts and t-shirts, and also between pullovers and coats: their firing times are close together, and consequently BS4NN sometimes confuses them with each other. The total required number of spikes in each layer, and in the network as a whole, is provided in Figure 8c. The classes that are mostly confused with each other (i.e., shirts, t-shirts, coats, and pullovers) require more spikes in both the input and hidden layers. One reason could be the larger size of these objects in the input image, leading to more early input spikes. But another reason, especially for the hidden layer, could be the need for more discriminative features between these confusable categories.

We also compared BS4NN with a BNN with binary weights (-1 and 1), binary activations (0 and 1), and the same architecture as
BS4NN. The learning rate initiates from 10− and exponentially decays down to 10−. The initial real-valued weights of each layer are randomly drawn from a uniform distribution in the range $[-1/\sqrt{n}, 1/\sqrt{n}]$, where $n$ is the number of synaptic weights of that layer. Interestingly, BS4NN outperforms the BNN by 0.9% accuracy.

Figure 8: (a) The mean firing times of the output neurons over the Fashion-MNIST categories. (b) The confusion matrix of BS4NN on Fashion-MNIST. (c) The mean required number of spikes per category and layer.

Discussion

In this paper, we propose a binarized spiking neural network (called BS4NN) with a direct supervised temporal learning algorithm. To this end, we used a very common approach in the area of BANNs [7]. During the learning phase, we have two sets of weights, real-valued and binary, such that the binary weights are the signs of the real-valued weights. The binary weights are used for the inference and gradient backpropagation, while, in the backward pass, the weight updates are applied to the real-valued weights. The proposed BS4NN uses time-to-first-spike coding [38, 39, 40, 41, 42] to convert image pixels into spike trains, in which input neurons with higher pixel intensities emit spikes with shorter latencies. The subsequent hidden and output layers are comprised of non-leaky IF neurons with binary (+1 or -1) weights that fire once, when they reach their threshold for the first time. The decision is simply made by the first spike in the output layer. The temporal error is then computed by comparing the actual and target firing times. Gradients are backpropagated through the network and applied to the real-valued weights. Target firing times are computed relative to the actual firing times of the output neurons, to push the correct output neuron to fire earlier than the others. This forces BS4NN to make quick and accurate decisions with the fewest possible number of spikes (high sparsity).

In our experiments, BS4NN reached 97.0% and 87.3% accuracy on the MNIST and Fashion-MNIST datasets, respectively.
Although BS4NN cannot beat the real-valued SNNs in terms of accuracy, it has several computational, memory, and energy advantages which make it suitable for hardware and neuromorphic implementations. Interestingly, BS4NN also outperformed BNNs with the same architectures on MNIST and Fashion-MNIST, by 0.2% and 0.9% accuracy, respectively. This improvement with respect to the BNN could be due to the use of time in the time-to-first-spike coding and temporal backpropagation of BS4NN. Both networks have binary activations and binary weights, but the advantage of BS4NN is its use of the temporal information encoded in spike times.

Instead of real-valued weights, BS4NN uses binary synapses with only one full-precision scaling factor per layer. This can be very important for memory optimization in hardware implementations, where every synaptic weight requires a separate memory space. If one implements the binary synapses with a single bit of memory each, the network size can be reduced by 32x compared to a network with 32-bit floating-point weights [13, 43]. It also eases the implementation of multiplicative synapses by replacing them with one-unit increment and decrement operations. Hence, it can be important for reducing computational costs and energy consumption [13, 43].

The use of non-leaky IF neurons, instead of complicated neuron models such as SRM [29] and LIF [36, 44], makes BS4NN more computationally efficient and hardware friendly. It might be possible to efficiently implement leakage in analog hardware, given the physical features of transistors and capacitors [6], but it is always costly to implement in digital hardware. To do so, one might decrease the membrane potential of all neurons periodically (e.g., every millisecond; clock-driven) [45], or whenever an input spike is received by a neuron (event-based) [46, 47].
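The 32x memory figure can be checked with a quick sketch (our own illustration using NumPy's bit packing; the 600 × 784 shape matches the MNIST architecture of this paper):

```python
import numpy as np

# A hidden layer of 600 neurons fully connected to 784 inputs, with +/-1 weights.
B = np.where(np.random.rand(600, 784) < 0.5, -1.0, 1.0).astype(np.float32)

packed = np.packbits(B.reshape(-1) > 0)   # 1 bit per synapse
float_bytes = B.size * B.itemsize          # 4 bytes per synapse as float32

print(float_bytes / packed.nbytes)         # -> 32.0
```

Since the sign carries all the information, a hardware implementation can store and transfer the packed representation and unpack (or index bits directly) at inference time.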
The first approach costs energy, and the latter needs extra memory to store the last firing times.

The instantaneous synapses used in BS4NN are far simpler to implement than the exponential [28], alpha [29], and linear [30, 31, 48] synaptic currents, and cost much less energy and computation. With instantaneous synapses, each input spike causes a sudden potential increment or decrement, whereas with current-based synapses, each input spike causes the potential to be updated over several consecutive time steps (which requires an extra state parameter).

As mentioned above, BS4NN uses single-spike neural coding throughout the network. The input layer employs time-to-first-spike coding, by which each input neuron fires only once (with shorter latencies for stronger inputs). Neurons in the subsequent layers are also allowed to fire at most once, and only when they reach their threshold for the first time. In addition, the proposed temporal learning algorithm forces BS4NN to rely on earlier spikes and respond as quickly as possible. This cocktail has been shown to take much less energy and time on neuromorphic devices compared to rate-coded SNNs [49, 50], with up to 15 times lower energy consumption and 5 times faster decisions [51].

Recently, efforts have been made to convert pre-trained BANNs into equivalent BSNNs with spike-rate-based neural coding [18, 19, 20]. However, these networks do not exploit the temporal advantages of SNNs that can be obtained through a direct learning algorithm. Due to the non-differentiability of the thresholding activation function in spiking neurons, it is not straightforward to apply backpropagation and gradient descent to SNNs.
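A minimal clock-driven sketch of such a non-leaky IF layer with instantaneous +1/-1 synapses is given below. The simulation loop, time resolution, and reset-by-saturation trick are our assumptions, not the paper's implementation.

```python
import numpy as np

def if_layer(spike_times, w_bin, threshold, t_max=256):
    """Non-leaky IF neurons with instantaneous +1/-1 synapses.
    Each input spike adds its weight to the potential at its firing
    time; a neuron fires once, when it first crosses the threshold.
    Returns each neuron's firing time (t_max if it never fires)."""
    n_in, n_out = w_bin.shape
    v = np.zeros(n_out)                    # membrane potentials
    fired = np.full(n_out, t_max)          # t_max = "no spike"
    for t in range(t_max):
        # instantaneous synapse: one increment/decrement per spike,
        # no leak and no persistent synaptic current
        active = spike_times == t
        if active.any():
            v += w_bin[active].sum(axis=0)
        newly = (v >= threshold) & (fired == t_max)
        fired[newly] = t
        v[newly] = -np.inf                 # at most one spike per neuron
    return fired
```

For example, with two inputs firing at t=0 and t=1 through +1 synapses and a threshold of 2, the output neuron fires at t=1.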
Various solutions have been proposed to tackle this problem, including computing gradients with respect to spike rates instead of single spikes [52, 53, 54, 55], using differentiable smoothed spike functions [56], using surrogate gradients for the threshold function in the backward pass [57, 58, 59, 60, 61, 62], and transfer learning by sharing weights between the SNN and an ANN [49, 63]. In another approach, known as latency learning, the neuron's activity is defined by the firing time of its first spike; hence there is no need to compute the gradient of the thresholding function. In return, the firing time must be defined as a function of the membrane potential [26, 29, 30, 31, 36, 64, 65], or directly as a function of the firing times of the presynaptic neurons [28]. In any case, all the aforementioned learning strategies work with full-precision real-valued weights, and future studies could assess their suitability for BSNNs.
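As a toy illustration of the surrogate-gradient family mentioned above (not the method used in BS4NN, which relies on latency learning), the step function's zero-almost-everywhere derivative is replaced in the backward pass by a smooth bump; the fast-sigmoid-style surrogate below is one common choice, and the parameter `beta` is illustrative:

```python
import numpy as np

def heaviside(v, threshold):
    """Forward pass: spike generation, a non-differentiable step."""
    return (v >= threshold).astype(float)

def surrogate_grad(v, threshold, beta=1.0):
    """Backward pass only: a smooth stand-in for the step's derivative,
    peaking at the threshold and decaying with distance from it."""
    return 1.0 / (beta * np.abs(v - threshold) + 1.0) ** 2
```

During training, `heaviside` is used to propagate spikes forward, while `surrogate_grad` supplies the pseudo-derivative that lets errors flow backward through the threshold.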
References

[1] A. Tavanaei, M. Ghodrati, S. R. Kheradpisheh, T. Masquelier and A. Maida, Deep learning in spiking neural networks, Neural Networks (2019) 47–63.
[2] M. Pfeiffer and T. Pfeil, Deep learning with spiking neurons: opportunities and challenges, Frontiers in Neuroscience (2018) p. 774.
[3] A. Taherkhani, A. Belatreche, Y. Li, G. Cosma, L. P. Maguire and T. M. McGinnity, A review of learning in biologically plausible spiking neural networks, Neural Networks (2020) 253–272.
[4] B. Illing, W. Gerstner and J. Brea, Biologically plausible deep learning–but how far can we go with shallow networks?, Neural Networks (2019).
[5] X. Wang, X. Lin and X. Dang, Supervised learning in spiking neural networks: A review of algorithms and evaluations, Neural Networks (2020).
[6] K. Roy, A. Jaiswal and P. Panda, Towards spike-based machine intelligence with neuromorphic computing, Nature (nov 2019) 607–617.
[7] T. Simons and D.-J. Lee, A review of binarized neural networks, Electronics (6) (2019) p. 661.
[8] D. Saad and E. Marom, Training feed forward nets with binary weights via a modified CHIR algorithm, Complex Systems (5) (1990).
[9] S. S. Venkatesh, Directed drift: A new linear threshold algorithm for learning binary weights on-line, Journal of Computer and System Sciences (2) (1993) 198–217.
[10] C. Baldassi, A. Braunstein, N. Brunel and R. Zecchina, Efficient supervised learning in networks with binary synapses, Proceedings of the National Academy of Sciences (26) (2007) 11079–11084.
[11] M. Courbariaux, Y. Bengio and J.-P. David, Binaryconnect: Training deep neural networks with binary weights during propagations, Advances in Neural Information Processing Systems, 2015, pp. 3123–3131.
[12] M. Courbariaux, I. Hubara, D. Soudry, R. El-Yaniv and Y. Bengio, Binarized neural networks: Training deep neural networks with weights and activations constrained to +1 or -1, arXiv preprint arXiv:1602.02830 (2016).
[13] M. Rastegari, V. Ordonez, J. Redmon and A. Farhadi, Xnor-net: Imagenet classification using binary convolutional neural networks, European Conference on Computer Vision, Springer, 2016, pp. 525–542.
[14] W. Tang, G. Hua and L. Wang, How to train a compact binary neural network with high accuracy?, Thirty-First AAAI Conference on Artificial Intelligence, 2017.
[15] S. Zhou, Y. Wu, Z. Ni, X. Zhou, H. Wen and Y. Zou, Dorefa-net: Training low bitwidth convolutional neural networks with low bitwidth gradients, arXiv preprint arXiv:1606.06160 (2016).
[16] S. K. Esser, R. Appuswamy, P. Merolla, J. V. Arthur and D. S. Modha, Backpropagation for energy-efficient neuromorphic computing, Advances in Neural Information Processing Systems 28, eds. C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett (Curran Associates, Inc., 2015), pp. 1117–1125.
[17] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswamy, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. di Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner and D. S. Modha, Convolutional networks for fast, energy-efficient neuromorphic computing, Proceedings of the National Academy of Sciences (41) (2016) 11441–11446.
[18] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer and S.-C. Liu, Conversion of continuous-valued deep networks to efficient event-driven networks for image classification, Frontiers in Neuroscience (2017) p. 682.
[19] Y. Wang, Y. Xu, R. Yan and H. Tang, Deep spiking neural networks with binary weights for object recognition, IEEE Transactions on Cognitive and Developmental Systems (2020).
[20] S. Lu and A. Sengupta, Exploring the connection between binary and spiking neural networks, arXiv preprint arXiv:2002.10064 (2020).
[21] S. R. Kheradpisheh, M. Ganjtabesh, S. J. Thorpe and T. Masquelier, Stdp-based spiking deep convolutional neural networks for object recognition, Neural Networks (2018) 56–67.
[22] M. Mozafari, S. R. Kheradpisheh, T. Masquelier, A. Nowzari-Dalini and M. Ganjtabesh, First-spike-based visual categorization using reward-modulated stdp, IEEE Transactions on Neural Networks and Learning Systems (12) (2018) 6178–6190.
[23] S. R. Kheradpisheh, M. Ganjtabesh and T. Masquelier, Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition, Neurocomputing (sep 2016) 382–392.
[24] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., Gradient-based learning applied to document recognition, Proceedings of the IEEE (11) (1998) 2278–2324.
[25] H. Xiao, K. Rasul and R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747 (2017).
[26] S. R. Kheradpisheh and T. Masquelier, Temporal backpropagation for spiking neural networks with one spike per neuron, International Journal of Neural Systems (06) (2020) p. 2050027, PMID: 32466691.
[27] A. Tavanaei and A. Maida, Bp-stdp: Approximating backpropagation using spike timing dependent plasticity, Neurocomputing (2019) 39–47.
[28] H. Mostafa, Supervised learning based on temporal coding in spiking neural networks, IEEE Transactions on Neural Networks and Learning Systems (7) (2017) 3227–3235.
[29] I. M. Comsa, K. Potempa, L. Versari, T. Fischbacher, A. Gesmundo and J. Alakuijala, Temporal coding in spiking neural networks with alpha synaptic function, arXiv (2019) p. 1907.13223.
[30] M. Zhang, J. Wang, Z. Zhang, A. Belatreche, J. Wu, Y. Chua, H. Qu and H. Li, Spike-timing-dependent back propagation in deep spiking neural networks, arXiv preprint arXiv:2003.11837 (2020).
[31] Y. Sakemi, K. Morino, T. Morie and K. Aihara, A supervised learning algorithm for multilayer spiking neural networks based on temporal coding toward energy-efficient vlsi processor design, arXiv preprint arXiv:2001.05348 (2020).
[32] X. Glorot and Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
[33] W. Zhang and P. Li, Spike-train level backpropagation for training deep recurrent spiking neural networks, Advances in Neural Information Processing Systems, 2019, pp. 7802–7813.
[34] J. A. K. Ranjan, T. Sigamani and J. Barnabas, A novel and efficient classifier using spiking neural network, The Journal of Supercomputing (2019) 1–16.
[35] Y. Wu, R. Zhao, J. Zhu, F. Chen, M. Xu, G. Li, S. Song, L. Deng, G. Wang, H. Zheng et al., Brain-inspired global-local hybrid learning towards human-like intelligence, arXiv preprint arXiv:2006.03226 (2020).
[36] W. Zhang and P. Li, Temporal spike sequence learning via backpropagation for deep spiking neural networks, arXiv preprint arXiv:2002.10085 (2020).
[37] Y. Hao, X. Huang, M. Dong and B. Xu, A biologically plausible supervised learning method for spiking neural networks using the symmetric stdp rule, Neural Networks (2020) 387–395.
[38] M. Mozafari, M. Ganjtabesh, A. Nowzari-Dalini, S. J. Thorpe and T. Masquelier, Bio-inspired digit recognition using reward-modulated spike-timing-dependent plasticity in deep convolutional networks, Pattern Recognition (2019) 87–95.
[39] M. Mozafari, M. Ganjtabesh, A. Nowzari-Dalini and T. Masquelier, SpykeTorch: Efficient simulation of convolutional spiking neural networks with at most one spike per neuron, Frontiers in Neuroscience (jul 2019) 1–12.
[40] R. Vaila, J. Chiasson and V. Saxena, Feature extraction using spiking convolutional neural networks, Proceedings of the International Conference on Neuromorphic Systems - ICONS '19 (ACM Press, New York, New York, USA, 2019), pp. 1–8.
[41] R. Vaila, J. Chiasson and V. Saxena, Deep convolutional spiking neural networks for image classification, arXiv preprint arXiv:1903.12272 (2019).
[42] P. Kirkland, G. Di Caterina, J. Soraghan and G. Matich, Spikeseg: Spiking segmentation via stdp saliency mapping, International Joint Conference on Neural Networks, 2020.
[43] B. McDanel, S. Teerapittayanon and H. Kung, Embedded binarized neural networks, arXiv preprint arXiv:1709.02260 (2017).
[44] T. Masquelier and S. R. Kheradpisheh, Optimal localist and distributed coding of spatiotemporal spike patterns through stdp and coincidence detection, Frontiers in Computational Neuroscience (2018) p. 74.
[45] A. Yousefzadeh, T. Masquelier, T. Serrano-Gotarredona and B. Linares-Barranco, Hardware implementation of convolutional STDP for on-line visual feature learning, (may 2017) 1–4.
[46] G. Orchard, C. Meyer, R. Etienne-Cummings, C. Posch, N. Thakor and R. Benosman, HFirst: A temporal approach to object recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence (2015).
[47] A. Yousefzadeh, T. Serrano-Gotarredona and B. Linares-Barranco, Fast pipeline 128x128 pixel spiking convolution core for event-driven vision processing in FPGAs (IEEE, jun 2015), pp. 1–8.
[48] B. Rueckauer and S.-C. Liu, Conversion of analog to spiking neural networks using sparse temporal coding, IEEE, 2018, pp. 1–5.
[49] S. P, K. T. N. Chu, Y. Tavva, J. Wu, M. Zhang, H. Li, T. E. Carlson et al., You only spike once: Improving energy-efficient neuromorphic inference to ann-level accuracy, arXiv preprint arXiv:2006.09982 (2020).
[50] J. Göltz, A. Baumbach, S. Billaudelle, A. Kungl, O. Breitwieser, K. Meier, J. Schemmel, L. Kriener and M. Petrovici, Fast and deep neuromorphic learning with first-spike coding, Proceedings of the Neuro-inspired Computational Elements Workshop, 2020, pp. 1–3.
[51] S. Oh, D. Kwon, G. Yeom, W.-M. Kang, S. Lee, S. Y. Woo, J. S. Kim, M. K. Park and J.-H. Lee, Hardware implementation of spiking neural networks using time-to-first-spike encoding, arXiv preprint arXiv:2006.05033 (2020).
[52] E. Hunsberger and C. Eliasmith, Spiking deep networks with lif neurons, arXiv (2015) p. 1510.08829.
[53] J. H. Lee, T. Delbruck and M. Pfeiffer, Training deep spiking neural networks using backpropagation, Frontiers in Neuroscience (2016) p. 508.
[54] E. O. Neftci, C. Augustine, S. Paul and G. Detorakis, Event-driven random backpropagation: Enabling neuromorphic deep learning machines, Frontiers in Neuroscience (2017) p. 324.
[55] F. Zenke and S. Ganguli, Superspike: Supervised learning in multilayer spiking neural networks, Neural Computation (6) (2018) 1514–1541.
[56] D. Huh and T. J. Sejnowski, Gradient descent for spiking neural networks, Advances in Neural Information Processing Systems, 2018, pp. 1433–1443.
[57] E. O. Neftci, H. Mostafa and F. Zenke, Surrogate gradient learning in spiking neural networks, arXiv (2019) p. 1901.09948.
[58] S. M. Bohte, Error-backpropagation in networks of fractionally predictive spiking neurons, International Conference on Artificial Neural Networks, Springer, 2011, pp. 60–68.
[59] S. K. Esser, P. A. Merolla, J. V. Arthur, A. S. Cassidy, R. Appuswama, A. Andreopoulos, D. J. Berg, J. L. McKinstry, T. Melano, D. R. Barch, C. d. Nolfo, P. Datta, A. Amir, B. Taba, M. D. Flickner and D. S. Modha, Convolutional networks for fast energy-efficient neuromorphic computing, Proceedings of the National Academy of Sciences of USA (41) (2016) 11441–11446.
[60] S. B. Shrestha and G. Orchard, Slayer: Spike layer error reassignment in time, Advances in Neural Information Processing Systems, 2018, pp. 1412–1421.
[61] G. Bellec, D. Salaj, A. Subramoney, R. Legenstein and W. Maass, Long short-term memory and learning-to-learn in networks of spiking neurons, Advances in Neural Information Processing Systems, 2018, pp. 787–797.
[62] R. Zimmer, T. Pellegrini, S. F. Singh and T. Masquelier, Technical report: supervised training of convolutional spiking neural networks with pytorch, arXiv preprint arXiv:1911.10124 (2019).
[63] J. Wu, Y. Chua, M. Zhang, G. Li, H. Li and K. C. Tan, A tandem learning rule for efficient and rapid inference on deep spiking neural networks, arXiv (2019) arXiv–1907.
[64] S. M. Bohte, H. La Poutré and J. N. Kok, Error-backpropagation in temporally encoded networks of spiking neurons, Neurocomputing (2000) 17–37.
[65] S. Zhou, Y. Chen, Q. Ye and J. Li, Direct training based spiking convolutional neural networks for object recognition, arXiv preprint arXiv:1909.10837 (2019).