Deep Residual Learning in Spiking Neural Networks
Wei Fang, Zhaofei Yu, Yanqi Chen, Tiejun Huang, Timothée Masquelier, Yonghong Tian
Spike-based Residual Blocks
Department of Computer Science and Technology, Peking University
Peng Cheng Laboratory, Shenzhen 518055, China
Centre de Recherche Cerveau et Cognition (CERCO), UMR 5549, CNRS - Univ. Toulouse 3, Toulouse, France
Abstract
Deep Spiking Neural Networks (SNNs) are harder to train than ANNs because of their discrete binary activation and spatio-temporal domain error back-propagation. Considering the huge success of ResNet in ANNs' deep learning, it is natural to attempt to use residual learning to train deep SNNs. Previous Spiking ResNet used a residual block similar to the standard block of ResNet, which we regard as inadequate for SNNs and which still causes the degradation problem. In this paper, we propose the spike-element-wise (SEW) residual block and prove that it can easily implement residual learning. We evaluate our SEW ResNet on ImageNet. The experimental results show that SEW ResNet can obtain higher performance by simply adding more layers, providing a simple method to train deep SNNs.
Artificial Neural Networks (ANNs) have achieved great success in many tasks, including image classification, object detection, speech recognition, machine translation, and gaming. One of the critical factors for ANNs' success is deep learning, which uses multiple layers to learn representations of data with multiple levels of abstraction [13]. [24] and [26] have shown that the network depth is crucial for a network's performance, and deeper and deeper network structures [12] [25] [6] [7] have been proposed, with better and better performance.

Spiking Neural Networks (SNNs) are regarded as a potential competitor of Artificial Neural Networks (ANNs) for their high biological plausibility, event-driven property, and low power consumption [19]. Recently, deep learning methods have been introduced into SNNs, and deep SNNs achieve performance close to ANNs on some simple tasks [27]. However, deep SNNs are harder to train than ANNs because of their discrete binary activation and spatio-temporal domain error back-propagation.

To address the difficulty of training deep ANNs, residual learning was proposed [6]. With residual blocks, ResNet can still converge even with 1000 layers. It would be natural to use residual learning in SNNs. In this paper, we discuss how to efficiently implement residual learning in deep SNNs.

ANN-to-SNN conversion (ANN2SNN) [9] [20] and the surrogate gradient method [17] are the two main methods to obtain deep SNNs. The ANN2SNN method first trains an ANN with ReLU activation and then converts the ANN to an SNN by replacing ReLU with spiking neurons and applying some extra normalization. Some conversion methods have shown near-lossless accuracy [5] [4]. But the conversion is based on rate coding; therefore, many time-steps are required to approximate the original ANN [20], which increases the SNN's latency. The surrogate gradient method [15] [28] [23] trains the SNN directly by gradient descent, using surrogate derivatives to define the derivative of the threshold-triggered firing mechanism. The SNN trained by the surrogate method is not limited to rate coding and can be applied to temporal tasks, e.g., classifying neuromorphic datasets [28] [2].
Similar to ResNet in ANNs, the spiking version of ResNet has been explored in SNNs. [8] first applied the residual structure in ANN2SNN with scaled shortcuts in the SNN to match the original ANN's activations. [22] proposed Spike-Norm to balance the SNN's threshold and verified their method by converting VGG and ResNet to SNNs. [14] evaluated their custom surrogate methods on shallow ResNets whose depths are no more than ResNet-11. [30] proposed threshold-dependent batch normalization (tdBN) to replace naive BN [10] and successfully trained Spiking ResNet directly with the surrogate gradient by adding tdBN in shortcuts.
The residual block is the basic component of ResNet. Fig. 1 shows the basic block in ResNet [6] (left) and in Spiking ResNets [30] [8] [14] (right), which simply mimics the block in ANNs by replacing the ReLU activation layers with spiking neurons (SN).

One of the critical concepts in ResNet is identity mapping. [6] noted that if the added layers implement identity mappings, a deeper model should have training error no greater than its shallower counterpart. However, solvers may have difficulties in training the added layers to identity mappings, which results in deeper models performing worse than shallower models (the degradation problem). Hence, residual learning was proposed by [6] and implemented by a shortcut connection, as shown in Fig. 1 (left). Denoting the stacked nonlinear layers as F, the two basic blocks in Fig. 1 can be formulated as

Y = ReLU(F(X) + X),   (1)

O^t = SN(F(S^t) + S^t).   (2)

Figure 1: Basic blocks in ResNet (left) and Spiking ResNet (right).

Eq. (1) can easily implement the identity mapping. When F(X) ≡ 0 and X is the activations of the previous ReLU layer, then Y = ReLU(X) = X. However, we believe that using Eq. (2) for direct training of Spiking ResNet is ill-conditioned, as Eq. (2) can hardly implement the identity mapping. When F(S^t) ≡ 0, O^t = SN(S^t) ≠ S^t unless we set SN as IF neurons with V_th = 1, which limits the SNN's diversity. Note that the right block in Fig. 1 is still suitable for ANN2SNN with extra normalization [8] [22], because the SNN converted from an ANN aims to match the original ANN.

Hence, we propose the spike-element-wise (SEW) residual block to implement residual learning, as shown in Fig. 2. It can be formulated as

O^t = g(F(S^t), S^t),   (3)

where g is an element-wise function taking two spike tensors as inputs. The SEW residual block is similar to the "ReLU before addition" design in [7], which is criticized in ANNs for its infinite output range and worse accuracy. Due to the binary property of spikes, we can easily find different g that satisfy the identity mapping, as shown in Tab. 1.

Figure 2: The spike-element-wise (SEW) residual block.
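To make the block concrete, below is a minimal PyTorch sketch of a SEW block (Fig. 2, Eq. (3)). It is an illustrative reconstruction, not the released SpikingJelly implementation: the names (ATan, IFNeuron, SEWBlock, G_FUNCTIONS) are ours, the neuron is a simplified single-step unit rather than the full IF model of Eqs. (4)-(7) given later, and the element-wise functions correspond to Table 1 below.

```python
import math

import torch
import torch.nn as nn


class ATan(torch.autograd.Function):
    """Heaviside spike generation; the backward pass uses an arctan surrogate."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0.).to(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # surrogate derivative sigma'(x) = 1 / (1 + (pi * x)^2)
        return grad_output / (1. + (math.pi * x) ** 2)


class IFNeuron(nn.Module):
    """Simplified single-step spiking unit: fire when the input crosses v_th."""

    def __init__(self, v_th: float = 1.0):
        super().__init__()
        self.v_th = v_th

    def forward(self, x):
        return ATan.apply(x - self.v_th)


# Candidate element-wise functions g (see Table 1), acting on {0, 1} tensors.
G_FUNCTIONS = {
    "ADD":  lambda a, s: a + s,          # spike count in {0, 1, 2}
    "AND":  lambda a, s: a * s,          # logical AND
    "NAND": lambda a, s: (1. - a) * s,   # (NOT F(S^t)) AND S^t
    "OR":   lambda a, s: a + s - a * s,  # logical OR
}


class SEWBlock(nn.Module):
    """Spike-element-wise residual block: O^t = g(F(S^t), S^t),
    with F = Conv-BN-SN-Conv-BN-SN as in Fig. 2."""

    def __init__(self, channels: int, g: str = "ADD"):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            IFNeuron(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            IFNeuron(),
        )
        self.g = G_FUNCTIONS[g]

    def forward(self, s):
        return self.g(self.body(s), s)


if __name__ == "__main__":
    s = (torch.rand(2, 16, 8, 8) > 0.5).float()  # a binary spike tensor S^t
    print(SEWBlock(16, g="ADD")(s).shape)        # torch.Size([2, 16, 8, 8])
```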
Name | Expression of g(F(S^t), S^t)
ADD  | F(S^t) + S^t
AND  | F(S^t) ∧ S^t
NAND | (¬F(S^t)) ∧ S^t
OR   | F(S^t) ∨ S^t

Table 1: List of g.

When using ADD, NAND, and OR, the identity mapping can be obtained by setting F(S^t) ≡ 0, which is usually implemented by setting the weight and bias of the last BN in F to zero. When using AND, we can set F(S^t) ≡ 1 to get the identity mapping, which can also be implemented by setting the last BN's weight to zero and its bias to a constant large enough to cause spikes, e.g., the V_th of the last SN.

Note that all g in Table 1 output spikes (i.e., binary tensors) except ADD, which outputs a spike count (0, 1, or 2). However, processing these spike counts should not be an issue for a neuromorphic chip. Indeed, considering that the layer following these non-spike outputs is usually a convolutional or fully-connected layer performing a linear transform, we can swap the order of ADD and the linear transform to obtain the desired binary communication, as checked below. We denote the linear transform as O^t = W S^t + B, where W is the weight and B is the bias. When applying ADD first and the linear transform second on input spikes S_1^t, S_2^t, we have O_{a-l}^t = W (S_1^t + S_2^t) + B. If we apply the linear transform first and ADD second with biases B_1, B_2, we can set B_1 = B and B_2 = 0 and get O_{l-a}^t = (W S_1^t + B_1) + (W S_2^t + B_2) = O_{a-l}^t, which can be used for inference on neuromorphic chips with acceleration for spikes.
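To make the equivalence explicit, here is a toy numerical check in PyTorch; the shapes and values are arbitrary, and W, B, S1, S2 simply stand in for the weight, bias, and two binary spike tensors above.

```python
import torch

torch.manual_seed(0)
W = torch.rand(4, 6)                # weight of the linear transform
B = torch.rand(4)                   # bias of the linear transform
S1 = (torch.rand(6) > 0.5).float()  # two binary spike tensors
S2 = (torch.rand(6) > 0.5).float()

# ADD first, then the linear transform: operates on a non-binary spike count.
O_add_linear = W @ (S1 + S2) + B

# Linear transform first (binary inputs only), then ADD, with B1 = B, B2 = 0.
O_linear_add = (W @ S1 + B) + (W @ S2 + 0.)

print(torch.allclose(O_add_linear, O_linear_add))  # True
```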
To train SNNs with OR by gradient descent, we need the surrogate gradient method to re-define the derivative of OR, e.g., using ADD as the gradient surrogate function of OR.

We use the same formulations as in [2] to model spiking neurons and networks, which describe all kinds of spiking neurons with three discrete equations:

H^t = f(V^{t-1}, X^t),   (4)

S^t = Θ(H^t − V_th),   (5)

V^t = H^t (1 − S^t) + V_reset S^t,   (6)

where X^t is the input current at time-step t, H^t is the membrane potential after neuronal dynamics, V_th is the threshold, S^t is the output spike, V^t is the membrane potential after the trigger of a spike, V_reset is the reset potential, and Θ(x) is the Heaviside step function, defined by Θ(x) = 1 for x ≥ 0 and Θ(x) = 0 for x < 0. f in Eq. (4) is the neuronal dynamics and varies from neuron to neuron. Eq. (5) and Eq. (6) describe the spiking neuron's firing and resetting, which are the same for all kinds of spiking neurons. Here we use the Integrate-and-Fire (IF) neuron, whose neuronal dynamics Eq. (4) is as follows:

H^t = V^{t-1} + X^t.   (7)

The surrogate gradient method is used to define Θ'(x) = σ'(x) during error back-propagation, where σ(x) is the surrogate function.
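As an illustration of Eqs. (4)-(7), here is a minimal multi-step IF neuron in PyTorch with an arctan surrogate standing in for Θ'(x) during back-propagation. This is a hedged sketch with hypothetical names (Heaviside, MultiStepIFNeuron), not the SpikingJelly implementation used for the experiments; the detached reset follows the training detail mentioned below.

```python
import math

import torch
import torch.nn as nn


class Heaviside(torch.autograd.Function):
    """Theta(x) from Eq. (5); backward uses sigma'(x) = 1 / (1 + (pi * x)^2)."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return (x >= 0.).to(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output / (1. + (math.pi * x) ** 2)


class MultiStepIFNeuron(nn.Module):
    """Integrate-and-Fire neuron following Eqs. (4)-(7):
    H^t = V^{t-1} + X^t,  S^t = Theta(H^t - V_th),
    V^t = H^t (1 - S^t) + V_reset S^t."""

    def __init__(self, v_th: float = 1.0, v_reset: float = 0.0):
        super().__init__()
        self.v_th, self.v_reset = v_th, v_reset

    def forward(self, x_seq):
        # x_seq: input currents with shape [T, ...], one slice per time-step.
        v = torch.zeros_like(x_seq[0])
        spikes = []
        for x_t in x_seq:
            h = v + x_t                               # Eq. (7): charging
            s = Heaviside.apply(h - self.v_th)        # Eq. (5): firing
            s_d = s.detach()                          # detach the reset path [29]
            v = h * (1. - s_d) + self.v_reset * s_d   # Eq. (6): hard reset
            spikes.append(s)
        return torch.stack(spikes)


if __name__ == "__main__":
    x_seq = torch.full((4, 3), 0.4)    # constant input current over T = 4 steps
    print(MultiStepIFNeuron()(x_seq))  # fires once the charge crosses V_th = 1
```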
We evaluate our method by building SEW ResNet with SEW residual blocks and training it on the ImageNet 2012 classification dataset [21] with 1.28 million training images and 50k validation images. The test server of ImageNet 2012 is no longer available, so we cannot report test accuracy. Hence, we use 85% of the samples of each class in the original training set as the new training set and the remaining 15% as the validation set. The test accuracy reported in this paper is obtained by taking the model with the maximum validation accuracy and evaluating it on the original validation set.

We use the Adam optimizer [11], the cosine annealing learning rate schedule [16] with T_schedule = 30, and train for 180 epochs. We set batch size bs = 128, learning rate lr = 0. for ResNet-18 and ResNet-34, and bs = 32, lr = 0. for deeper structures to reduce memory consumption. We keep bs · lr constant, following [3]. The surrogate gradient function is σ(x) = (1/π) arctan(πx) + 1/2, thus σ'(x) = 1/(1 + (πx)^2), which we introduced in [2]. Note that we set V_reset = 0 and V_th = 1 for all neurons. We detach S^t in the neuronal reset Eq. (6) from the backward computational graph to improve performance, as recommended by [29]. All code is based on SpikingJelly [1] and PyTorch [18]. We choose ADD for g.
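The optimizer and schedule above take only a few lines of PyTorch; the sketch below is illustrative (the model constructor and training helper are hypothetical, and the learning rate is left as a parameter since it is scaled with the batch size in the text).

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import CosineAnnealingLR


def make_optimizer(model: torch.nn.Module, lr: float, t_schedule: int = 30):
    """Adam with cosine annealing, matching the training details above.
    The learning rate is passed in rather than hard-coded, since it is
    scaled together with the batch size (bs * lr kept constant, [3])."""
    optimizer = Adam(model.parameters(), lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=t_schedule)
    return optimizer, scheduler


# Usage sketch (hypothetical helpers, not from the paper's codebase):
# model = build_sew_resnet18()               # SEW blocks in a ResNet-18 topology
# optimizer, scheduler = make_optimizer(model, lr=...)
# for epoch in range(180):
#     train_one_epoch(model, optimizer)      # forward over T time-steps, backprop
#     scheduler.step()                       # cosine annealing step per epoch
```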
The network structure of our SEW ResNet is the same as the standard ResNet [6] except for using spiking neurons and SEW residual blocks. We also compare SEW ResNet with Spiking ResNet, whose residual blocks are the same as those in ANNs, as shown in Fig. 1 (right).

Fig. 3 shows the training loss per mini-batch and the test accuracy during training. It can be seen that the test accuracy of our SEW ResNet increases with depth, indicating that we can obtain higher performance by simply increasing the network's depth. We also observe the degradation problem in the Spiking ResNet: the deeper network has higher training loss than the shallower network (Fig. 3(a)).

Figure 3: (a) Training loss per mini-batch on ImageNet. The shaded curves show the original data; the solid curves are moving averages over 64 iterations. (b) Test accuracy on ImageNet.
Tab. 2 shows the test accuracy of Spiking ResNet and SEW ResNet.

Network Structure | Spiking ResNet Accuracy (%) | SEW ResNet Accuracy (%)
ResNet-18         | 60.22                       | 60.60
ResNet-34         | 58.38                       | 62.58
ResNet-50         | 59.43                       | 63.55

Table 2: Test accuracy on ImageNet.
In this paper, we analyzed the previous Spiking ResNet, whose residual block mimics ResNet's block, and found that it can hardly implement the identity mapping. We then proposed the SEW residual block to solve this problem. We tested Spiking ResNet and SEW ResNet on ImageNet, and the experimental results showed that our SEW residual block solves the degradation problem and that SEW ResNet can obtain higher accuracy by simply increasing the network's depth.
References

[1] Wei Fang, Yanqi Chen, Jianhao Ding, Ding Chen, Zhaofei Yu, Huihui Zhou, Yonghong Tian, and other contributors. SpikingJelly. https://github.com/fangwei123456/spikingjelly, 2020.
[2] Wei Fang, Zhaofei Yu, Yanqi Chen, Timothée Masquelier, Tiejun Huang, and Yonghong Tian. Incorporating learnable membrane time constant to enhance learning of spiking neural networks, 2020.
[3] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour, 2018.
[4] Bing Han and Kaushik Roy. Deep spiking neural network: Energy efficiency through time based coding. In Proc. IEEE Eur. Conf. Comput. Vis. (ECCV), pages 388–404, 2020.
[5] Bing Han, Gopalakrishnan Srinivasan, and Kaushik Roy. RMP-SNN: Residual membrane potential neuron for enabling deeper high-accuracy and low-latency spiking neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 13558–13567, 2020.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pages 630–645. Springer, 2016.
[8] Yangfan Hu, Huajin Tang, and Gang Pan. Spiking deep residual network, 2020.
[9] Eric Hunsberger and Chris Eliasmith. Spiking deep networks with LIF neurons. arXiv preprint arXiv:1510.08829, 2015.
[10] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[11] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[12] Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks, 2014.
[13] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[14] Chankyu Lee, Syed Shakib Sarwar, Priyadarshini Panda, Gopalakrishnan Srinivasan, and Kaushik Roy. Enabling spike-based backpropagation for training deep neural network architectures. Frontiers in Neuroscience, 14, 2020.
[15] Jun Haeng Lee, Tobi Delbruck, and Michael Pfeiffer. Training deep spiking neural networks using backpropagation. Frontiers in Neuroscience, 10:508, 2016.
[16] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[17] Emre O. Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks. IEEE Signal Processing Magazine, 36:61–63, 2019.
[18] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8026–8037, 2019.
[19] Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda. Towards spike-based machine intelligence with neuromorphic computing. Nature, 575(7784):607–617, 2019.
[20] Bodo Rueckauer, Iulia-Alexandra Lungu, Yuhuang Hu, Michael Pfeiffer, and Shih-Chii Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience, 11:682, 2017.
[21] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[22] Abhronil Sengupta, Yuting Ye, Robert Wang, Chiao Liu, and Kaushik Roy. Going deeper in spiking neural networks: VGG and residual architectures. Frontiers in Neuroscience, 13:95, 2019.
[23] Sumit B. Shrestha and Garrick Orchard. SLAYER: Spike layer error reassignment in time. Advances in Neural Information Processing Systems, 31:1412–1421, 2018.
[24] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[25] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
[26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
[27] Amirhossein Tavanaei, Masoud Ghodrati, Saeed Reza Kheradpisheh, Timothée Masquelier, and Anthony Maida. Deep learning in spiking neural networks. Neural Networks, 111:47–63, 2019.
[28] Yujie Wu, Lei Deng, Guoqi Li, Jun Zhu, and Luping Shi. Spatio-temporal backpropagation for training high-performance spiking neural networks. Frontiers in Neuroscience, 12:331, 2018.
[29] Friedemann Zenke and Tim P. Vogels. The remarkable robustness of surrogate gradient learning for instilling complex function in spiking neural networks.