An Efficient Spiking Neural Network for Recognizing Gestures with a DVS Camera on the Loihi Neuromorphic Processor
Riccardo Massa, Alberto Marchisio, Maurizio Martina, Muhammad Shafique
To appear at the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, Scotland, July 2020.
Riccardo Massa∗, Alberto Marchisio∗, Maurizio Martina, Muhammad Shafique
Technische Universität Wien, Vienna, Austria
Politecnico di Torino, Turin, Italy
Email: [email protected], {alberto.marchisio, muhammad.shafique}@tuwien.ac.at, [email protected]

Abstract—Spiking Neural Networks (SNNs), the third-generation NNs, have come under the spotlight for machine learning based applications due to their biological plausibility and reduced complexity compared to traditional artificial Deep Neural Networks (DNNs). These SNNs can be implemented with extreme energy efficiency on neuromorphic processors like the Intel Loihi research chip, and fed by event-based sensors, such as DVS cameras. However, DNNs with many layers can achieve relatively high accuracy on image classification and recognition tasks, as the research on learning rules for SNNs for real-world applications is still not mature. The accuracy results for SNNs are typically obtained either by converting the trained DNNs into SNNs, or by directly designing and training SNNs in the spiking domain. Towards the conversion from a DNN to an SNN, we perform a comprehensive analysis of such a process, specifically designed for Intel Loihi, showing our methodology for the design of an SNN that achieves nearly the same accuracy results as its corresponding DNN. Towards the usage of event-based sensors, we design a pre-processing method, evaluated for the DvsGesture dataset, which makes it possible to use such data in the DNN domain. Hence, based on the outcome of the first analysis, we train a DNN for the pre-processed DvsGesture dataset, and convert it into the spike domain for its deployment on Intel Loihi, which enables real-time gesture recognition. The results show that our SNN achieves 89.64% classification accuracy and occupies only 37 Loihi cores.
Index Terms—Machine Learning, Spiking Neural Networks, Gesture Recognition, Event-Based Processing, Neuromorphic Processor, Loihi, Accuracy, Conversion, DVS Camera.
I. INTRODUCTION
Recent developments of artificial Deep Neural Networks (DNNs) have pushed forward the state-of-the-art in the field of image recognition [9]. However, the high power demand required by these networks when performing inference tasks on edge devices [16][24] limits the spread of DNNs in scenarios/use-cases where the energy/power consumption is crucial [23][28]. On the other hand, Spiking Neural Networks (SNNs), due to their biologically plausible model, have shown promising results both in terms of power/energy efficiency and real-time classification performance [27]. By leveraging the spike-based communication between neurons, SNNs guarantee a lower computational load, as well as a reduction in the latency. As a side effect, SNNs have also shown a different behavior than DNNs when threatened by adversarial attacks [17]. Along with the development of efficient SNN specialized accelerators (like TrueNorth [18], SpiNNaker [7] and Intel Loihi [5]), another advancement in the field of neuromorphic hardware has come from a new generation of camera, the DVS event-based sensor [15]. Such a device, differently from a classical frame-based camera, works by emulating the behavior of the human retina. Thus, the recorded information is not a series of time-wise separated frames, but a sequence of spikes, which are generated every time a change of light intensity is detected. The event-based behavior of these sensors pairs well with SNNs, i.e., the output of a DVS camera can be used as the input of an SNN to elaborate events in real-time. A promising approach to train SNNs in a supervised learning scenario is to train a DNN with state-of-the-art backpropagation approaches, and then assign the trained parameters (weights and biases) to an equivalent SNN representation by applying a conversion process.

∗ These authors contributed equally to this work.
This approach has shown promising results [22], mostly because it gets the best from the two worlds: the converted SNN behaves entirely like a normal SNN, with its benefits in terms of efficiency and latency. At the same time, the network has been trained with efficient methodologies that ensure good results in classification tasks. However, such a conversion may not always provide the expected results. In fact, many aspects have to be taken into account, like the original DNN structure, the training process, as well as the parameters that control the DNN-to-SNN conversion. This is especially true when the converted SNN has to be deployed on limited-precision hardware like Intel Loihi, which restricts the degrees of freedom of the conversion process. Towards this, in this paper, we present a complete DNN-to-SNN design process (Figure 1A), systematically discussing the effects of the key parameters that are used in the conversion. We evaluate their effect, and extract important general rules that can be successfully applied when developing an SNN for Intel Loihi or similar neuromorphic processors. Once we have an SNN that achieves good accuracy results on both the MNIST [14] and the CIFAR10 [13] datasets, we evaluate it also on the DvsGesture dataset [2], which comprises 11 gestures recorded with a DVS event-based camera (Figure 1B). The main challenge when adopting the DNN-to-SNN conversion approach to get a trained SNN is that we cannot train a DNN on the event-series coming from the DVS camera. For this reason, we first need to collect the events into frames, and then train the DNN on such a converted dataset. Different pre-processing techniques are discussed in this paper, also reporting the accuracy results achieved by the DNN on the generated converted dataset. Finally, after performing the conversion, the SNN is tested on the DvsGesture dataset, and afterwards, it is ready to be deployed for real-time classification on Intel Loihi.
In a nutshell, our Novel Contributions are:
• We perform a comprehensive parameter analysis of the process of converting a DNN into an SNN. (Section III)
• We design a pre-processing method for the DvsGesture dataset through frame-based accumulation, to make such a dataset compatible with the DNN domain. (Section IV)
• We train a given DNN for the pre-processed DvsGesture dataset and convert it to an SNN that can then be deployed on Intel Loihi. (Section V)

Before proceeding to the technical sections, in Section II we present an overview of the SNNs, the Intel Loihi research chip, and of the DNN-to-SNN conversion approach, to a level of detail necessary to understand the contributions of this paper.

Fig. 1: Workflow of our research. [Figure: (A) DNN-to-SNN conversion analysis: DNN training on MNIST and CIFAR10, conversion, and mapping of the SNN model on Loihi. (B) DVS pre-processing and training: frame-based accumulation, DNN training, conversion, and SNN testing on Loihi. Events can be directly sent to Loihi and pre-processed before deployment.]

II. BACKGROUND AND RELATED WORK
A. Spiking Neural Networks
Spiking Neural Networks (SNNs) are based on the biologically plausible models of neurons [20], which communicate asynchronously through series of spikes. The structure and behavior of an SNN are presented in Figure 2. The SNNs' major improvements over traditional DNNs are the following [20]:
• The intrinsic asynchronous, spike-based communication protocol adopted in the network reduces the power/energy consumption required for computation and communication.
• The asynchronous, spike-based design makes these networks ideal to cooperate with event-based sensors. The events provided as an input can be seen as a train of spikes directly processable by the network.
The SNNs' main weakness lies in the fact that the classical supervised learning approach, i.e., the backpropagation, cannot be applied due to the non-differentiability of the SNN loss function [21]. Therefore, two main approaches have been proposed to achieve supervised learning in SNNs:
• Use the backpropagation algorithm directly in the spiking domain. This method generally requires substituting the loss function with a placeholder function that can be differentiated [19][25].
• Train an equivalent DNN model and then convert it to an SNN in the spiking domain.
In this article, we focus on the latter approach. Training the network in the DNN domain allows us to use the current state-of-the-art training policies and techniques. Moreover, the DNN-to-SNN conversion technique has shown promising results, producing SNNs that reach the same, or very close, levels of accuracy compared to their corresponding DNN versions [21][22]. However, some precautions and limitations have to be considered when using this approach, as we will explain in our analysis in Section III.
B. Intel Loihi Neuromorphic Research Chip
DNNs achieve the best results in terms of accuracy and efficiency when executed on highly parallel hardware like GPUs, and even more with specialized hardware accelerators, like Google TPU [10] and MPNA [8]. Similarly, SNNs require their specialized hardware to achieve the best results in terms of power/energy efficiency and latency [3]. Neuromorphic chips represent an efficient hardware solution when it comes to the implementation of SNNs. Unlike the artificial neuron model and synchronous structure of traditional DNNs, the highly parallel asynchronous structure, combined with the hardware implementation of a biologically plausible neuron model, such as the leaky-integrate-and-fire (LIF) model [26], allows SNNs to achieve far better results both in latency and power/energy efficiency when compared to their CPU and GPU implementations. Recent developments in the field of neuromorphic hardware have brought valid and powerful solutions for the deployment of SNN models, like IBM TrueNorth [18], SpiNNaker [7] and Intel Loihi [5]. In this paper, we focus on the Intel Loihi [5], which is a neuromorphic processor providing highly parallel and energy-efficient asynchronous computation. The chip comprises a neuromorphic mesh of 128 neurocores and 3 x86 processors, as well as an asynchronous network-on-chip (NoC) that connects the neurocores, allowing neuron-to-neuron communication. Each neurocore implements up to 1024 spiking neural compartment units, such that the compartments can be combined in a tree structure to form multi-compartment neurons. Neuron variables are updated at every algorithmic time-step. The spikes generated by a neuron are delivered to all the compartments belonging to its synaptic fan-out through the NoC. The NoC delivers spikes between different neurocores in a packet-messaged form, following a mesh operation that is executed over a series of algorithmic time-steps. In the absence of a global clock, a barrier synchronization mechanism is used to ensure that at the end of each time-step all neurons are synchronized. An off-chip communication interface allows extending the mesh up to 4096 on-chip cores, and up to 16,384 hierarchically connected cores. The architecture of a single Loihi chip is displayed in Figure 3. The biologically-plausible neuron model adopted by the Loihi architecture is based on a modified version of the CUBA leaky-integrate-and-fire model [5]. More specifically, each neuron is represented as a dendritic compartment, which receives the incoming spikes from the pre-synaptic neurons. Each neuron is characterized by its compartment current u_c(t) and its compartment membrane potential v_c(t) [5]. Given a postsynaptic neuron n_i, it receives an input train of spikes from a presynaptic neuron n_j that can be represented as a train

Fig. 2:
Structure and main steps of an SNN for an image classification task. In this example, a fully-connected SNN, with neurons represented as circles and synapses as lines, is shown. (A) The input image is not converted with a spike-encoding algorithm; instead, the pixel intensities set the bias currents of the neurons in the input layer. (B) The spikes from the pre-synaptic neuron travel across the synapse and accumulate in the dendritic tree of the post-synaptic neurons. The membrane current of the post-synaptic neuron integrates the incoming weighted spike trains. (C) The neuron membrane potential integrates the bias current and the membrane current. An output spike is generated each time the potential reaches a predefined threshold. Afterwards, the membrane potential is set back to the initial level. (D) The output neurons, one for each class, generate spike trains. For each neuron, its corresponding spikerate in a predefined time-window is computed, which is then used as the output prediction for its class.
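The neuron dynamics sketched in panels (B) and (C) of Figure 2 can be illustrated with a minimal discrete-time CUBA leaky-integrate-and-fire simulation. This is a plain-Python sketch for intuition only: the parameter values, the exponential-decay update rule, and the function name `lif_neuron` are illustrative assumptions, not Loihi's actual constants or implementation.

```python
# Minimal sketch of discrete-time CUBA LIF dynamics: bias current, synaptic
# current integration, membrane potential, threshold crossing, hard reset.
# All parameter values are illustrative, not Loihi's.

def lif_neuron(weighted_spikes, bias=0.0, v_th=1.0, tau_u=4.0, tau_v=16.0):
    """weighted_spikes[t] is the summed weighted synaptic input at step t.
    Returns the list of time-steps at which the neuron fired."""
    u, v, spike_times = 0.0, 0.0, []
    for t, s in enumerate(weighted_spikes):
        u = u * (1.0 - 1.0 / tau_u) + s + bias   # synaptic response current u_c(t)
        v = v * (1.0 - 1.0 / tau_v) + u          # membrane potential v_c(t)
        if v >= v_th:                            # threshold crossing -> spike
            spike_times.append(t)
            v = 0.0                              # hard reset
    return spike_times

# A constant positive bias alone drives periodic spiking, as in panel (A),
# where pixel intensities set the bias currents of the input-layer neurons.
spikes = lif_neuron([0.0] * 100, bias=0.05)
```

The spike times returned here correspond to the output spike trains of panel (D), whose rates within a time window give the class prediction.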
Fig. 3: Loihi single chip architecture [5]. [Figure: the mesh of neuromorphic cores, three x86 (LMT) processors, the NoC, parallel I/O interfaces, and the off-chip interface.]

of Dirac delta functions: σ_j(t) = Σ_k δ(t − t_k), where t_k is the spike time. The train of spikes is first filtered by a synaptic filter input response α_u(t) and then multiplied by the synaptic weight w_ij associated to the synapse that connects neurons n_i and n_j. The synaptic response current can then be computed as the sum of all the weighted and filtered spike trains, with an additional bias current u_i^bias:

u_i^c(t) = Σ_j w_ij (α_u ∗ σ_j)(t) + u_i^bias

Finally, the synaptic current is integrated by the membrane potential v_i^c(t):

v̇_i^c(t) = −(1/τ_v) v_i^c(t) + u_i^c(t) − V_i^th σ_i(t)

When the membrane potential reaches a threshold V_th, the neuron spikes. Then, the membrane potential is reset to a V_rest value and starts increasing again as new input spikes are received. The time constant τ_v is responsible for the leaky behavior of the model [5].

C. DNN-to-SNN Conversion
The DNN-to-SNN conversion approach has shown promising results in terms of accuracy consistency between the original DNN and the converted SNN [22]. To reach such results, the trained parameters of the DNN must be efficiently converted into the corresponding parameters of the SNN. This also requires taking into consideration the intrinsic differences between the two models, and some adjustments are consequently required to get a correct conversion. During the training, for each connection between two neurons of the consecutive layers i and i+1, the weight w_{i,i+1} is learned. Moreover, for each neuron of the layer i+1, also the bias b_{i+1} is derived. In the equivalent SNN model, these parameters need to be translated into an equivalent value for the spiking neural model. Specifically referring to the Loihi model, the conversion works as follows:
• the bias b_{i+1} is associated to the bias current u^bias of the neuron n_{i+1};
• w_{i,i+1} is directly set as the weight of the synapse connecting neurons n_i and n_{i+1}.
Besides the learned parameters, each layer of the DNN has to be converted to an equivalent spiking version. This means that each layer will be composed of equivalent spiking neurons that follow the neuron model adopted by the Loihi architecture. To apply the DNN-to-SNN conversion, we use the SNN ToolBox (SNN-TB) [22], an open-source conversion tool that is compatible with Loihi's Python NxSDK-0.9.5. The results obtained with the conversion process may not always be optimal, due to several limitations of the NxSDK API and specific constraints of the Loihi neurocores. Therefore, in the following Section III, we present a case study for the DNN-to-SNN conversion, specifying a set of general guidelines to follow for achieving a converted SNN that reaches the same accuracy levels as the corresponding DNN.

III. A COMPREHENSIVE ANALYSIS ON THE DNN-TO-SNN CONVERSION SETUP
A. Evaluation Metrics for the Conversion Quality
The conversion process requires a series of preliminary considerations to be successful. First of all, the Loihi architecture uses limited-precision synaptic weights, defined within the interval [-256, 255]. On the other hand, the trained DNN uses full-precision weights. Therefore, a preliminary quantization of the DNN-trained weights is crucial to get a precise converted SNN. In this quantization step, the distribution of the input weights plays a major role in the outcome of the conversion. That is, the input weights have to be clipped into the Loihi quantized range; therefore, a tight weight distribution can be mapped to the quantized interval without relevant errors. On the other hand, the presence of outliers in the original weight distribution can be the main source of an imprecise conversion. This is due to the fact that very high weights are clipped to fit into the quantized interval, leading to possible inconsistencies between the pre- and post-quantization weight distributions. To reduce strong outliers in the final trained weights, L2 regularization, applied both on activations and kernels during the training, helps to keep the weights within a limited range. A good practice to evaluate the quality of the conversion is to look at the correlation plots between the DNN layer activations and the corresponding SNN layer output spikerates. Figure 4 shows three typical correlation plots that can be obtained with good and bad conversion processes.
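The effect of outliers on the quantization step can be sketched in a few lines. The max-abs scaling rule below is an illustrative assumption, not SNN-TB's actual normalization; it only shows why a tight weight distribution quantizes cleanly while an outlier forces a trade-off between clipping and precision loss.

```python
# Illustrative sketch: full-precision DNN weights are scaled, rounded, and
# clipped into Loihi's integer weight range [-256, 255]. The scaling rule
# is an assumption for illustration, not the toolbox's exact scheme.

def quantize(weights, scale, lo=-256, hi=255):
    """Scale, round, and clip a list of float weights to Loihi's range."""
    return [max(lo, min(hi, round(w * scale))) for w in weights]

bulk = [0.10, -0.20, 0.30, 30.0]            # 30.0 is an outlier
# Scaling so the outlier fits crushes the bulk toward zero (precision loss):
by_max = quantize(bulk, scale=255 / 30.0)   # -> [1, -2, 3, 255]
# Scaling for the bulk clips the outlier (pre/post distribution mismatch):
by_bulk = quantize(bulk, scale=255 / 0.30)  # -> [85, -170, 255, 255]
```

Either choice distorts the weight distribution, which is why L2 regularization during training, by suppressing such outliers, improves the conversion.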
Fig. 4: Examples of correlation plots between the DNN activations and their converted SNN spikerates. [Figure: three panels (A)-(C), each plotting DNN activations (x-axis) against SNN spikerates (y-axis).]
The plot in Figure 4(A) is an example of a good correlation plot, where the DNN activations are properly converted into SNN spikerates, with all the points distributed along the main diagonal. On the contrary, the plot in Figure 4(B) shows a relatively worse conversion, where the DNN activations and the SNN spikerates are still distributed along the diagonal, but the distribution of points is not confined to the desired range. The plot in Figure 4(C) is another example of a bad conversion; however, in this case, the activations and the spikerates are totally uncorrelated.
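What the correlation plots show visually can also be quantified, for instance with the Pearson correlation between a layer's DNN activations and its SNN spikerates: values near 1 correspond to Figure 4(A), values near 0 to Figure 4(C). This plain-Python sketch is our own illustration, not a metric computed by the toolbox.

```python
# Pearson correlation as a scalar summary of a DNN-activation vs.
# SNN-spikerate correlation plot. Pure-Python sketch, no toolbox API.
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

activations = [0.0, 0.2, 0.4, 0.6, 0.8]
good_rates = [0.01, 0.21, 0.39, 0.62, 0.79]   # spikerates tracking activations
assert pearson(activations, good_rates) > 0.99
```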
B. Tunable Conversion Parameters
Many parameters can be tuned during the DNN-to-SNN conversion process, and a detailed analysis of their effects on the converted SNN is necessary. These parameters modify the spiking neuron model, the characteristics of the network, and the experiment duration.
• Reset mode: The reset mode defines the behavior of the neuron after a spike. As previously said, the neuron spikes every time its membrane potential exceeds the threshold V_th. After the spike, the membrane potential is reset to a value that depends on the chosen reset mode:
  – Hard Reset: The membrane potential is reset to 0 after a neuron spikes. This solution is less computationally expensive, but relatively less accurate.
  – Soft Reset: The membrane potential is reset to the difference between the highest value reached by the membrane potential and the membrane threshold. This solution is relatively more accurate, but also more expensive when compared to the hard reset, because the number of compartments needed to simulate each neuron is doubled.
• Desired Threshold to Input Ratio (DThIR): As described in Section II, the weights of the input DNN model have to be converted to synaptic weights of the SNN. Because of the limited dynamic range of spiking neurons, the output of a spiking neuron may saturate due to an excessively high input, given by some out-of-scale synaptic weights. Hence, it is necessary to normalize the network and set a constant ratio between the incoming neuron inputs and the membrane threshold [6].
• Experiment duration: This parameter defines the number of time-steps for which the network receives the same image as an input, i.e., the inference time. A longer duration gives the network more time to output its prediction, but it increases the latency of the system.
The development of the DNN architecture has to be realized using the Python Keras API [4], which is one of the APIs supported by Intel NxSDK.
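The difference between the two reset modes listed above can be sketched in two lines each. This is an illustrative sketch, not Loihi's implementation; the point is that the soft reset preserves the overshoot above the threshold, which is the residual information the hard reset discards.

```python
# Sketch of the two reset modes applied at a threshold crossing.
# Illustrative values, not Loihi's implementation.

V_TH = 1.0

def hard_reset(v):
    """Potential goes back to 0: any overshoot above V_TH is discarded."""
    return 0.0

def soft_reset(v):
    """Only V_TH is subtracted: the overshoot (v - V_TH) is preserved,
    reducing information loss at the cost of an extra compartment."""
    return v - V_TH

v = 1.3                       # potential after integrating the last inputs
assert hard_reset(v) == 0.0   # overshoot of 0.3 is lost
assert abs(soft_reset(v) - 0.3) < 1e-9  # overshoot carried to the next step
```

This carried-over residual is why the soft reset yields better activation-to-spikerate correlation in the experiments of Section III.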
Currently, not all the Keras layers are supported by Loihi's Python NxSDK. The only supported layers are the ones listed in Table I. This limitation has to be taken into consideration during the development of a DNN architecture.
TABLE I: Layers supported by NxSDK.

Dense | Flatten | Reshape | Padding
AvgPooling2D | DepthwiseConv2D | Conv1D | Conv2D
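A simple pre-conversion sanity check one might run is to verify that every layer type in a candidate DNN appears in Table I before attempting the conversion. The check below is our own sketch (the model description as a list of layer class names is hypothetical), not part of the toolbox.

```python
# Sanity check against Table I: flag layer types not supported by NxSDK.
SUPPORTED = {"Dense", "Flatten", "Reshape", "Padding",
             "AvgPooling2D", "DepthwiseConv2D", "Conv1D", "Conv2D"}

def unsupported_layers(layer_types):
    """Return the layer types (e.g. Keras class names) missing from Table I."""
    return [t for t in layer_types if t not in SUPPORTED]

# MaxPooling2D is not in Table I (SNN-TB parses it into AveragePooling2D):
assert unsupported_layers(["Conv2D", "MaxPooling2D", "Flatten", "Dense"]) \
    == ["MaxPooling2D"]
```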
C. DNN Training
To study the behavior of the conversion, a small network has been used for evaluating the process. Such a network, which we will refer to as cNet, is a convolutional neural network that contains only convolutional layers and a final dense layer. Its structure is reported in Table II.
TABLE II: cNet architecture for the MNIST dataset.

Layer   | Features | Kernel | Stride | Output Shape | Activation
Input   | 1        |        |        | 28x28x1      | ReLU
Conv2D  | 16       | 4x4    | 2      | 13x13x16     | ReLU
Conv2D  | 32       | 3x3    | 1      | 11x11x32     | ReLU
Conv2D  | 64       | 3x3    | 2      | 5x5x64       | ReLU
Conv2D  | 10       | 4x4    | 1      | 2x2x10       | ReLU
Flatten |          |        |        | 40           |
Dense   | 10       |        |        | 10           | SoftMax
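The output shapes in Table II follow from the standard valid-padding convolution arithmetic, out = floor((in − kernel) / stride) + 1. The short sketch below re-derives the cNet feature-map sizes as a consistency check on the table.

```python
# Re-derive Table II's output shapes from the convolution shape formula.
def conv_out(size, kernel, stride):
    return (size - kernel) // stride + 1

side = 28                                        # MNIST input: 28x28x1
side = conv_out(side, 4, 2); assert side == 13   # Conv2D 16, 4x4, stride 2
side = conv_out(side, 3, 1); assert side == 11   # Conv2D 32, 3x3, stride 1
side = conv_out(side, 3, 2); assert side == 5    # Conv2D 64, 3x3, stride 2
side = conv_out(side, 4, 1); assert side == 2    # Conv2D 10, 4x4, stride 1
assert side * side * 10 == 40                    # Flatten: 2x2x10 = 40 units
```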
To achieve a better conversion process, both activation and weight L2 regularization are applied on the network layers. In both cases, the value is set to · − . The use of regularization during training is preferable for preventing the divergence of the parameter distribution and for avoiding the information loss due to the quantization process of the parameters, as discussed in Section III-A. The datasets on which the analyses have been performed are MNIST [14] and CIFAR10 [13]. For each input image, the intensity values are normalized between 0 and 1. Both networks are developed in Keras, using TensorFlow [1] as the backend. The training is performed with the following policies:
• Learning rate decay: initially set to 0.001, the learning rate is halved after 15 consecutive epochs without validation accuracy improvements, until it reaches a final value of · − .
• Adam optimizer [12].
• Small data augmentations, with width and height shifts of 0.1, and rotations of 10 degrees.
After training, the values of test accuracy achieved by the networks are reported in Table III.
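The learning-rate decay policy above can be sketched as a plateau-based schedule (in Keras this corresponds to `ReduceLROnPlateau(factor=0.5, patience=15)`). The function below is our own illustrative model of that policy, with a hypothetical validation-accuracy history; it is not the training code used in the experiments.

```python
# Sketch of the decay policy: halve the learning rate after 15 consecutive
# epochs without validation-accuracy improvement. Illustrative only.

def decayed_lr(val_acc_history, lr=1e-3, patience=15, factor=0.5):
    best, stale = float("-inf"), 0
    for acc in val_acc_history:
        if acc > best:
            best, stale = acc, 0     # improvement resets the patience counter
        else:
            stale += 1
            if stale == patience:    # 15 stale epochs -> halve the rate
                lr *= factor
                stale = 0
    return lr

# 15 flat epochs after one improvement -> a single halving: 0.001 -> 0.0005
history = [0.90] + [0.89] * 15
assert decayed_lr(history) == 0.0005
```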
TABLE III: Accuracy results of the DNN models.

Network | Dataset | Accuracy
cNet    | MNIST   | 98.79%
cNet    | CIFAR10 | 78.92%

TABLE IV: Constraints of the Loihi neurocores.

Neurocore constraint | Value
max compartments     | 1024
max fan-in axons     | 4096
max fan-out axons    | 4096
D. Conversion Process
The trained DNN model is then converted into its equivalent spiking model via the SNN-TB tool. The conversion requires four main steps:
• Parsing: The toolbox extracts the relevant information from the original model, discarding layers that are not used in the inference stage (Dropout, BatchNormalization, etc.) and converting any MaxPooling2D layers that may be present into AveragePooling2D layers, which are supported. The parsed model is used as the reference for the following conversion.
• Conversion: An NxSDK-compatible spiking model is obtained, applying a normalization process that adapts the weights and biases to the limited dynamic range of the spiking neurons, satisfying the selected value of DThIR.
• Partition: The conversion process requires finding a valid partition of the neural network on the Loihi chip. Some constraints have to be respected in order to have a valid partition. These constraints, reported in Table IV, are related to the synaptic fan-in and fan-out of each neurocore, and to the maximum number of neurons that can be mapped onto a single neurocore.
• Mapping: The partition is mapped onto the Loihi chip, and the model is now ready to be used in the SNN deployment.
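The partition constraints of Table IV can be turned into a back-of-the-envelope feasibility check. The toolbox computes the actual partition; the sketch below (function name and per-neuron fan-in/fan-out parameters are our own assumptions) only estimates the minimum number of neurocores a layer needs.

```python
# Lower bound on neurocores per layer, from the Table IV per-core limits:
# at most 1024 compartments, 4096 fan-in axons, 4096 fan-out axons.
import math

MAX_COMPARTMENTS, MAX_FAN_IN, MAX_FAN_OUT = 1024, 4096, 4096

def min_cores(num_neurons, fan_in_per_neuron=1, fan_out_per_neuron=1):
    by_compartments = math.ceil(num_neurons / MAX_COMPARTMENTS)
    by_fan_in = math.ceil(num_neurons * fan_in_per_neuron / MAX_FAN_IN)
    by_fan_out = math.ceil(num_neurons * fan_out_per_neuron / MAX_FAN_OUT)
    return max(by_compartments, by_fan_in, by_fan_out)

# cNet's widest layer (11x11x32 = 3872 neurons) needs at least 4 cores
# from the compartment limit alone:
assert min_cores(11 * 11 * 32) == 4
```

Wide layers blow up these bounds quickly, which is consistent with the observation in Section III-H that overly wide DNN layers make the partition infeasible.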
E. Experimental Setup
The tool flow is depicted in Figure 5. All the experiments are executed on the Intel Neuromorphic Research Cloud (NRC) server, using one of the available Loihi partitions. The reported experiments are executed on the Nahuku32 board, which comprises 32 Loihi chips. As described in Section III-B, the three main parameters that have been analyzed for fine-tuning the conversion are the reset mode, the DThIR, and the experiment duration. Different experiments have been performed to evaluate the effects of these parameters on the final SNN accuracy.
Fig. 5: Tool flow of our experimental process. [Figure: in the DNN domain (Keras), the network model is defined and trained; the SNN-ToolBox sets the conversion parameters, parses and converts the model, and finds a valid partition and maps it; in the SNN domain (Loihi), the model is run and the results are evaluated.]
F. Results Varying the DThIR
In this experiment, we evaluate the conversion results varying the DThIR. The experiment duration is set to 256 time-steps, which is a reasonable choice for both the soft and the hard reset, as we will discuss later. The tested DThIR levels are , and . Selecting higher levels is usually not a good solution, because the membrane potential threshold may get too large. The results are reported in Figure 6(a).

Fig. 6: The legend is common for all the plots. cNet results on MNIST and CIFAR10, varying (a) the DThIR and (b) the experiment duration. [Figure: (A) DThIR analysis and (B) experiment duration analysis, for soft and hard reset, on MNIST and CIFAR10, over 32, 64, 128, 256, 512, and 1024 time-steps.]
Analysis for MNIST: In both the soft reset and the hard reset cases, the SNN accuracy is equal to the DNN accuracy value for DThIR = and . However, when the parameter is increased to , the accuracy drops in both the soft and hard reset cases.

Analysis for CIFAR10: Also in this case, the highest accuracy is reached for DThIR = , for both the hard and the soft reset. However, the accuracy starts reducing when the DThIR is set to , and reaches a minimum when the DThIR is increased to .

As a consequence of these results, a value of DThIR = is chosen for the following further analyses.

G. Results Varying the Duration and Reset Mode
This analysis aims to find a good compromise between experiment duration and reset mode. By choosing a longer duration, we expect to get more precise results, at the cost of output latency. Moreover, the soft reset is expected to provide higher accuracy. The results are reported in Figure 6(b).
Analysis for MNIST:
Looking at the results achieved on the MNIST dataset, a test accuracy of . (i.e., only . lower than the one obtained with the DNN model) is reached in the soft reset case, when the experiment duration is longer than 64 time-steps. On the other hand, it takes at least 128 time-steps for the hard reset case to reach the same level of accuracy. Moreover, the accuracy reached by both the soft and the hard reset remains stable also for longer experiment durations.

Analysis for CIFAR10:
The results for the CIFAR10 dataset clearly show that, in the hard reset case, the DNN accuracy of . is never reached. The maximum accuracy is . when the experiment gets longer than 256 time-steps. On the other hand, the soft reset shows better results than the hard reset, despite not achieving the same results as the corresponding DNN. An accuracy of . is reached with 256 time-steps, slowly growing to . with a longer experiment of 1024 time-steps. For an experiment duration of 256 time-steps, the average time for executing a single inference step of image classification and the Loihi chip usage are reported in Table V. Looking at the number of occupied neurocores, for both the MNIST and CIFAR10 cases, the soft reset makes use of more cores.

TABLE V: Classification time and neurocore usage of the converted SNNs (experiment duration of 256 time-steps).

Reset Mode | Dataset | Classification time | Neurocores
soft       | MNIST   | 8.312 ms            | 27
hard       | MNIST   | 6.464 ms            | 20
soft       | CIFAR10 | 21.371 ms           | 37
hard       | CIFAR10 | 26.159 ms           | 29
To better understand why the soft reset achieves better results than the hard reset conversion, we compare the correlation plots of the converted layers. Figure 7 shows the correlation plots of the first 4 layers, both for the soft reset and the hard reset versions, and on both datasets. In each of the 4 presented cases, an experiment duration of 256 time-steps is applied, as well as a DThIR equal to . At first glance, it is immediately clear that the correlation plots of the soft reset conversion are far more compliant with the expected behavior when compared to the hard reset results, both for the MNIST and CIFAR10 datasets. Looking at the MNIST - soft reset experiment, the correlation plot of the first layer shows a perfect conglomeration of activations (x-axis) vs. spikerates (y-axis) along the main diagonal. This means that the conversion of the layer is working as desired, with all the SNN neurons spiking at a rate equivalent to their corresponding DNN activations. The same principle applies to the following layers. Looking at the MNIST - hard reset experiment, the correlation plots show a relatively worse conversion result. Starting from the first layer, the points are distributed with an overlapped-staircase behavior. The same happens in the second layer, where a dilation of the agglomerate of points along the x-axis is also present. However, in both the 3rd and 4th layer correlation plots, the points are sufficiently compacted along the diagonal, and in fact the final accuracy achieved by this SNN is similar to the DNN accuracy. Regarding the CIFAR10 analysis, the soft reset gives good correlation plots, even if the points form a thicker agglomerate w.r.t. the MNIST case. On the other hand, the hard reset gives worse results. The correlation between activations and spikerates is relatively less evident, with a general behavior that follows the one of the MNIST case, but more emphasized. The analyses reported in these plots justify the 10% accuracy drop obtained using the hard reset conversion, as seen in Figure 6. Overall, the results obtained for the CIFAR10 dataset are worse than the ones obtained on MNIST, both for the soft and the hard reset. This can be attributed to the higher complexity of CIFAR10, which represents a challenging dataset to work with.

H. Results Discussion
Overall, the use of the soft reset mode gives higher accuracy results, because of the lower information loss that occurs during the conversion, as clearly shown by the correlation plots in Figure 7. A good choice for the experiment duration seems to be ≥ time-steps. A shorter experiment may lead to an accuracy loss, as shown for the CIFAR10 dataset. On the other hand, using more than 512 time-steps does not lead to a higher level of accuracy, as shown in both the MNIST and CIFAR10 analyses. Finally, a DThIR value equal to seems to be the best choice to reduce the loss during the conversion. Furthermore, the conversion results are also strongly influenced by the DNN architecture, as well as by the DNN training policies. For a deeper evaluation of the conversion process, several other DNN models have been trained and converted. These models vary in terms of size, number of layers, and layer characteristics. The conversion process has not always shown successful results, even when applying the soft reset. The problems generally arise when the DNN layers are too wide, making the conversion infeasible because the neurocore constraints are violated. Therefore, when it comes to building very large SNNs, it is suggested to use depthwise separable convolutional layers, which require less core occupation than the traditional ones. However, a comprehensive analysis concerning these DNN characteristics is not easily practicable, and it is considered beyond the scope of this article.

IV. PRE-PROCESSING METHODS FOR THE DVSGESTURE DATASET
The IBM DvsGesture dataset [2] is a fully event-based gesture recognition dataset. Each gesture is recorded with a DVS128 camera, providing a total of 1342 samples divided into 122 trials. In each trial, 1 subject executes the 11 different gestures in sequence. A total of 29 subjects under 3 different light conditions form the whole dataset. Each gesture has an average duration of 6 seconds, and is composed of a collection of all the events (positive and negative) that have been recorded by the DVS camera. A positive (or negative) event is recorded every time a positive (or negative) variation of light is detected. Event-based data are ideal when used as an input to SNNs, thanks to their intrinsic asynchronous and spiking behavior. However, in the context of our research, we are training a network in the DNN domain, and only at a second stage do we convert it into the SNN domain. This forces us to find an alternative representation of the input data, since a DNN is not trainable on pure sequences of events. A valid solution is to train the DNN with a series of frames obtained by collecting the incoming events. However, some choices have to be made to achieve a good conversion into frames, that is:
• Choose the amount of events to collect into a single frame.
• Select the size of the frame and its number of channels.
• Set a policy for positive and negative events accumulation.
A. Events Accumulation
As reported in [21], there are two accumulation approaches:
• Time-based accumulation: all events that occur in a fixed time window are accumulated in a single frame.
Fig. 7:
Correlation plots for the first 4 layers of cNet after its conversion to the corresponding SNN model. The first column shows the results on the MNIST dataset, whereas the second column presents the results for the CIFAR10 dataset.

• Quantitative-based accumulation: a fixed number of consecutive events are accumulated in a single frame.
The former solution ensures that the timing information within frames is respected. On the other hand, the latter solution guarantees that each frame has the same amount of information. However, this may not be a good choice for gesture recordings. In fact, the number of events generated in a fixed time window also depends on the type of the gesture itself: not all gestures generate the same amount of events per second. Therefore, using a quantitative approach, the number of generated frames depends on the number of events produced by the gesture, and gestures with the same time length may lead to a different amount of frames.
As a consequence, the final dataset would be imbalanced, with a different number of frames per class, both in the train and test sets. To balance the dataset, one could reduce the amount of frames per gesture to a number that is equal for all classes, but this would result in a drastic reduction of the information used from the original event-based recordings. Hence, based on these considerations, the time-based accumulation is preferable, because it guarantees a balanced dataset. Therefore, the results relative to the quantitative-based accumulation are not discussed in the following section.
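The practical difference between the two policies can be made concrete in a few lines (illustrative code with hypothetical names, not the implementation used in this work): for two gestures of equal duration but different event rates, the time-based split yields the same number of frames, while the quantitative split does not.

```python
import numpy as np

def split_time_based(t_ms, window_ms):
    """Frame index for each event: one frame per fixed time window."""
    return (np.asarray(t_ms) // window_ms).astype(int)

def split_quantitative(t_ms, events_per_frame):
    """Frame index for each event: a fixed number of events per frame."""
    t_ms = np.asarray(t_ms)
    order = np.argsort(t_ms, kind="stable")   # events in temporal order
    idx = np.empty(len(t_ms), dtype=int)
    idx[order] = np.arange(len(t_ms)) // events_per_frame
    return idx

# Two gestures with identical 600 ms duration but different event rates.
slow = np.arange(0, 600, 20)   # 30 events
fast = np.arange(0, 600, 5)    # 120 events

# Time-based (150 ms windows): both gestures yield frames 0..3.
assert split_time_based(slow, 150).max() == 3
assert split_time_based(fast, 150).max() == 3

# Quantitative (30 events/frame): 1 frame vs. 4 frames -> class imbalance.
assert split_quantitative(slow, 30).max() == 0
assert split_quantitative(fast, 30).max() == 3
```

The assertions at the bottom reproduce the imbalance argument above: the low-rate gesture collapses into a single frame under the quantitative policy, even though both gestures last 600 ms.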
B. Time Window Size
The amount of events per second varies not only from gesture to gesture, but also between different trials of the same gesture. A mean rate of 98 events/ms is estimated by evaluating the original dataset over all the available gestures of all the different trials. This information is a relevant starting point for choosing the time window size that each frame has to cover. In this research, time windows of 60ms, 150ms, 235ms and 300ms are explored. Choosing a time window shorter than 60ms would result in an insufficient amount of events collected per frame, thus preventing a proper classification. On the other hand, an accumulation time longer than 300ms would lead to a total of less than 3 frames per second, which we consider the minimum for a real-world application.
A single frame may also have more than one channel, each of them covering a subset of the complete time window. For example, a frame covering a window of 300ms can have 3 channels, where each channel covers a sub-window of 100ms. This solution produces frames in which the temporal information is preserved, since the channels cover consecutive time sections.
Moreover, another solution is to use overlapped frames, i.e., the time windows covered by two consecutive frames are partially overlapped. For example, using an overlap factor of 2 with frames of 300ms, the first frame covers the range [0 ms; 300 ms] and the following frame covers the range [150 ms; 450 ms]. There are several advantages in choosing this solution:
• The number of frames generated from the original dataset is multiplied by the overlap factor, leading to a bigger dataset that guarantees better training results.
• The frames can cover different time windows, augmenting the temporal information in the dataset.
• The system's throughput is multiplied by the overlap factor.
In our experiments, an overlap factor of 2 has been chosen. Using an overlap factor n > 2 would lead to generating redundant overlapped frames. On the other hand, a value n < 2 would reduce the benefits of having overlapped frames.

C. Events Polarity
Each event carries the x and y position of the detected event, as well as its polarity, which can be either positive or negative.
• The first possibility is to accumulate the positive and negative events in two different channels of the frame, c+ and c−. Both channels' pixels are initialized to 0; when a positive event is detected, the pixel (x, y, c+) is incremented by 1, whereas a negative event increments the pixel (x, y, c−) by 1. Finally, the pixel intensities are normalized in the range [0; 255]. Since the accumulation of opposite-signed events forms a trace of the gesture motion over time, this solution prevents information loss, because the polarity information becomes relevant when gestures differ only w.r.t. their sense of rotation.
• The second solution (inspired by the work of [21]) is to accumulate all negative and positive events on the same channel, keeping the polarity information. All the pixels are initialized to a mean value of 128, and are incremented or decremented by 1, depending on the polarity of the event.
• The third possibility (also inspired by the work of [21]) is to discard the polarity information and collect all the events in a single channel, by simply incrementing the pixel (x, y) every time either a positive or a negative event occurs.
The above-described three solutions have been tested on the DNN, and based on the accuracy achieved, the following considerations can be made. Overall, the best solution has proved to be the third one, in which the polarity is discarded. The 2-channel accumulation has not shown particular improvements in the final accuracy, compared to the case in which the polarity is discarded. At the same time, having two channels that separately store the polarity comes with a series of drawbacks, such as an increase in the size of the dataset as well as in the dimension of the DNN. Moreover, the number of neurocores occupied by the converted SNN is higher than with a single channel, which also impacts the latency of the system. For these reasons, the 2-channel policy can be discarded. The 1-channel polarity accumulation has shown an accuracy drop w.r.t. the discarded-polarity case: this solution leads to frames with generally high pixel intensities, since all pixels are initialized to a non-zero value, thereby leading to lower classification results. For these reasons, in Table VI, only the results achieved without signed polarity accumulation are reported.

D. Frame Size
Lastly, the dimension of the frame has to be chosen. The original recordings have a dimension of 128x128. However, such a dimension may be too large as an input to our converted SNN, leading to a high number of neurocores required to deploy the SNN on Loihi, as well as increasing the latency of the prediction. Therefore, we resized the image to a dimension of 32x32, by applying a preliminary Average Pooling step. This process is also useful to remove the noisy events from the original recordings, thereby producing input frames that contain only the relevant gesture information. A 64x64 size has also been evaluated, but the accuracy results obtained by the DNN did not show any improvement over the 32x32 size. On the other hand, a size of 16x16 would be too small for achieving a good recognition by the DNN.
Another solution, proposed by [11] for the same dataset, is to collect only the events that are inside a 64x64 attention window, which moves and keeps track of the incoming gestures. Then, the Average Pooling is applied on the 64x64 frame, reducing its size to 32x32. This solution has been evaluated, but the accuracy results were lower than those achieved with the whole image frame. The reason for such an accuracy drop may be that, by shrinking the input window to the area where the actual gesture takes place, the gesture itself is taken out of its context. In this way, the DNN cannot distinguish between equivalent gestures executed with opposite arms.

E. Dataset Structure
In all the above-discussed pre-processing approaches, the frames are associated with their corresponding labels, and accumulated into a train set and a test set. The dimension of the dataset depends on the chosen pre-processing approach: fewer frames are generated with longer time windows, whereas the amount of frames increases as the time window covered by each frame gets shorter. The pre-processing stages are summarized in Figure 8.
Fig. 8:
DvsGesture pre-processing: the number of frame channels may depend on the chosen polarity policy or, in a time-based accumulation, on the time length of each channel.
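The preliminary Average Pooling step described in Section IV-D (128x128 down to 32x32) can be sketched as a mean over non-overlapping 4x4 blocks. This is a minimal NumPy illustration, not the paper's implementation:

```python
import numpy as np

def avg_pool(frame, k=4):
    """Downscale an (H, W) frame by k x k average pooling
    (128x128 -> 32x32 for k=4)."""
    h, w = frame.shape
    assert h % k == 0 and w % k == 0
    # Group pixels into k x k blocks, then average within each block.
    return frame.reshape(h // k, k, w // k, k).mean(axis=(1, 3))

frame = np.zeros((128, 128), dtype=np.float32)
frame[0, 0] = 16.0                 # a single isolated (noisy) event
pooled = avg_pool(frame)
assert pooled.shape == (32, 32)
assert pooled[0, 0] == 1.0         # isolated spikes are attenuated by k*k
```

Besides shrinking the input (and therefore the neurocore footprint of the converted SNN), the averaging attenuates isolated spurious events by a factor of 16, consistent with the denoising effect mentioned in Section IV-D.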
V. ACCURACY RESULTS
All the obtained pre-processed datasets have been tested with the cNet, the same DNN analyzed in Section III, along with the same training parameters used for the MNIST and CIFAR10 datasets. This choice ensures that the possible differences between the DNN and SNN accuracy results depend on the data pre-processing stage, and are not related to the network architecture or the training policies. As explained in Section III-H, if the DNN architecture is modified, the conversion process may suffer and the resulting SNN may not show the expected behavior.
The conversion process has been executed applying the soft reset mode and an experiment duration of 256 time-steps, with the DThIR value that has shown the best results for both the MNIST and the CIFAR10 analyses. Given the analysis provided in Section IV, a set of different frame-converted datasets have been realized. In all these datasets, the size of the frame is set to 32x32, and the event polarity is discarded. On the other hand, the converted datasets differ in the frame accumulation time, the possible use of temporal overlapping between frames, and the number of channels per frame. Table VI shows the accuracy results of the DNN on the different pre-processed datasets.

TABLE VI:
Pre-processing techniques applied to the original DvsGesture dataset and relative DNN accuracies. For all the datasets, the frame size is equal to 32x32 and the polarity information is discarded. All the generated datasets have been tested with the cNet DNN.
Dataset | duration (ms) | overlap | channels | DNN accuracy
D1 | 60 (10 per ch.) | ✗ | 6 |
Dataset D1 shows that choosing a time window of only 60ms gives relatively low accuracy results, similar to the case of dataset D2, where the time range covered by each channel is doubled. This can be attributed to the few events accumulated per channel.
In the datasets D3-5, the time window is progressively incremented, until a maximum duration of 300ms is covered. The results show that a good level of accuracy is reached with a 3-channel frame covering a period of 235ms. Dataset D6 has been realized to check whether a single-channel frame could be a valid solution. In this case, the accuracy drop is evident, and it can be attributed to the fact that a single-channel frame does not contain the temporal information, since all the events are accumulated in one channel.
In the datasets D7 and D8, an overlap factor equal to 2 is introduced. The accuracy increases, reaching its best value in dataset D8.
The cNet
DNN model trained on dataset D8 is then converted to its equivalent SNN model representation, and deployed on the Intel Loihi research platform. The converted SNN model reaches a test accuracy of 89.64%, which is only 0.82% lower than that of the original DNN model.
These results have to be compared with the state-of-the-art test accuracies achieved in [2] and in [25]. The work in [2] reaches a test accuracy of 94.59% with a 64x64 frame size, whereas the accuracy achieved on a 32x32 frame drops down to 90.78%. This last value is only 1.14% higher than the one obtained in this research using frames with the same 32x32 dimension, but it is obtained with a DNN that is much bigger (i.e., 16 convolutional layers with many more feature maps per layer) than the one used in this work (see Table II for our network configuration). However, we purposely did not employ such large and deep networks, in order to maintain a low resource utilization and a low latency for real-time embedded implementations.
In [25], the test accuracy reached on a smaller portion of the original dataset (1.5 seconds per gesture) is 93.64%, which is higher than the one obtained with our methodology. However, considering the fact that their SNN is designed and trained from scratch (i.e., not a converted one), they directly use the original event-based dataset, avoiding the inevitable information loss related to the pre-processing step.
In terms of latency, with our best solution (D8), the total time needed for a frame classification is the 150ms accumulation time (since, with an overlap factor of 2, the next frame starts after 150ms) plus the SNN inference time. This configuration gives a throughput of 6.24 frames per second, which is a feasible solution for a real-time system.

VI. CONCLUSION
In this paper, we have proposed an efficient method for deploying gesture recognition through a DVS camera on the Loihi neuromorphic processor. After a careful study of converting a given artificial Deep Neural Network (DNN) to the corresponding Spiking Neural Network (SNN) representation, we devised an efficient pre-processing method for accumulating the events coming from the DvsGesture dataset. As shown by our results, this process enables the training in the DNN domain. Therefore, the well-known training policies and optimizations for DNNs can be employed in this methodology. An efficient conversion of the trained DNN into the SNN domain enables accurate, energy-efficient, and real-time processing on a neuromorphic embedded platform such as the Intel Loihi.

ACKNOWLEDGMENTS
This work has been partially supported by the Doctoral College Resilient Embedded Systems, which is run jointly by TU Wien's Faculty of Informatics and FH-Technikum Wien.

REFERENCES

[1] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
[2] A. Amir et al. A low power, fully event-based gesture recognition system. In CVPR, 2017.
[3] M. Bouvier et al. Spiking neural networks hardware implementations and challenges: A survey. J. Emerg. Technol. Comput. Syst., 2019.
[4] F. Chollet et al. Keras. https://github.com/fchollet/keras, 2015.
[5] M. Davies et al. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 2018.
[6] P. U. Diehl, D. Neil, J. Binas, M. Cook, S. Liu, and M. Pfeiffer. Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In IJCNN, 2015.
[7] S. Furber, F. Galluppi, S. Temple, and L. Plana. The SpiNNaker project. Proceedings of the IEEE, 2014.
[8] M. A. Hanif et al. MPNA: A massively-parallel neural array accelerator with dataflow optimization for convolutional neural networks. arXiv, abs/1810.12910, 2018.
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, 2015.
[10] N. Jouppi, C. Young, N. Patil, and D. Patterson. Motivation for and evaluation of the first tensor processing unit. IEEE Micro, 2018.
[11] J. Kaiser et al. Embodied event-driven random backpropagation. CoRR, 2019.
[12] D. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations, 2014.
[13] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
[14] Y. LeCun, C. Cortes, and C. Burges. MNIST handwritten digit database. AT&T Labs [Online], 2, 2010.
[15] P. Lichtsteiner, C. Posch, and T. Delbruck. A 128 x 128 120dB 30mW asynchronous vision sensor that responds to relative intensity change. In ISSCC, 2006.
[16] A. Marchisio et al. Deep learning for edge computing: Current trends, cross-layer optimizations, and open research challenges. In ISVLSI, 2019.
[17] A. Marchisio et al. Is spiking secure? A comparative study on the security vulnerabilities of spiking and deep neural networks. IJCNN, 2020.
[18] P. A. Merolla et al. A million spiking-neuron integrated circuit with a scalable communication network and interface. Science, 2014.
[19] E. O. Neftci, H. Mostafa, and F. Zenke. Surrogate gradient learning in spiking neural networks: Bringing the power of gradient-based optimization to spiking neural networks. Signal Processing Magazine, 2019.
[20] M. Pfeiffer and T. Pfeil. Deep learning with spiking neurons: Opportunities and challenges. Frontiers in Neuroscience, 2018.
[21] B. Rückauer, N. Känzig, S. Liu, T. Delbrück, and Y. Sandamirskaya. Closing the accuracy gap in an event-based visual recognition task. CoRR, 2019.
[22] B. Rueckauer, I.-A. Lungu, Y. Hu, M. Pfeiffer, and S.-C. Liu. Conversion of continuous-valued deep networks to efficient event-driven networks for image classification. Frontiers in Neuroscience, 2017.
[23] M. Shafique et al. An overview of next-generation architectures for machine learning: Roadmap, opportunities and challenges in the IoT era. In DATE, 2018.
[24] M. Shafique et al. Robust machine learning systems: Challenges, current trends, perspectives, and the road ahead. IEEE Design & Test, 2020.
[25] S. B. Shrestha and G. Orchard. SLAYER: Spike layer error reassignment in time. In NeurIPS, 2018.
[26] G. Srinivasan, P. Panda, and K. Roy. STDP-based unsupervised feature learning using convolution-over-time in spiking neural networks for energy-efficient neuromorphic computing. J. Emerg. Technol. Comput. Syst., 2018.
[27] D. Zambrano and S. M. Bohte. Fast and efficient asynchronous neural computation with adapting spiking neural networks. CoRR, 2016.
[28] J. J. Zhang et al. Building robust machine learning systems: Current progress, research challenges, and opportunities. DAC, 2019.