The Heidelberg spiking datasets for the systematic evaluation of spiking neural networks

Benjamin Cramer, Yannik Stradmann, Johannes Schemmel, and Friedemann Zenke

Kirchhoff Institute for Physics, University of Heidelberg, Germany
Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland

December 10, 2019
Abstract
Spiking neural networks are the basis of versatile and power-efficient information processing in the brain. Although we currently lack a detailed understanding of how these networks compute, recently developed optimization techniques allow us to instantiate increasingly complex functional spiking neural networks in-silico. These methods hold the promise to build more efficient non-von-Neumann computing hardware and will offer new vistas in the quest of unraveling brain circuit function. To accelerate the development of such methods, objective ways to compare their performance are indispensable. Presently, however, there are no widely accepted means for comparing the computational performance of spiking neural networks. To address this issue, we introduce a general audio-to-spiking conversion procedure and provide two novel spike-based classification datasets. The datasets are free and require no additional preprocessing, which renders them broadly applicable to benchmark both software and neuromorphic hardware implementations of spiking neural networks. By training a range of conventional and spiking classifiers, we show that leveraging spike timing information within these datasets is essential for good classification accuracy. These results serve as the first reference for future performance comparisons of spiking neural networks.
Spiking neural networks (SNNs) are biology's solution for fast and versatile information processing. From a computational point of view, SNNs have several desirable properties: They process information in parallel, are noise-tolerant, and highly energy efficient [Boahen, 2017]. Precisely which computations are carried out in a given biological SNN depends in large part on its connectivity structure. To understand how SNNs operate, it is therefore vital to learn how connectivity gives rise to function in such networks. For conventional non-spiking artificial neural networks (ANNs), the problem of finding suitable connectivity structures that solve a specific computational task is usually phrased as an optimization problem. By performing gradient descent on a sensibly chosen objective function, the desired functionality and connectivity are acquired simultaneously for a predefined architecture. To allow gradient-based optimization, ANN models are traditionally constructed using graded neuronal activation functions [Rumelhart et al., 1986], which are in contrast to the binary nature of individual spikes of biological neurons. Importantly, binary spiking largely precludes the use of gradient-based optimization techniques and thus the construction of functional SNNs [Neftci et al., 2019]. However, a growing number of novel algorithms promises to translate the success of gradient-based learning from conventional neural networks to the spiking domain [Neftci et al., 2019, Zenke and Ganguli, 2018, Pfeiffer and Pfeil, 2018, Tavanaei et al., 2018, Bellec et al., 2018, Shrestha and Orchard, 2018, Wozniak et al., 2018] and thus to instantiate functional SNNs in-silico both on conventional computers and neuromorphic hardware [Schemmel et al., 2010, Friedmann et al., 2016, Furber et al., 2013, Davies et al., 2018, Moradi et al., 2018, Roy et al., 2019]. The emergent diversity of different learning algorithms urgently calls for principled means of comparison between different methods.
Unfortunately, widely accepted benchmark datasets for SNNs that would permit such comparisons are scarce [Roy et al., 2019, Davies, 2019]. In this article, we seek to fill this gap by introducing two new broadly applicable classification datasets for SNNs.

In the following, we provide a brief motivation for why benchmarks are crucial before reviewing existing tasks that have been used to assess SNN performance in the past. By analyzing the strengths and shortcomings of these tasks, we motivate our specific choices for the datasets we introduce in this article. Finally, we establish the first set of baselines by testing a range of conventional and SNN classifiers on these datasets.
The ultimate goal of a benchmark is to provide a quantitative, unbiased way of comparing different approaches and methods to the same problem. While each modeler usually works with a set of private benchmarks, tailored to their specific problem of study, it is equally important to have shared benchmarks, which ideally everybody agrees to use, to allow for unbiased comparison and to foster constructive competition between approaches [Roy et al., 2019]. There are several reasons why an existing benchmark may fail to serve this purpose. First, a benchmark could be un-obtainable. For instance, it could be unpublished, behind a paywall, or too difficult to use. Second, a published benchmark might be tailored to a specific problem and, therefore, not general enough to be of interest to other researchers. Third, a benchmark may be saturated, which means that it is already solved with high precision by an existing method. Naturally, this precludes the characterization of improvements over these approaches. Finally, a benchmark could require extensive preprocessing. Because preprocessing can have a substantial impact on performance, leaving too many preprocessing decisions to the user might have adverse effects on comparability. Consider, for example, the MNIST dataset: We can convert the images to Poisson spike trains with firing rates proportional to the pixels' gray values, or we can convert their values into a spike latency. Depending on the implementation details, both approaches could yield vastly different results.

The question, therefore, is: What would an ideal benchmark dataset for learning in SNNs be? While this question is difficult, if not impossible, to answer, it is probably fair to say that an ideal benchmark should be at least unsaturated, require minimal preprocessing, be sufficiently general, easy to obtain, and free to use.

Numerous studies have measured performance in SNNs differently. In the following, we give a brief overview of how previous work has assessed the computational performance of structured SNNs.
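The preprocessing ambiguity mentioned above for MNIST can be made concrete. The following sketch (our own illustration, not code from the paper; parameter values are arbitrary) converts normalized pixel intensities into either Poisson spike trains or first-spike latencies; both encode the same image, yet downstream classifiers see very different inputs.

```python
import numpy as np

def poisson_encode(pixels, t_max=100, max_rate=0.1, seed=0):
    """Encode pixel intensities in [0, 1] as Poisson spike trains.

    Each of the t_max time steps emits a spike with probability
    proportional to the pixel's gray value.
    Returns a binary array of shape (t_max, n_pixels).
    """
    rng = np.random.default_rng(seed)
    rates = pixels * max_rate  # per-step spike probability
    return (rng.random((t_max, pixels.size)) < rates).astype(np.uint8)

def latency_encode(pixels, t_max=100):
    """Encode pixel intensities as first-spike latencies.

    Bright pixels spike early, dark pixels late; a pixel of
    intensity 0 spikes only at the very end of the window.
    Returns integer spike times of shape (n_pixels,).
    """
    latencies = (1.0 - pixels) * t_max
    return np.clip(latencies, 0, t_max).astype(int)

pixels = np.array([0.0, 0.25, 0.5, 1.0])  # toy "image" of four pixels
poisson = poisson_encode(pixels)
latency = latency_encode(pixels)
```

Note that under the Poisson scheme a zero-intensity pixel emits no spikes at all, whereas under the latency scheme it still produces one (late) spike, already a qualitative difference between the two conversions.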
We limit our scope to benchmarks that have been used to solve supervised learning tasks with SNNs. For an extensive review of studies with a primary focus on unsupervised learning and biologically plausible plasticity models such as spike-timing dependent plasticity (STDP), see Tavanaei et al. [2018]. Although supervised learning is not necessarily biologically plausible, it is a useful tool to characterize the computational capabilities of spiking circuits and to engineer functional circuits for neuromorphic applications. Supervised learning in SNNs can coarsely be categorized into steady-state rate-coding and sequence-to-sequence mapping, although hybrids between the two also exist. In steady-state rate-coding, SNNs approximate conventional analog neural networks by using an effective firing rate code in which both input and output firing rates remain constant during the presentation of a single stimulus [Pfeiffer and Pfeil, 2018, Zylberberg et al., 2011, Neftci et al., 2017]. Inputs to the network enter as Poisson distributed spike trains for which the firing rates are proportional to the current input level. Similarly, network outputs are given as a firing rate or spike count of designated output units. Because of these input-output specifications, steady-state rate-coded networks can often be trained using network translation [Pfeiffer and Pfeil, 2018], and, importantly, they can be directly applied to and compared with standard machine learning datasets (e.g. MNIST [LeCun et al., 1998], CIFAR10 [Krizhevsky et al., 2009], or SVHN [Netzer et al., 2011]).

The capabilities of SNNs, however, go beyond such rate-coding networks. Specifically, spike timing can serve as an extra coding dimension [Bohte et al., 2002, Mostafa, 2018, Comsa et al., 2019]. To that end, several studies have made use of temporal coding schemes, for instance, to solve sequence-to-sequence mapping problems. In this setting, input and output activity varies during the processing of a single input example.
Within this coding scheme, outputs can be either individual spikes [Gütig, 2014], spike trains with predefined firing times [Memmesheimer et al., 2014], or continuously varying quantities derived from the spikes. The latter are typically defined as linear combinations of low-pass filtered spike trains [Eliasmith and Anderson, 2004, Denève and Machens, 2016, Abbott et al., 2016, Nicola and Clopath, 2017, Gilra and Gerstner, 2017].

One of the simplest temporal coding benchmarks is the temporal exclusive-OR (XOR) task, which exists in different variations [Bohte et al., 2002, Abbott et al., 2016, Huh and Sejnowski, 2018]. A simple SNN without hidden layers cannot solve this problem, similar to the Perceptron's inability to solve the regular XOR task. Hence, the temporal XOR is commonly used to demonstrate that a specific method supports hidden-layer learning. In the temporal XOR task, a neural network has to solve a Boolean XOR problem in which the logical off and on levels correspond to early and late spike times respectively. While the temporal XOR does require a hidden layer to be solved correctly, its intrinsic low dimensionality and the low number of input patterns render this benchmark saturated. Therefore, its possibilities for quantitative comparison between training methods are limited.

To assess learning in a more fine-grained way, several studies have focused on SNNs' abilities to generate precisely timed output spike trains in more general scenarios [Memmesheimer et al., 2014, Ponulak and Kasiński, 2009, Pfister et al., 2006, Florian, 2012, Mohemmed et al., 2012, Gardner and Grüning, 2016]. To that end, it is customary to use several Poisson input spike trains to generate a specific target spike train. Apart from regular output spike trains (e.g. Gardner and Grüning [2016]), random target spike trains with increasing length and Poisson statistics have also been considered [Memmesheimer et al., 2014].
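As a concrete illustration of the temporal XOR task described above, the four input patterns can be generated as follows (our own sketch; the specific early/late spike times are arbitrary choices, not values from any of the cited studies).

```python
import numpy as np

T_EARLY, T_LATE = 10.0, 50.0  # ms; arbitrary "off"/"on" spike times

def temporal_xor_dataset():
    """Return the four temporal XOR patterns.

    Each input is a pair of spike times (one per input neuron);
    logical 0 maps to an early spike, logical 1 to a late spike.
    The target is the XOR of the two logical values.
    """
    patterns, targets = [], []
    for a in (0, 1):
        for b in (0, 1):
            times = np.array([T_LATE if a else T_EARLY,
                              T_LATE if b else T_EARLY])
            patterns.append(times)
            targets.append(a ^ b)
    return np.stack(patterns), np.array(targets)

X, y = temporal_xor_dataset()  # X: (4, 2) spike times, y: (4,) labels
```

With only four distinct patterns and two input channels, the saturation argument from the text is immediate: any method that learns a hidden layer at all will classify this dataset perfectly.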
Figure 1: Processing pipeline for the Heidelberg Digits (HD) and the Speech Commands (SC) dataset. (a) The HD are recorded in a sound-shielded room. (b) Afterwards, the resulting audio files are cut and mastered. (c) The HD as well as the SC are fed through a hydrodynamic basilar membrane model. (d) Basilar membrane decompositions are converted to phase-coded spikes by use of a transmitter-pool based hair cell model. (e) The phase-locking is increased by combining multiple spike trains of hair cells at the same position of the basilar membrane in a single bushy cell.

Similarly, the Tempotron [Gütig and Sompolinsky, 2006] uses an interesting hybrid approach in which random temporally encoded spike input patterns are classified into binary categories corresponding to spiking versus quiescence of a designated output neuron. In the associated benchmark, task performance is measured as the number of binary patterns that can be classified correctly. While mapping random input spikes to output spikes allows a fine-grained comparison between methods, the aforementioned tasks lack a non-random structure.

A standard benchmark for studies with a focus on modeling continuous low-dimensional dynamics with recurrently connected spiking neural networks (RSNNs) is to reproduce the dynamics of classic nonlinear systems. Examples are the two-dimensional van-der-Pol oscillator or the chaotic three-dimensional Lorenz system [Nicola and Clopath, 2017, Gilra and Gerstner, 2017, Thalmeier et al., 2016]. However, low-dimensional dynamical systems may not be a good model for the processing of high-dimensional sensory data.

Finally, some datasets for n-way classification were born out of practical engineering needs. The majority of these datasets are based on the output of neuromorphic sensors like, for instance, the dynamic vision sensor (DVS) [Lichtsteiner et al., 2008] or the silicon cochlea [Anumula et al., 2018]. An early example of such a dataset is Neuromorphic MNIST [Orchard et al., 2015], which was generated by a DVS recording MNIST digits that were projected on a screen.
The digits were moved at certain intervals to elicit spiking responses in the DVS. The task is to identify the corresponding digits from the elicited spikes. This benchmark has been used widely in the SNN community. However, being based on the MNIST dataset, it is nearing saturation. The DASDIGITS dataset [Anumula et al., 2018] was created by processing the original TIDIGITS spoken digit dataset with a 64 channel silicon cochlea. Unfortunately, because TIDIGITS is under a proprietary license, the license requirements for the derived dataset are not entirely clear. Moreover, because TIDIGITS contains sequences of spoken digits, the task goes beyond a straight-forward n-way classification problem and therefore is beyond the scope of many current SNN implementations. More recently, IBM has released the DVS128 Gesture Dataset [Amir et al., 2017] under a Creative Commons license. The dataset consists of numerous DVS recordings of 11 unique hand gestures performed by different persons under varying lighting conditions. The spikes in this dataset are provided as a continuous data stream, which makes extensive cutting and preprocessing necessary. Finally, the 128 × 128 pixel size renders this dataset computationally expensive unless additional preprocessing steps such as downsampling are applied.

In this article, we sought to generate widely applicable SNN benchmarks with comparatively modest computational requirements. To that end, we developed a processing framework to convert audio data into spikes. Using this framework, we generated two new spike-based datasets for speech classification and keyword spotting that are not saturated by current methods. Moreover, solving these problems with high accuracy requires taking into account spike timing. Finally, to facilitate their use, extension, and future generation of new datasets, we released both datasets and software under permissive public domain licenses.

To improve the quantitative comparison between SNNs, we have created two large spike-based classification datasets from audio data. We focused on audio signals of spoken words due to their natural temporal dimension and lower bandwidth in comparison to video data. To that end, we employed an artificial model of the inner ear and parts of the ascending auditory pathway to convert audio data into spikes (Figure 1). Directly providing input spikes allows us to sidestep the issue of user-specific preprocessing that can confound comparability. Taken together, these design decisions provide the basis for comparable benchmark results at a comparatively low computational cost.

The first dataset, the Spiking Heidelberg Digits (SHD), is based on a spoken digit dataset that was recorded in-house at the University of Heidelberg in a soundproofed room with professional recording equipment (Methods Section 4.1.1). The underlying Heidelberg Digits (HD) audio dataset contains 20 classes of spoken digits, namely the English and German digits from zero to nine, spoken by 12 speakers (Figure 2). The second dataset, the Spiking Speech Commands (SSC), was derived from Google's free Speech Commands (SC) dataset [Warden, 2018] of spoken command words.
Most importantly, we consider all 35 different words as separate classes, deviating from the originally proposed key-word spotting task with only 12 classes (10 key words, unknown word, and silence). However, the data can still be used in the originally intended way.

To separate the data into training and test sets, we have applied two different partitioning strategies: For the SHD, we held out two speakers exclusively for the test set. The remainder of the test set was filled with samples from speakers also present in the training set. This division allows one to assess a trained network's ability to generalize across speakers. For the SSC, the recordings were divided randomly by a hashing function [Warden, 2018]. Importantly, we adhere to the splitting proposed by the author, which also includes a predefined validation data set. The partitioning was designed to result in an 80 %-20 % split between training and testing samples for both datasets. For validation purposes, we used 10 % of the samples of the training set.

Both datasets and the accompanying software are available online (https://compneuro.net, https://github.com/electronicvisions).

Figure 2: The Heidelberg Digits (HD) have a balanced class count and variable temporal duration. The HD consist of 10420 recordings of spoken digits ranging from zero to nine in English and German. (a) Histogram of per-speaker digit counts. The different shadings correspond to the German (light gray) and English (dark gray) digit counts. Variable numbers of digits are available for each speaker and each language. (b) Histogram of per-class digit counts. The dataset is balanced in terms of digits within each language. (c) Histogram of audio recording durations. The HD audio recordings were cut to minimal duration to keep computation time at bay.

While both datasets were generated using the same overall processing pipeline (cf. Figure 1), the underlying audio data differ in important respects. The HD dataset was optimized for recording quality and precise audio alignment. In contrast, the SC dataset is intended to closely mimic real-world conditions for key-word spotting on mobile devices. On the one hand, it has higher noise levels and lower temporal alignment precision. On the other hand, it features additional classes and an about 10-fold larger number of trials.

To facilitate the use of these datasets and to simplify access for a broader community, we used an event-based representation of spikes in Hierarchical Data Format 5 (HDF5) files. This choice ensures short download times and ease of access from most common programming environments. For each partition and dataset, we provide a single HDF5 file which holds spikes, digit labels, and additional meta information such as the speaker's identity, age, body height, and gender (Methods Section 4.2).
Figure 3: Temporal information is essential to classify the SHD and the SSC datasets with high accuracy. (a) Bar graph of classification accuracy for different SVMs trained on spike count vectors, and for LSTM as well as CNN classifiers trained on the binned spike trains of the SHD dataset. The different shadings correspond to training (dark gray), validation (light gray), and test accuracy (hatched). Error bars indicate the standard deviation over 10 repetitions. Classification accuracy on SHD is significantly higher for LSTMs and CNNs, which also show a lower degree of overfitting. (b) Same as in (a), but showing performance on SSC. LSTMs and CNNs with access to temporal information outperform the SVM classifiers by a large margin.
We sought to establish that the datasets were not saturated and that spike timing information is essential to solve the tasks with high accuracy. To test this, we first generated a reduced version of the datasets in which we removed all temporal information. To that end, we computed spike count patterns from both datasets, which, by design, do not contain temporal information about the stimuli. Using these reduced spike count datasets, we then trained different linear and nonlinear support vector machine (SVM) classifiers (Methods Section 4.4.1) and measured their classification performance on the respective test sets. We found that while a linear SVM readily overfitted the data in the case of SHD, its test performance only marginally exceeded the 55 % accuracy mark (Figure 3a; here and in the following, error estimates correspond to the standard deviation over n = 10 repetitions). For the SSC, overfitting was less pronounced, but the overall test accuracy dropped to 20 % (Figure 3b). Thus linear classifiers provided a low degree of generalization.

To assess whether this situation was different for nonlinear classifiers, we trained SVMs with polynomial kernels up to a degree of 3. For these kernels, overfitting was less pronounced. Slightly better performance of about 60 % on the SHD and 30 % on the SSC was achieved when using an SVM with a radial basis function (RBF) kernel. The performance on the SHD test set, which includes speakers that are not part of the training set, was noticeably lower compared to the accuracy on the validation data. Especially for polynomial and RBF kernels, the generalization across speakers was worse than for the linear kernel (Figure 3a). In contrast, we found the performance on the SSC test set to be on par with the accuracy on the validation set (Figure 3b), which is most likely an effect of the uniform speaker distribution. These results illustrate that both linear and nonlinear classifiers trained on spike count patterns without temporal information were unable to surpass the 60 % accuracy mark for the SHD and the 30 % mark for the SSC dataset. Therefore, spike counts are not sufficient to achieve high classification accuracy on the studied datasets.

Figure 4: Setup and multiple choices of loss functions for SNNs. (a) Schematic of a single layer recurrent network with two readout units. We applied two different loss functions for long short-term memories (LSTMs) and SNNs: First, a max-over-time loss was considered, where the time step with maximal activity of each readout was used to calculate the cross entropy (marked by colored arrows). Second, a last-time-step loss was utilized, where only the last time step of the activation was considered in the calculation of the cross entropy (marked by gray arrow). (b) Bar graph of classification accuracy for different SNNs and an LSTM on the SHD. The different shadings indicate training (dark gray) and validation (light gray). Error bars correspond to the standard deviation of 10 repetitions. Only the LSTM generalized well when trained with a last-time-step loss. (c) Same as in (b), but showing performance for a max-over-time loss. Overall, SNNs and LSTMs performed better when trained with the max-over-time loss.

Next, we wanted to assess whether decoding accuracy could be improved when training classifiers that have explicit access to temporal information of the spike times. To that end, we trained LSTMs on temporal histograms of spiking activity (Methods Section 4.4.2). In spite of the small size of the SHD dataset, LSTMs showed reduced overfitting and were able to solve the classification problem with an accuracy of about 85 % (Figure 3a), which was significantly higher than the best performing SVM. Similarly, for the SSC dataset, the LSTM test accuracy of about 75 % was more than twice as high as the best-performing classifier on the spike count data. However, the degree of overfitting we observed was slightly higher than on SHD.

Since both kernel machines and LSTMs were affected by overfitting, we tested whether performance could be increased with convolutional neural networks (CNNs) due to their inductive bias toward translation invariance in both frequency and time and their reduced number of parameters. To that end, we binned spikes in spatio-temporal histograms and trained a CNN classifier (Methods 4.4.3). CNNs showed the least amount of overfitting among all tested classifiers; the drop from validation to test accuracy was small on both SHD and SSC (Figure 3). In particular, the performance on the SHD test data was on par with the one on the validation set, demonstrating a high degree of generalization.

These findings highlight that the temporal information contained in both datasets can be exploited by suitable neural network architectures. Moreover, these results provide a lower bound on the performance ceiling for both datasets. It seems likely that a more careful architecture search and hyperparameter tuning will further improve upon these results. Thus, both the SHD and the SSC will be useful for quantitative comparison between SNNs up to at least these empirical accuracy values.

Having established that both spiking datasets contain useful temporal information that can be read out by a suitable classifier, we sought to train SNNs of leaky integrate-and-fire (LIF) neurons using backpropagation through time (BPTT) and to assess their generalization performance. One problem with training SNNs with gradient descent arises because the derivative of the neural activation function appears in the evaluation of the gradient. Since spiking is an intrinsically discontinuous process, the resulting gradients are ill-defined. To nevertheless train networks of LIF neurons using supervised loss functions, we used a surrogate gradient approach [Neftci et al., 2019].
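A minimal numpy sketch of this idea (our own illustration; the paper's actual implementation is described in its Methods): the forward pass keeps the hard spike threshold, while the backward pass substitutes the derivative of a fast sigmoid, whose slope is controlled by a steepness parameter β.

```python
import numpy as np

def spike_fn(v, threshold=1.0):
    """Forward pass: hard threshold on the membrane potential.
    Its true derivative is zero almost everywhere, so it carries
    no useful gradient signal on its own."""
    return (v >= threshold).astype(float)

def surrogate_grad(v, threshold=1.0, beta=40.0):
    """Backward pass: derivative of a fast sigmoid of the distance
    to threshold, used in place of the ill-defined derivative of
    the step function during BPTT."""
    return 1.0 / (beta * np.abs(v - threshold) + 1.0) ** 2

v = np.linspace(0.0, 2.0, 5)   # sample membrane potentials
s = spike_fn(v)                # binary spikes from the hard threshold
g = surrogate_grad(v)          # peaks at threshold, decays with |v - thr|
```

Only the gradient path is relaxed: `spike_fn` still produces binary spikes in the forward pass, so the network's dynamics are unchanged; `surrogate_grad` merely tells BPTT how strongly a small change in membrane potential near threshold should influence upstream parameters.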
Surrogate gradients can be seen as a continuous relaxation of the real gradients of an SNN, which can be implemented as an in-place replacement while performing BPTT. Importantly, we did not change the neuron model and the associated forward pass of the model, but used a fast sigmoid as a surrogate activation function when computing gradients (Methods Section 4.3.3).

In contrast to LSTMs, bio-inspired SNNs have fixed, finite time constants on the order of milliseconds. Because of this constraint, we considered two different loss functions for both LSTMs and SNNs (Figure 4a). The results for LSTMs shown in Fig. 3 were obtained by training with a last-time-step loss, where the activation of the last time step of each example and readout unit was used to calculate the cross entropy loss at the output. In addition, we also considered a max-over-time loss, in which the time step with maximum activation of each readout unit was taken into account (Fig. 4a). This loss function is motivated by the Tempotron [Gütig and Sompolinsky, 2006], in which the network signals its decision about the class membership of the applied input pattern by whether a neuron spiked or not.

Figure 5: Accuracy, but not convergence time, is only mildly affected by the steepness β of the surrogate derivative. (a) Accuracy as a function of β on a validation set of the SHD. Different colors and linestyles indicate different learning rates η. Errors correspond to the standard deviation of 10 repetitions. Performance is highest for a wide range of sufficiently large β values and depends only slightly on η. (b) Number of epochs needed to reach a threshold accuracy. In contrast to peak performance, the convergence time strongly depends on both β and η. (c) Loss curves on the SHD for β = 40. The shadings and linestyles indicate the loss on training (solid, dark gray) and validation (dotted, light gray) data. (d) Same as in (c), but showing the accuracy on the SHD.

We evaluated the performance of LSTMs and SNNs for both aforementioned loss functions on the SHD. Training LSTMs with a cross entropy loss based on the activity of the last time step of every sample was associated with high performance, in contrast to SNNs (Figure 4b). The slightly reduced performance of feed-forward SNNs trained with a last-time-step loss compared to RSNNs suggests that their time constants were too short to provide all necessary information at the last time step. RSNNs presumably compensated for this through active memory implemented by reverberating activity in the recurrent connections. Overall, SNNs performed better in combination with the max-over-time loss function (Figure 4c).
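The two loss functions can be written compactly. Assuming a readout activity array of shape (time steps, classes), a sketch (our own; the softmax cross entropy matches the description in the text, but the toy numbers are invented):

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross entropy for a single example."""
    z = logits - logits.max()                # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def last_time_step_loss(activity, label):
    """Use only the readout activations of the final time step."""
    return cross_entropy(activity[-1], label)

def max_over_time_loss(activity, label):
    """Use the maximal activation of each readout unit over time."""
    return cross_entropy(activity.max(axis=0), label)

# Toy readout: class 1 peaks mid-trial but has decayed by the last step.
activity = np.array([[0.0, 0.2],
                     [0.1, 3.0],
                     [0.5, 0.1]])
l_last = last_time_step_loss(activity, label=1)
l_max = max_over_time_loss(activity, label=1)
```

The toy example illustrates why the max-over-time variant suits networks with short time constants: a transient peak of the correct readout still produces a small loss, whereas the last-time-step loss penalizes the same trial because the evidence has leaked away by the final step.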
LSTMs also showed increased performance in combination with a max-over-time loss; the validation accuracy increased from about 95 % for the last-time-step loss to about 97 % for the max-over-time loss. Motivated by these results, we used a max-over-time loss for SNNs as well as LSTMs throughout the remainder of this manuscript.

Surrogate gradient learning introduces a new hyperparameter β associated with the steepness of the surrogate derivative (Methods Equation (20)). Because changes in β may require a different optimal learning rate η, we performed a grid search over β and η based on a single-layer RSNN architecture trained on SHD. We found that sensible combinations of both parameters lead to stable performance plateaus over a large range of values (Figure 5a). Only for small β did the accuracy drop dramatically, whereas it decreased only slowly for high values. Interestingly, the learning rate had hardly any effect on peak performance for the tested parameter values. As expected, convergence speed heavily depended on both η and β. These results motivated us to use β = 40 and a correspondingly chosen learning rate η for all SNN architectures presented in this article unless mentioned otherwise. For this choice, the performance of the RSNN on the validation set reached its peak after about 150 epochs (Figure 5d). Additional training only increased performance on the training dataset (Figure 5c), but did not impact generalization (Figure 5d).

With the parameter choices discussed above, we trained various SNN architectures on the SHD and the SSC. To that end, we considered feed-forward SNNs with multiple layers l and a single-layer RSNN. Interestingly, increasing l did not significantly improve performance on the SHD (Figure 6a). In addition, all choices of l caused high levels of overfitting. Moreover, feed-forward SNNs reached slightly lower accuracy levels than the SVMs on the SHD (Figure 3a). For the larger SSC dataset, the degree of overfitting was much smaller (Figure 6b) and performance was significantly better than the one reached by SVMs (Figure 3b). Here, increasing the number of layers of feed-forward SNNs led to a monotonic increase of performance on the test set, from about 32 % for a single layer to about 41 % in the case of l = 3. However, when testing RSNNs, we found consistently higher performance and improved generalization across speakers. In comparison to LSTMs, RSNNs showed higher overfitting and generalized less well across speakers. The RSNN achieved the highest accuracy among all tested SNNs, with about 71 % on the SHD and about 50 % on the SSC, which was still less than the LSTM with about 85 % on the SHD and about 75 % on the SSC.
For robust spoken word classification, generalization across speakers is a key feature. This generalization can be assessed by evaluating the accuracy per speaker on SHD, as the digits spoken by speakers four and five are only present in the test set. We compared the performance on the digits of the held-out speakers to all other speakers and found a clear performance drop across all classification methods for speakers four and five (Figure 7a and c). For SVMs, the linear kernel led to the smallest accuracy drop of about 18 %, whereas we found a decrease of 26 % for the RBF kernel. CNNs generalized best, with the smallest drop of all classifiers, followed by the LSTMs with 10 %. Among SNNs, feed-forward architectures were most strongly affected, with a drop of about 24 % to 27 %. RSNNs, however, only underwent a decline of 21 % in performance (Figure 7b). This illustrates that the composition of the test set of SHD can provide meaningful information with regard to generalization across speakers.
Figure 6: Recurrent SNNs outperform feed-forward architectures on both datasets. (a) Bar graph of classification accuracy for different SNN architectures on the SHD. The different shadings indicate training (dark gray), validation (light gray), and testing accuracy (hatched). Error bars correspond to the standard deviation of 10 repetitions. The accuracy reached by the RSNN is comparable to the performance of LSTMs with a max-over-time loss. Increasing the number of layers in feed-forward architectures hardly affected performance. (b) Same as in (a), but showing performance on the SSC. The performance of SNNs was lower than the one reached by LSTMs. In contrast to (a), an increasing number of layers led to a monotonic increase of accuracy.
Because English digits are part of both datasets, we were able to test the generalization across datasets by training SNNs, LSTMs, and CNNs on the full SHD dataset while testing on a restricted SSC dataset and vice versa (Figure 7b and d). For testing, the datasets were restricted to the common English digits zero to nine. Perhaps not surprisingly, networks generalized better when trained on the larger SSC dataset as a reference and tested on SHD. Nevertheless, all architectures trained on the SHD and tested on the SSC reached performance levels above chance. Again, recurrent architectures reached the highest performance among all tested SNNs.

In this article, we introduced two new public-domain spike-based classification datasets to facilitate the quantitative comparison of SNNs. By training a range of spiking and non-spiking classifiers, we provide the first set of baselines for future comparisons.

Both spiking datasets are based on auditory classification tasks but were derived from data acquired in different recording settings. We chose audio data as the basis for our benchmarks because audio has a temporal dimension, which makes it a natural choice for spike-based processing. However, in contrast to movie data, audio requires fewer input channels for a faithful representation, which renders the derived spiking datasets computationally more tractable.

We did not use other existing audio datasets as a basis for the spiking version for different reasons. For instance, a large body of spoken digits is provided by the
Figure 7: Networks generalize across speakers and datasets. By reserving two speakers for the test set, the SHD dataset allows to assess speaker-generalization performance. (a) Per-speaker classification accuracy on the test set of the SHD. The different colors indicate the linear (dark gray circles) and RBF (red triangles) SVMs, the LSTM (blue crosses), and the CNN (green squares). Errors correspond to the standard deviation of 10 repetitions. A clear decrease in performance was observable for samples spoken by the held-out speakers four and five (highlighted). (b) Bar graph of the performance of SNNs, LSTM, and CNN trained on the SHD and tested on the English digits of the SSC. The different shadings indicate the performance on the SHD test set (dark gray) and the performance on the English digits of the SSC test set (light gray). Chance level is indicated by a dashed line. Accuracy on the SSC digits was significantly lower than on the digits of the SHD test set. (c) Same as in (a), but showing the per-speaker accuracy of SNNs. Here, different colors correspond to one (dark gray circles), two (red triangles), and three hidden layers (blue crosses), and RSNNs (green squares). As in (a), a decrease in performance for the held-out speakers was observed. (d) Same as in (b), but showing the performance of networks trained on the SSC and tested on the English digits of the SHD. As opposed to (b), networks trained on the SSC digits generalize well across datasets.
TIDIGITS dataset [Leonard and Doddington, 1991]. However, this dataset is only available under a commercial license, and we were aiming for fully open datasets. In contrast, the Free Spoken Digit Dataset [Zohar et al., 2018] is available under a Creative Commons BY 4.0 license. Since this dataset only contains 2k recordings with an overall lower recording and alignment quality, we deemed recording HD a necessary contribution. Other datasets, such as Mozilla's Common Voice [Mozilla, 2019], LibriSpeech [Panayotov et al., 2015], and TED-LIUM [Rousseau et al., 2012], are also publicly available. However, these datasets pose more challenging speech detection problems since they are only aligned at the sentence level. Such more challenging tasks are left for future research on functional SNNs. The Spoken Wikipedia Corpora [Köhn et al., 2016], for instance, also provides alignment at the word level, but requires further preprocessing such as the dissection of audio files into separate words. Moreover, its sheer size and the imbalance in samples per class render the dataset more challenging. We therefore left its conversion for future work.

Table 1: Performance comparison

                      SHD        SSC
SVM
  Linear              . ± .      . ± .
  Poly-2              . ± .      . ± .
  Poly-3              . ± .      . ± .
  RBF                 . ± .      . ± .
LSTM                  . ± .      . ± .
CNN                   . ± .      . ± .
Spiking
  1-layer             . ± .      . ± .
  2-layer             . ± .      . ± .
  3-layer             . ± .      . ± .
  Recurrent           . ± .      . ± .
Trained with max-over-time loss

The only existing public-domain dataset with word-level alignment, tractable size, and modest preprocessing requirements that we were aware of at the time of writing this manuscript was the SC dataset. This is the reason why we chose to base one spiking benchmark on SC while simultaneously providing the separate and smaller HD dataset with higher recording quality and alignment precision. Finally, the high-fidelity recordings of the HD also make it suitable for the quantitative evaluation of the impact of noise on network performance, because well-characterized levels of noise can be added.

In our model, the spike conversion step consists of a published physical inner-ear model [Sieroka et al., 2006] followed by an established hair-cell model [Meddis, 1988]. The processing chain is completed by a single layer of bushy cells (BCs) to increase phase-locking and to decrease the overall number of spikes. With this setup we provide a standardized conversion pipeline from raw audio signals to spikes by generating spikes from the HD and the SC audio datasets. In doing so, we both improve usability and reduce a common source of performance variability due to differences in the preprocessing pipelines of end users.

Our approach of directly providing spikes is similar to the publicly available DASDIGIT dataset [Anumula et al., 2018]. DASDIGIT is composed of recordings from the TIDIGIT dataset [Leonard and Doddington, 1991] which have been played to a dynamic audio sensor with frequency-selective channels. In contrast to SHD and SSC, the raw audio files of the TIDIGIT dataset are only available under a commercial license. Also, the frequency resolution, measured in frequency-selective bands of the basilar membrane (BM) model, is about a factor of 10 lower. Since the software used for processing the SHD and the SSC datasets is publicly available, extension is straightforward.
This step is more challenging for DASDIGIT, because it requires a dynamic audio sensor.

To establish the first set of baselines, we trained a range of non-spiking and spiking classifiers on both the SHD and the SSC. By comparing the classification accuracy on the full datasets with the performance obtained on reduced-spike-count datasets, we found that the temporal information available in the spike times can be leveraged for better classification by suitable classifiers. Moreover, architectures with explicit recurrence, like LSTMs and RSNNs, were the best-performing models among all architectures we tested. Most likely, the reverberating activity through recurrent connections in RSNNs implements the required memory, thereby bridging the gap between neural time constants and audio features. Therefore, the inclusion of additional state variables evolving on a slower time scale, as in [Bellec et al., 2018], will be an interesting extension to improve the performance of SNNs.

Our analysis of the SHD and the SSC using LSTMs and SNNs showed that the choice of loss function can have a marked effect on classification performance. While LSTMs performed well with a last-time-step loss, in which only the last time step was used to calculate the cross-entropy loss, the highest accuracy for LSTMs and SNNs was achieved with a max-over-time loss, in which the time step with maximum activation of each readout unit was considered. A detailed analysis of suitable cost functions for training SNNs will be an interesting direction for future research.

In summary, we have introduced two versatile and open spiking datasets and conducted a first set of performance measurements using SNN classifiers. This constitutes an important step toward the more quantitative comparison of functional SNNs in-silico, both on conventional computers and on neuromorphic hardware.
We start with a description of the audio recordings in Section 4.1, which served as the basis for our spiking datasets. The transformation associated with the ascending auditory pathway is explained in Section 4.2. Finally, the methods for supervised learning in SNNs are specified in Section 4.3 and for the control networks in Section 4.4. All model parameters are listed in Table 2.
The two spiking datasets introduced in this article were derived from two sets of audio recordings. Specifically, we newly recorded the HD dataset, and we used the published SC dataset by the TensorFlow and AIY teams [Warden, 2018].
The Heidelberg Digits (HD) consist of approximately 10k high-quality recordings of spoken digits ranging from zero to nine in the English and German languages. In total, twelve speakers were included, six of which were female and six male. The speaker ages ranged from 21 yr to 56 yr with a mean of (29 ± 9) yr. We recorded digit sequences for each language with a total digit count of 10 420 (cf. Figure 2).

The digits were acquired in sequences of ten successive digits. Audio recordings were performed in a sound-shielded room at the Heidelberg University Hospital with three microphones: two AudioTechnica Pro37 in different positions and a Beyerdynamic M201 TG (Figure 1). Digitized by a Steinberg MR816 CSX audio interface, recordings were made in WAVE format with a sample rate of 48 kHz and 24 bit precision.

To improve the yield of the subsequent automated processing, a manual pre-selection and cutting of the raw recordings was performed (the processed dataset is available at https://compneuro.net/posts/2019-spiking-heidelberg-digits/). 30 ms Hanning windows were applied to the start and end of the peak-normalized audio signals, as further processing stages involve the computation of fast Fourier transformations (FFTs).

We partitioned the digits into training and testing datasets by assigning the digits of two speakers exclusively to the testing dataset to allow for well-founded statements on generalization. In more detail, all digits spoken by speakers four and five were added to the testing dataset. Moreover, a fraction of the recordings of each digit and language of all other speakers was appended to the testing dataset.
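The partitioning scheme above can be sketched as follows. The fraction of additional recordings moved to the test set is an illustrative placeholder (`extra_frac`), not the exact value used for SHD:

```python
import random

def split_heldout(samples, heldout=(4, 5), extra_frac=0.05, seed=0):
    """Partition (speaker_id, sample) pairs: everything from the held-out
    speakers goes to the test set, plus a random fraction of every other
    speaker's recordings.  `extra_frac` is an illustrative placeholder,
    not the exact fraction used for SHD."""
    rng = random.Random(seed)
    train, test = [], []
    for spk, sample in samples:
        if spk in heldout or rng.random() < extra_frac:
            test.append((spk, sample))
        else:
            train.append((spk, sample))
    return train, test

# Toy example: 12 speakers with 20 recordings each.
samples = [(spk, i) for spk in range(12) for i in range(20)]
train, test = split_heldout(samples)
```

Because the split is keyed on speaker identity rather than on individual recordings, no utterance of the held-out speakers can leak into training.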
The Speech Commands (SC) dataset is composed of WAVE files with a 16 kHz sample rate, each containing a single English word [Warden, 2018]. The words were spoken by a large number of speakers and published under the Creative Commons BY 4.0 license. In this study, we considered version 0.02 with 105 829 audio files, in which a total of 24 single-word commands (Yes, No, Up, Down, Left, Right, On, Off, Stop, Go, Backward, Forward, Follow, Learn, Zero, One, Two, Three, Four, Five, Six, Seven, Eight, Nine) were repeated about five times per speaker, whereas ten auxiliary words (Bed, Bird, Cat, Dog, Happy, House, Marvin, Sheila, Tree, and Wow) were only repeated about once per speaker. Partitioning into training, testing, and validation datasets was done by a hashing function [Warden, 2018]. In addition, we applied 30 ms Hanning windows to the start and end of each waveform.
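The edge tapering can be sketched with numpy's Hann window. The 30 ms ramp length follows the text above; the function itself is an illustrative sketch, not the authors' exact code:

```python
import numpy as np

def taper_edges(signal, sr=48_000, ramp_ms=30.0):
    """Apply Hanning (Hann) ramps of length ramp_ms to the start and end
    of a waveform, as done before the FFT-based processing stages.
    Illustrative sketch, not the authors' implementation."""
    n = int(sr * ramp_ms / 1000)
    window = np.hanning(2 * n)          # full Hann window of length 2n
    out = signal.astype(float).copy()
    out[:n] *= window[:n]               # rising half at the start
    out[-n:] *= window[-n:]             # falling half at the end
    return out

x = np.ones(48_000)                     # 1 s of constant signal at 48 kHz
y = taper_edges(x)
```

Tapering both edges to zero removes the step discontinuities that would otherwise smear energy across the FFT bins of the downstream cochlea model.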
The audio files described in the previous sections served as the basis for our spiking datasets. The spikes were saved in event-based form and stored together with the corresponding digit and speaker ID as well as speaker meta information in an HDF5 file. We made these files available to the public, including supplementary information on general usage as well as code snippets. A single file is organized as follows:

root
├── spikes
│   ├── times[]
│   └── units[]
├── labels[]
└── extra
    ├── speaker[]
    ├── keys[]
    └── meta_info
        ├── gender[]
        ├── age[]
        └── body_height[]

The entries times and units in spikes hold VLArrays [Alted and Fernández-Alonso, 2003] of samples. Each listing of these VLArrays holds a Numpy array [Oliphant, 2006] containing the spike times or the spike-emitting units, respectively. The item labels consists of an array of digit IDs for each sample in spikes. The extra entry comprises additional information about the speaker: First, speaker holds an array of speaker IDs for each sample; second, keys contains an array of strings in which each element i describes the transformation between the digit ID i and the spoken word. Last, meta_info contains the arrays gender, age, and body_height, in which entry i corresponds to speaker i. The speaker meta information is only available for SHD and is therefore omitted for SSC.

Figure 8: Schematic view of the inner ear model. (a) Illustration of the basilar membrane (BM) mechanics. The BM (blue) separates the scala tympani (lower chamber) from the scala vestibuli (upper chamber). At the helicotrema (green), the two scalae are connected. The scala tympani ends in the round window (yellow). A sound wave v_sig penetrating the eardrum applies pressure at the oval window (red) by moving the ossicles, leading to a compression wave and a slower traveling wave. We have neglected the scala media [Sieroka et al., 2006] and consider a stretched form. (b) Schematic view of the transmitter flow within the hair cell (HC) model proposed by Meddis [1988]. Figure adapted from Meddis [1986]. The model comprises four transmitter pools which describe the transmitter concentration in the synaptic cleft.

To obtain the aforementioned spiketrains, the acoustic signals of the HD and the SC dataset (hosted at download.tensorflow.org) were transformed into neural activity by an inner-ear model (Figure 1). In more detail, we applied a hydrodynamic basilar membrane (BM) model and a transmitter-pool-based hair cell (HC) model in succession. Further, we increased phase-locking by a layer of bushy cells (BCs) integrating the signals from HCs at the same position of the BM.

As a complete consideration of hydrodynamic BM models is beyond the scope of this manuscript, we follow the steps of Sieroka et al. [2006] and highlight the key steps of their derivation in the following. A fundamental aspect of a cochlea model is the interaction between a fluid and a membrane causing spatial frequency dispersion [Sieroka et al., 2006, de Boer, 1980, 1984]. Key mechanical features of the cochlea are covered by the simplified geometry of the BM in Figure 8a. Here, we assumed the fluid to be inviscid and incompressible. Furthermore, we expect the oscillations to be sufficiently small that the fluid dynamics can be described as linear.
The BM was expressed in terms of its mechanical impedance ξ(x, ω), which depends on the position x and the angular frequency ω = 2πν:

$$\xi(x,\omega) = \frac{1}{i\omega}\left[S(x) - \omega^2 m + i\omega R(x)\right], \qquad (1)$$

with a transversal stiffness S(x) = C₀ e^{−αx} − a, a resistance R(x) = R₀ e^{−αx/2}, and an effective mass m [de Boer, 1980]. The damping of the BM was described by γ = R₀/√(C₀ m). Variations of the stiffness over several orders of magnitude allow the model to encompass the entire range of audible frequencies.

Let p(x, ω) be the difference between the pressure in the upper and lower chamber. The following expression fulfills the boundary conditions v_y = 0 for y = h and v_z = 0 at z = ±b [de Boer, 1980]:

$$p(x,y,z,\omega) = \sum_n \int_{-\infty}^{\infty} \frac{dk}{2\pi}\, e^{-ikx}\, \hat p(k) \left[\frac{\cosh(m_0(h-y))}{\cosh(m_0 h)} + \frac{m_0 \tanh(m_0 h)\, \cosh(m_n(h-y))}{m_n \tanh(m_n h)\, \cosh(m_0 h)} \cdot \cos\!\left(\frac{\pi z n}{b}\right)\right]. \qquad (2)$$

The Laplace equation yields the expressions m₀ = k and m_n = √(k² + π²n²/b²). Only the principal mode of excitation in the z-direction was considered by setting n = 1. With the assumptions made above, the Euler equation for the y-component of the velocity in the middle of the BM reads

$$\partial_y p(x,\omega) = -i\omega\rho\, v_y(x,\omega) = \frac{2 i\omega\rho}{\xi(x,\omega)}\, p(x,\omega), \qquad (3)$$

where we dropped the y and z arguments for readability. In the following, we consider the limiting case of long waves with kh ≪ 1. By combining Equations (2) and (3), one gets

$$\partial_x^2\, p(x,\omega) = \frac{2 i\omega\rho}{h\, \xi(x,\omega)}\, p(x,\omega), \qquad (4)$$

where the replacements p̂(k) → p(x), k → i∂_x, and k² → −∂_x² have been applied. Here, p̂(k) denotes the Fourier transform of p(x, ω) with respect to x. The solution of this equation was approximated by

$$p(x,\omega) = \sqrt{\frac{G(x,\omega)}{g(x,\omega)}}\; H_0^{(2)}\!\left(G(x,\omega)\right), \qquad (5)$$

where H₀⁽²⁾ is the Hankel function of the second kind, and g(x, ω) and G(x, ω) are given by

$$g(x,\omega) = \omega \sqrt{\frac{2\rho}{h\, \xi(x,\omega)}}, \qquad (6)$$

$$G(x,\omega) = \int_0^x dx'\, g(x',\omega) + \frac{2}{\alpha}\, g(0,\omega). \qquad (7)$$

Table 2: Model parameters
Parameter                        Symbol     Value
Damping const.                   γ          0.15 s⁻¹
Greenwood's const.               a          35 kg s⁻² cm⁻²
Stiffness const.                 C₀         –
Fluid density                    ρ          –
Attenuation factor               α          –
Height of scala                  h          –
Effective mass                   m          0.05 g cm⁻²
Number of channels               N_ch       700
Permeability offset              A          –
Permeability rate                B          –
Maximum permeability             g          –
Replenishing rate                y          –
Loss rate                        l          –
Reuptake rate                    r          16 667
Reprocessing rate                n          –
Probability scaling              h          50 000
Number of HCs per position       N_HC       40
Synaptic time const.             τ_syn      10 ms
Membrane time const.             τ_mem      20 ms
Refractory time const.           τ_ref      –
Leak potential                   u_leak     0
Reset potential                  u_reset    0
Threshold potential              u_thres    1
Number of neurons per layer      N          128
Simulation step size             δt         –
Simulation duration              T          –
Batch size                       N_batch    –
Learning rate                    η          –
Steepness of gradient            β          40
Regularization lower threshold   θ_l        –
Regularization lower strength    s_l        –
Regularization upper threshold   θ_u        –
Regularization upper strength    s_u        –
First moment decay rate          β₁         –
Second moment decay rate         β₂         –
BC parameter
SNN parameter
An analytical expression for G(x, ω) can be found in Sieroka et al. [2006]. The model was applied to a given stimulus by

$$v_y(x,t) = \int \frac{d\omega}{2\pi}\; \frac{i Z_{in}\, v_y(x,\omega)}{p(0,\omega)}\; e^{-i\omega t}\, v_{sig}(\omega), \qquad (8)$$

where v_sig(ω) is the Fourier transform of the stimulus. The input impedance of the cochlea was modeled by

$$Z_{in}(\omega) = \frac{p(x=0)}{v_x(x=0)} \approx \sqrt{\frac{\rho h C_0}{2}}\; \frac{i J(\zeta) + Y(\zeta)}{J(\zeta) - i Y(\zeta)}, \qquad (9)$$

with the Bessel functions of the first (J) and second (Y) kind and ζ = (2ω/α) √(2ρ/(hC₀)).

To process the audio data, we evaluated v_y(x, t) over the length of the BM in N_ch steps of equal size (Figure 1c). Specifically, we chose N_ch = 700 as a compromise between a faithful representation of the underlying audio signal and manageable computational cost when using the dataset. Before applying the BM model, each recording was normalized to 65 dB root mean square (RMS).
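The input normalization step can be sketched as follows; the dB reference amplitude `ref` is an illustrative assumption, since the reference level is not stated above:

```python
import numpy as np

def normalize_rms_db(x, target_db=65.0, ref=1e-5):
    """Scale a waveform so that its RMS level equals target_db in dB re `ref`.
    The reference amplitude is an illustrative assumption."""
    rms = np.sqrt(np.mean(x ** 2))
    target_rms = ref * 10.0 ** (target_db / 20.0)
    return x * (target_rms / rms)

t = np.arange(48_000) / 48_000
x = 0.3 * np.sin(2 * np.pi * 440 * t)   # 1 s, 440 Hz test tone
y = normalize_rms_db(x)
```

Normalizing every recording to the same RMS level ensures that the nonlinear hair-cell stage downstream operates in a comparable regime for all samples.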
The transformation of the movement of the BM into spikes was realized by the HC model. To combine the HC model described in Meddis [1988, 1986] with the BM model of the previous paragraph, we normalized the velocity of the BM, v_y(x, t), to an RMS displacement of |v_y(x, t)| = 1 in the resonance region in response to a 500 Hz sine stimulus at 30 dB. The following description illustrates the key steps of Meddis [1986], to which we refer for further details.

In the HC model, one assumes that the cell contains a specific amount of free transmitter molecules q(x, t), which can be released into the synaptic cleft through a permeable membrane (Figure 8b). The permeability is a function of the velocity of the BM, v_y(x, t):

$$k(x,t) = \begin{cases} g \cdot \dfrac{v_y(x,t) + A}{v_y(x,t) + A + B} & \text{for } v_y(x,t) + A > 0 \\ 0 & \text{else.} \end{cases} \qquad (10)$$

The amount c(x, t) of transmitter in the cleft is subject to chemical destruction or loss through diffusion, l · c(x, t), as well as re-uptake into the cell, r · c(x, t):

$$\frac{dc}{dt} = k(x,t)\, q(x,t) - l \cdot c(x,t) - r \cdot c(x,t). \qquad (11)$$

A fraction n · w(x, t) of the reuptaken transmitter w(x, t) is continuously transferred back to the free transmitter pool:

$$\frac{dw}{dt} = r \cdot c(x,t) - n \cdot w(x,t). \qquad (12)$$

The transmitter originates in a manufacturing base that replenishes the free transmitter pool at a rate y [1 − q(x, t)]:

$$\frac{dq}{dt} = y\,[1 - q(x,t)] + n \cdot w(x,t) - k(x,t)\, q(x,t). \qquad (13)$$

While in the cleft, transmitter quanta have a finite probability P_spike = h · c(x, t) dt of influencing the postsynaptic excitatory potential. A refractory period was imposed by denying any event which occurs within a fixed interval of a previous event. At each position x of the BM, we simulated N_HC = 40 independent HCs.

The phase-locking of the HC outputs was increased by feeding their spike output to a population of N_ch BCs (Figure 1). In contrast to Rothman et al. [1993], we implemented the BCs as LIF neurons, described in Section 4.3.2. In more detail, we considered a single layer (l = 1) of BCs without recurrent connections (V_ij^(l) = 0 ∀ i, j). The feed-forward weights were uniformly initialized to a fixed value scaled by 1/N_HC. A single BC was used to integrate the spiketrains of the N_HC = 40 HCs for each channel of the BM.
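Equations (10) to (13) can be integrated with a simple forward-Euler scheme. The parameter values and stimulus below are illustrative placeholders, not the values of Table 2:

```python
import numpy as np

# Forward-Euler sketch of the Meddis hair-cell transmitter pools,
# Equations (10)-(13).  Parameter values are illustrative placeholders.
A, B, g_max = 5.0, 300.0, 2000.0       # permeability offset, rate, maximum
y_repl, l_loss, r_up, n_re = 5.05, 2500.0, 6580.0, 66.3
dt = 1e-4                               # integration step [s]

def permeability(v):
    """k(t) from Equation (10): release rate as a function of BM velocity."""
    return g_max * (v + A) / (v + A + B) if v + A > 0 else 0.0

def simulate(v_y, q=1.0, c=0.0, w=0.0):
    """Integrate the free (q), cleft (c) and reuptake (w) pools."""
    cleft = []
    for v in v_y:
        k = permeability(v)
        dq = y_repl * (1.0 - q) + n_re * w - k * q
        dc = k * q - l_loss * c - r_up * c
        dw = r_up * c - n_re * w
        q, c, w = q + dt * dq, c + dt * dc, w + dt * dw
        cleft.append(c)
    return np.array(cleft)

# Cleft concentration in response to a brief velocity pulse (steps 500-599).
stim = np.zeros(2000)
stim[500:600] = 30.0
cleft = simulate(stim)
```

The cleft concentration, and hence the spike probability P_spike, rises sharply at stimulus onset and then adapts as the free pool depletes, which is the rate-adaptation behavior the Meddis model is known for.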
To establish a performance reference on the two spiking datasets, we trained networks of LIF neurons with surrogate gradients and BPTT using a supervised loss function. We start with a description of the network architectures (Section 4.3.1), followed by the applied neuron and synapse model (Section 4.3.2). We close with a depiction of the supervised learning algorithm (Section 4.3.3), the loss (Section 4.3.4), and the regularization functions (Section 4.3.5).
The spiketrains emitted by the N_ch BCs were used to stimulate the actual classification network. In this manuscript, we applied feed-forward networks with l ∈ {1, 2, 3} hidden layers and a recurrent network with l = 1, each layer containing N = 128 LIF neurons. For all network architectures, the last layer was accompanied by a linear readout consisting of leaky integrators which do not spike.
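A hidden LIF layer with a non-spiking leaky readout, as described above, can be sketched in discrete time following the update rules of Section 4.3.2. Weight scales and the random input below are illustrative choices, not the trained networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete-time sketch of one hidden LIF layer with a leaky readout.
n_in, n_hid, n_out, T = 700, 128, 20, 100
dt, tau_syn, tau_mem = 1e-3, 10e-3, 20e-3
kappa, lam = np.exp(-dt / tau_syn), np.exp(-dt / tau_mem)

W1 = rng.uniform(-1, 1, (n_in, n_hid)) * 7.0 / np.sqrt(n_in)  # scaled so the toy net fires
W2 = rng.uniform(-1, 1, (n_hid, n_out)) / np.sqrt(n_hid)

def run(spikes_in):
    """spikes_in: binary array (T, n_in); returns readout traces (T, n_out)."""
    I = np.zeros(n_hid)
    u = np.zeros(n_hid)
    I_out = np.zeros(n_out)
    u_out = np.zeros(n_out)
    trace = np.zeros((T, n_out))
    for t in range(T):
        S = (u >= 1.0).astype(float)               # spike if u crosses u_thres = 1
        u = lam * u * (1.0 - S) + (1.0 - lam) * I  # membrane update with reset
        I = kappa * I + spikes_in[t] @ W1          # exponential synaptic current
        u_out = lam * u_out + (1.0 - lam) * I_out  # leaky, non-spiking readout
        I_out = kappa * I_out + S @ W2
        trace[t] = u_out
    return trace

x = (rng.random((T, n_in)) < 0.05).astype(float)   # sparse random input spikes
out = run(x)
```

The readout traces produced by such a forward pass are exactly what the max-over-time or last-time-step losses of Section 4.3.4 operate on.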
We considered LIF neurons whose membrane potential u_i^(l), for the i-th neuron in layer l, obeys the differential equation

$$\tau_{mem} \frac{du_i^{(l)}}{dt} = -\left[u_i^{(l)}(t) - u_{leak}\right] + R\, I_i^{(l)}(t), \qquad (14)$$

with the membrane time constant τ_mem, the input resistance R, the leak potential u_leak, and the input current I_i^(l)(t). Spikes were described by their firing times. The k-th firing time of neuron i in layer l is denoted by ᵏt_i^(l) and defined by a threshold criterion:

$${}^{k}t_i^{(l)}: \quad u_i^{(l)}\!\left({}^{k}t_i^{(l)}\right) \ge u_{thres}. \qquad (15)$$

Immediately after ᵏt_i^(l), the membrane potential is clamped to the reset potential, u_i^(l)(t) = u_reset for t ∈ (ᵏt_i^(l), ᵏt_i^(l) + τ_ref], with the refractory period τ_ref. The synaptic input current onto the i-th neuron in layer l was generated by the arrival of presynaptic spikes from neurons j, S_j^(l)(t) = Σ_k δ(t − ᵏt_j^(l)). A common first-order approximation to model the time course of synaptic currents are exponentially decaying currents which are linearly summed [Gerstner and Kistler, 2002]:

$$\frac{dI_i^{(l)}}{dt} = -\frac{I_i^{(l)}(t)}{\tau_{syn}} + \sum_j W_{ij}^{(l)} S_j^{(l-1)}(t) + \sum_j V_{ij}^{(l)} S_j^{(l)}(t), \qquad (16)$$

where the sums run over all presynaptic partners j, W_ij^(l) are the corresponding afferent weights from the layer below, and the V_ij^(l) represent the recurrent connections within each layer.

For neurons subject to supervised learning, the case τ_ref = 0 was considered for simplicity. Here, the reset can be incorporated into Equation (14) through an extra term:

$$\frac{du_i^{(l)}}{dt} = \frac{-\left[u_i^{(l)} - u_{rest}\right] + R\, I_i^{(l)}}{\tau_{mem}} + S_i^{(l)}(t)\left(u_{rest} - u_{thres}\right). \qquad (17)$$

To formulate the above equations in discrete time for time step n and step size δt over a duration T = n · δt, the output spiketrain S_i^(l)[n] of neuron i in layer l at time step n is expressed as a nonlinear function of the membrane voltage, S_i^(l)[n] = Θ(u_i^(l) − u_thres), with the Heaviside function Θ. For small time steps δt, we can express the synaptic current in discrete time as

$$I_i^{(l)}[n+1] = \kappa\, I_i^{(l)}[n] + \sum_j W_{ij}^{(l)} S_j^{(l-1)}[n] + \sum_j V_{ij}^{(l)} S_j^{(l)}[n]. \qquad (18)$$

Further, by asserting u_leak = 0 and u_thres = 1, the membrane potential can be written compactly as

$$u_i^{(l)}[n+1] = \lambda\, u_i^{(l)}[n]\left(1 - S_i^{(l)}[n]\right) + (1 - \lambda)\, I_i^{(l)}[n], \qquad (19)$$

where we have set R = (1 − λ) and introduced the constants κ ≡ exp(−δt/τ_syn) and λ ≡ exp(−δt/τ_mem). In all our spiking network simulations, we used Kaiming's uniform initialization [He et al., 2015] for the weights W_ij and V_ij. Specifically, the initial weights were drawn independently from a uniform distribution U(−√k, √k), with k the inverse of the number of afferent connections.

The task of learning was to minimize a cost function L over the entire dataset. To achieve this, gradient descent was applied, which modifies the network parameters W_ij:

$$W_{ij} \leftarrow W_{ij} - \eta\, \frac{\partial L}{\partial W_{ij}}, \qquad (20)$$

with the learning rate η. In more detail, we used custom PyTorch [Paszke et al., 2017] code implementing the SNNs with surrogate gradients [Neftci et al., 2019] and applying the BPTT algorithm to compute the gradients. We chose a fast sigmoid to calculate the surrogate gradient:

$$\sigma\!\left(u_i^{(l)}\right) = \frac{u_i^{(l)}}{1 + \beta\left|u_i^{(l)}\right|}, \qquad (21)$$

with the steepness parameter β.

We applied a cross-entropy loss to the activity of the readout layer l = L.
This function L, defined on data with N_batch samples and N_class classes, {(x_s, y_s) | s = 1, ..., N_batch; y_s ∈ {1, ..., N_class}}, takes the form

$$L = -\frac{1}{N_{batch}} \sum_{s=1}^{N_{batch}} \sum_{i=1}^{N_{class}} \mathbb{1}(i = y_s) \cdot \log \frac{\exp\!\left(u_i^{(L)}[\tilde n_i]\right)}{\sum_{i'=1}^{N_{class}} \exp\!\left(u_{i'}^{(L)}[\tilde n_{i'}]\right)}, \qquad (22)$$

with the indicator function 𝟙. We considered the following two choices for the time step ñ (Figure 4): For the max-over-time loss, the time step with maximal membrane potential of each readout unit was used, ñ_i = argmax_n u_i^(L)[n]. In contrast, for the last-time-step loss, the final time step T was chosen for each readout neuron, ñ_i = T. We minimized the cross entropy in Equation (22) with the Adamax optimizer [Kingma and Ba, 2014].

For our experiments, we added regularization terms to the loss function to avoid pathologically high or low firing rates. In more detail, we used two different regularization terms. First, a per-neuron lower-threshold spike count regularization of the form

$$L_{lower} = \frac{s_l}{N_{batch}\, N} \sum_{s=1}^{N_{batch}} \sum_{i=1}^{N} \left[\max\left\{0,\; \theta_l - \sum_{n=1}^{T} S_i^{(l)}[n]\right\}\right], \qquad (23)$$

with strength s_l and threshold θ_l. Second, an upper-threshold mean population spike count regularization:

$$L_{upper} = \frac{s_u}{N_{batch}} \sum_{s=1}^{N_{batch}} \left[\max\left\{0,\; \frac{1}{N} \sum_{i=1}^{N} \sum_{n=1}^{T} S_i^{(l)}[n] - \theta_u\right\}\right], \qquad (24)$$

with strength s_u and threshold θ_u.

For validation purposes, we applied three standard non-spiking methods for time-series classification to the datasets, namely SVMs, described in Section 4.4.1, LSTMs, as shown in Section 4.4.2, and CNNs, detailed in Section 4.4.3.
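All control classifiers operate on rasterized versions of the event-based samples: summed per-channel spike counts for the SVMs and 10 ms time bins for the LSTMs and CNNs. A numpy sketch, with hypothetical times/units arrays mimicking the HDF5 fields:

```python
import numpy as np

def rasterize(times, units, n_ch=700, t_max=1.0, bin_ms=10.0):
    """Bin an event-based sample into a dense (n_bins, n_ch) array.
    Summing over the time axis yields the per-channel spike counts used
    for the SVMs; the binned array itself is the LSTM/CNN style input.
    `times`/`units` mimic the HDF5 fields; the values are hypothetical."""
    n_bins = int(np.ceil(t_max * 1000.0 / bin_ms))
    dense = np.zeros((n_bins, n_ch))
    bins = np.minimum((times * 1000.0 / bin_ms).astype(int), n_bins - 1)
    np.add.at(dense, (bins, units), 1.0)   # unbuffered add handles repeated events
    return dense

times = np.array([0.005, 0.012, 0.012, 0.73])   # spike times in seconds
units = np.array([3, 3, 10, 699])               # emitting channel per spike
x = rasterize(times, units)
counts = x.sum(axis=0)                          # SVM feature vector
```

Using `np.add.at` rather than plain fancy-index assignment matters here: two events falling into the same (bin, channel) cell must both be counted.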
We trained linear and non-linear SVMs using scikit-learn [Pedregosa et al., 2011]. Specifically, we trained SVMs with polynomial (up to third degree) and RBF kernels. The vectors in the input space were constructed such that for each sample an N_ch-dimensional vector x_i was generated by counting the number of spikes emitted by each BC. Furthermore, the features were standardized by removing the mean and scaling to unit variance.

We used LSTMs [Hochreiter and Schmidhuber, 1997] for validation purposes on the temporal data. The inputs to the LSTM consist of the N_ch spike trains emitted by the BCs, binned in time bins of size 10 ms. We trained LSTM networks using TensorFlow 1.14.0 with the Keras 2.3.0 application programming interface (API) [Abadi et al., 2015, Chollet et al., 2015]. For all layers, we kept the default parameters and initialization unless mentioned otherwise. Specifically, we considered a single LSTM layer with 128 cells, with dropout applied to the linear transformation of the input as well as to the linear transformation of the recurrent states. Last, a readout with softmax activation was applied. The model was trained with the Adamax optimizer [Kingma and Ba, 2014] and a categorical cross-entropy loss defined either on the activation of the last time step or on the time step with maximal activation.

We applied CNNs to further test the separability of the datasets. To that end, the spike trains were binned not only in time, but also in space. The temporal bin width was set to 10 ms; along the spatial dimension, the data was binned to a reduced number of distinct input units. As for the LSTMs, the CNNs were trained using TensorFlow with the Keras API with default parameters and initialization unless mentioned otherwise. First, a 2D convolution layer with 32 filters and rectified linear unit (ReLU) activation was applied. Next, the output was processed by three successive blocks, each composed of two 2D convolutional layers, each of them accompanied by batch normalization and ReLU activation. Both convolutional layers contain 32 filters. We finalized each block with a 2D max-pooling layer and a dropout layer with rate 0.2. The output of the last of the three blocks was processed by a dense layer with 128 nodes and ReLU activation, followed by a readout with softmax activation. The whole model was trained with the Adamax optimizer [Kingma and Ba, 2014] and a categorical cross-entropy loss.

Acknowledgment
We gratefully acknowledge funding from the European Union under grant agreements 604102, 720270, and 785907 (HBP) and from the Manfred Stärk Foundation. The authors acknowledge support by the state of Baden-Württemberg through bwHPC. This work was supported by the Novartis Research Foundation. We thank Prof. Dr. Sebastian Hoth for enabling access to the sound-shielded room at the University Hospital Heidelberg, Prof. Dr. Hans Günter Dosch for discussions and advice on auditory preprocessing, and David Schumann for mastering the audio files of the HD.
References

Kwabena Boahen. A neuromorph's prospectus. Computing in Science & Engineering, 19(2):14–28, Mar 2017. doi: 10.1109/MCSE.2017.33.

David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. Learning representations by back-propagating errors. Nature, 323(6088):533–536, October 1986. ISSN 0028-0836. doi: 10.1038/323533a0.

Emre O. Neftci, Hesham Mostafa, and Friedemann Zenke. Surrogate gradient learning in spiking neural networks. arXiv:1901.09948 [cs, q-bio], January 2019.

Friedemann Zenke and Surya Ganguli. SuperSpike: Supervised learning in multilayer spiking neural networks. Neural Computation, 30(6):1514–1541, April 2018. ISSN 0899-7667. doi: 10.1162/neco_a_01086.

Michael Pfeiffer and Thomas Pfeil. Deep learning with spiking neurons: Opportunities and challenges. Frontiers in Neuroscience, 12, 2018. ISSN 1662-453X. doi: 10.3389/fnins.2018.00774.

Amirhossein Tavanaei, Masoud Ghodrati, Saeed Reza Kheradpisheh, Timothée Masquelier, and Anthony Maida. Deep learning in spiking neural networks. Neural Networks, December 2018. ISSN 0893-6080. doi: 10.1016/j.neunet.2018.12.002.

Guillaume Bellec, Darjan Salaj, Anand Subramoney, Robert Legenstein, and Wolfgang Maass. Long short-term memory and learning-to-learn in networks of spiking neurons. In Advances in Neural Information Processing Systems, pages 787–797, 2018.

Sumit Bam Shrestha and Garrick Orchard. SLAYER: Spike layer error reassignment in time. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 1419–1428. Curran Associates, Inc., 2018.

Stanislaw Wozniak, Angeliki Pantazi, and Evangelos Eleftheriou. Deep networks incorporating spiking neural dynamics. arXiv:1812.07040 [cs], December 2018.

J. Schemmel, D. Brüderle, A. Grübl, M. Hock, K. Meier, and S. Millner. A wafer-scale neuromorphic hardware system for large-scale neural modeling. Proceedings of the 2010 IEEE International Symposium on Circuits and Systems (ISCAS'10), pages 1947–1950, 2010.

Simon Friedmann, Johannes Schemmel, Andreas Grübl, Andreas Hartel, Matthias Hock, and Karlheinz Meier. Demonstrating hybrid learning in a flexible neuromorphic hardware system. IEEE Transactions on Biomedical Circuits and Systems, 11(1):128–142, 2016.

S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown. Overview of the SpiNNaker system architecture. IEEE Transactions on Computers, 62(12):2454–2467, Dec 2013. ISSN 0018-9340. doi: 10.1109/TC.2012.142.

M. Davies, N. Srinivasa, T. Lin, G. Chinya, Y. Cao, S. H. Choday, G. Dimou, P. Joshi, N. Imam, S. Jain, Y. Liao, C. Lin, A. Lines, R. Liu, D. Mathaikutty, S. McCoy, A. Paul, J. Tse, G. Venkataramanan, Y. Weng, A. Wild, Y. Yang, and H. Wang. Loihi: A neuromorphic manycore processor with on-chip learning. IEEE Micro, 38(1):82–99, January 2018. ISSN 0272-1732. doi: 10.1109/MM.2018.112130359.

Saber Moradi, Ning Qiao, Fabio Stefanini, and Giacomo Indiveri. A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs).
IEEE transactionson biomedical circuits and systems , 12(1):106–122, 2018.Kaushik Roy, Akhilesh Jaiswal, and Priyadarshini Panda.Towards spike-based machine intelligence with neu-romorphic computing.
Nature , 575(7784):607–617,November 2019. ISSN 1476-4687. doi: 10.1038/s41586-019-1677-2. URL .Mike Davies. Benchmarks for progress in neuromorphiccomputing.
Nature Machine Intelligence , 1(9):386–388,2019.Yann LeCun, Léon Bottou, Yoshua Bengio, PatrickHaffner, et al. Gradient-based learning applied to doc-ument recognition.
Proceedings of the IEEE , 86(11):2278–2324, 1998.Joel Zylberberg, Jason Timothy Murphy, andMichael Robert DeWeese. A Sparse Coding Modelwith Synaptically Local Plasticity and Spiking NeuronsCan Account for the Diverse Shapes of V1 Simple CellReceptive Fields.
PLoS Comput Biol , 7(10):e1002250,October 2011. doi: 10.1371/journal.pcbi.1002250.Emre O. Neftci, Charles Augustine, Somnath Paul,and Georgios Detorakis. Event-driven randomback-propagation: Enabling neuromorphic deeplearning machines.
Frontiers in Neuroscience , 11:324, 2017. doi: 10.3389/fnins.2017.00324. URL .Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiplelayers of features from tiny images. Technical report,Citeseer, 2009.Yuval Netzer, Tao Wang, Adam Coates, AlessandroBissacco, Bo Wu, and Andrew Y. Ng. ReadingDigits in Natural Images with Unsupervised FeatureLearning. 2011. URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf .Sander M Bohte, Joost N Kok, and Han La Poutre. Error-backpropagation in temporally encoded networks ofspiking neurons.
Neurocomputing , 48(1):17–37, 2002.H. Mostafa. Supervised Learning Based on TemporalCoding in Spiking Neural Networks.
IEEE Transactionson Neural Networks and Learning Systems , PP(99):1–9,2018. doi: 10.1109/TNNLS.2017.2726060.Iulia M. Comsa, Krzysztof Potempa, Luca Versari,Thomas Fischbacher, Andrea Gesmundo, and JyrkiAlakuijala. Temporal coding in spiking neural networkswith alpha synaptic function. arXiv:1907.13223 [cs,q-bio] , July 2019.Robert Gütig. To spike, or when to spike?
CurrentOpinion in Neurobiology , 25:134–139, April 2014. ISSN0959-4388. doi: 10.1016/j.conb.2014.01.004. Raoul-Martin Memmesheimer, Ran Rubin, Bence P.Ölveczky, and Haim Sompolinsky. Learning PreciselyTimed Spikes.
Neuron , 82(4):925–938, May 2014. ISSN0896-6273. doi: 10.1016/j.neuron.2014.03.026.Chris Eliasmith and Charles H. Anderson.
Neural Engi-neering: Computation, Representation, and Dynamicsin Neurobiological Systems . A Bradford Book, Cam-bridge, Mass., new ed edition edition, August 2004.ISBN 978-0-262-55060-4.Sophie Denève and Christian K. Machens. Efficient codesand balanced networks.
Nat Neurosci , 19(3):375–382,March 2016. ISSN 1097-6256. doi: 10.1038/nn.4243.L. F. Abbott, Brian DePasquale, and Raoul-MartinMemmesheimer. Building functional networks of spik-ing model neurons.
Nature Neuroscience , 19(3):350–355,March 2016. doi: 10.1038/nn.4241.Wilten Nicola and Claudia Clopath. Supervised learningin spiking neural networks with FORCE training.
Na-ture Communications , 8(1):2208, December 2017. ISSN2041-1723. doi: 10.1038/s41467-017-01827-3.Aditya Gilra and Wulfram Gerstner. Predicting non-lineardynamics by stable local learning in a recurrent spikingneural network. eLife Sciences , 6:e28295, November2017. ISSN 2050-084X. doi: 10.7554/eLife.28295.Dongsung Huh and Terrence J Sejnowski. Gradient de-scent for spiking neural networks. In
Advances in Neu-ral Information Processing Systems , pages 1433–1443,2018.Filip Ponulak and Andrzej Kasiński. Supervised Learningin Spiking Neural Networks with ReSuMe: SequenceLearning, Classification, and Spike Shifting.
NeuralComputation , 22(2):467–510, October 2009. ISSN 0899-7667. doi: 10.1162/neco.2009.11-08-901.Jean-Pascal Pfister, Taro Toyoizumi, David Barber, andWulfram Gerstner. Optimal Spike-Timing-DependentPlasticity for Precise Action Potential Firing in Super-vised Learning.
Neural Computation , 18(6):1318–1348,April 2006. ISSN 0899-7667. doi: 10.1162/neco.2006.18.6.1318.Răzvan V. Florian. The Chronotron: A Neuron ThatLearns to Fire Temporally Precise Spike Patterns.
PLOS ONE , 7(8):e40233, August 2012. ISSN 1932-6203. doi: 10.1371/journal.pone.0040233.Ammar Mohemmed, Stefan Schliebs, Satoshi Matsuda,and Nikola Kasabov. Span: spike pattern associationneuron for learning spatio-temporal spike patterns.
Int.J. Neur. Syst. , 22(04):1250012, June 2012. ISSN 0129-0657. doi: 10.1142/S0129065712500128.Brian Gardner and André Grüning. Supervised Learn-ing in Spiking Neural Networks for Precise TemporalEncoding.
PLOS ONE , 11(8):e0161335, August 2016.ISSN 1932-6203. doi: 10.1371/journal.pone.0161335.14obert Gütig and Haim Sompolinsky. The tempotron: aneuron that learns spike timing-based decisions.
NatNeurosci , 9(3):420–428, March 2006. ISSN 1097-6256.doi: 10.1038/nn1643.Dominik Thalmeier, Marvin Uhlmann, Hilbert J. Kappen,and Raoul-Martin Memmesheimer. Learning UniversalComputations with Spikes.
PLOS Comput Biol , 12(6):e1004895, June 2016. ISSN 1553-7358. doi: 10.1371/journal.pcbi.1004895.P. Lichtsteiner, C. Posch, and T. Delbruck. A 128x128120 dB 15us Latency Asynchronous Temporal ContrastVision Sensor.
IEEE Journal of Solid-State Circuits ,43(2):566–576, February 2008. ISSN 0018-9200. doi:10.1109/JSSC.2007.914337.Jithendar Anumula, Daniel Neil, Tobi Delbruck, and Shih-Chii Liu. Feature Representations for NeuromorphicAudio Spike Streams.
Frontiers in neuroscience , 12:23,2018. ISSN 1662-453X. doi: 10.3389/fnins.2018.00023.Garrick Orchard, Ajinkya Jayawant, Gregory K. Cohen,and Nitish Thakor. Converting Static Image Datasets toSpiking Neuromorphic Datasets Using Saccades.
Front.Neurosci. , 9, 2015. ISSN 1662-453X. doi: 10.3389/fnins.2015.00437.Arnon Amir, Brian Taba, David Berg, Timothy Melano,Jeffrey McKinstry, Carmelo Di Nolfo, Tapan Nayak,Alexander Andreopoulos, Guillaume Garreau, MarcelaMendoza, and others. A Low Power, Fully Event-BasedGesture Recognition System. In
Proceedings of theIEEE Conference on Computer Vision and PatternRecognition , pages 7243–7252, 2017.Pete Warden. Speech commands: A dataset forlimited-vocabulary speech recognition. arXiv preprintarXiv:1804.03209 , 2018.R Gary Leonard and George R Doddington. A speaker-independent connected-digit database.
InstrumentsIncorporated, Central Research Laboratories, Dallas,TX , 75266, 1991.Jackson Zohar, Souza César, Flaks Jason, Pan Yuxin,Nicolas Hereman, and Thite Adhish. Jakobovski/free-spoken-digit-dataset: v1.0.8, August 2018. URL https://doi.org/10.5281/zenodo.1342401 .Mozilla. Mozilla common voice, August 2019. URL https://voice.mozilla.org/en .Vassil Panayotov, Guoguo Chen, Daniel Povey, and San-jeev Khudanpur. Librispeech: an asr corpus based onpublic domain audio books. In , pages 5206–5210. IEEE, 2015.Anthony Rousseau, Paul Deléglise, and Yannick Esteve.Ted-lium: an automatic speech recognition dedicatedcorpus. In
LREC , pages 125–129, 2012. Arne Köhn, Florian Stegen, and Timo Baumann. Miningthe spoken wikipedia for speech data and beyond. InNicoletta Calzolari (Conference Chair), Khalid Choukri,Thierry Declerck, Marko Grobelnik, Bente Maegaard,Joseph Mariani, Asuncion Moreno, Jan Odijk, and Ste-lios Piperidis, editors,
Proceedings of the Tenth Interna-tional Conference on Language Resources and Evalua-tion (LREC 2016) , Paris, France, may 2016. EuropeanLanguage Resources Association (ELRA). ISBN 978-2-9517408-9-1.N Sieroka, H G Dosch, and A Rupp. Semirealistic modelsof the cochlea.
The Journal of the Acoustical Society ofAmerica , 120(1):297–304, 2006. ISSN 0001-4966. doi:10.1121/1.2204438.R Meddis. Simulation of auditory–neural transduction:Further studies.
The Journal of the Acoustical Societyof America , 83(3):1056–1063, 1988.David Schumann. dev-core. https://dev-core.org/ .Accessed: 2019-08-23.Paul Knysh and Yannis Korkolis. Blackbox: A proce-dure for parallel optimization of expensive black-boxfunctions. arXiv preprint arXiv:1605.00998 , 2016.Francesc Alted and Mercedes Fernández-Alonso. Pytables:processing and analyzing extremely large amounts ofdata in python.
PyCon2003. April , pages 1–9, 2003.Travis E Oliphant.
A guide to NumPy , volume 1. TrelgolPublishing USA, 2006.R Meddis. Simulation of mechanical to neural trans-duction in the auditory receptor.
The Journal of theAcoustical Society of America , 79(3):702–711, 1986.E de Boer. Auditory physics. Physical principles in hearingtheory. I.
Physics reports , 62(2):87–174, 1980.E de Boer. Auditory physics. Physical principles in hearingtheory. II.
Physics Reports , 105(3):141–226, 1984.Jason S Rothman, Eric D Young, and Paul B Manis.Convergence of auditory nerve fibers onto bushy cellsin the ventral cochlear nucleus: implications of a com-putational model.
Journal of Neurophysiology , 70(6):2562–2583, 1993.Wulfram Gerstner and Werner M Kistler.
Spiking neuronmodels: Single neurons, populations, plasticity . Cam-bridge university press, 2002.Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Delving deep into rectifiers: Surpassing human-levelperformance on imagenet classification. In
Proceed-ings of the IEEE international conference on computervision , pages 1026–1034, 2015.Adam Paszke, Sam Gross, Soumith Chintala, GregoryChanan, Edward Yang, Zachary DeVito, Zeming Lin,Alban Desmaison, Luca Antiga, and Adam Lerer. Au-tomatic differentiation in PyTorch. In
NIPS AutodiffWorkshop , 2017.15iederik P Kingma and Jimmy Ba. Adam: Amethod for stochastic optimization. arXiv preprintarXiv:1412.6980 , 2014.F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel,B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,R. Weiss, V. Dubourg, J. Vanderplas, A. Passos,D. Cournapeau, M. Brucher, M. Perrot, and E. Duches-nay. Scikit-learn: Machine learning in Python.
Journalof Machine Learning Research , 12:2825–2830, 2011.Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural computation , 9(8):1735–1780,1997.Martín Abadi, Ashish Agarwal, Paul Barham, EugeneBrevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado,Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghe-mawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving,Michael Isard, Yangqing Jia, Rafal Jozefowicz, LukaszKaiser, Manjunath Kudlur, Josh Levenberg, Dande-lion Mané, Rajat Monga, Sherry Moore, Derek Murray,Chris Olah, Mike Schuster, Jonathon Shlens, BenoitSteiner, Ilya Sutskever, Kunal Talwar, Paul Tucker,Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas,Oriol Vinyals, Pete Warden, Martin Wattenberg, Mar-tin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow:Large-scale machine learning on heterogeneous systems,2015. URL . Softwareavailable from tensorflow.org.François Chollet et al. Keras. https://keras.iohttps://keras.io