Supervised Learning in Temporally-Coded Spiking Neural Networks with Approximate Backpropagation
Andrew Stephan, Brian Gardner, Steven J. Koester, Fellow, IEEE, and André Grüning
Abstract—In this work we propose a new supervised learning method for temporally-encoded multilayer spiking networks to perform classification. The method employs a reinforcement signal that mimics backpropagation but is far less computationally intensive. The weight update calculation at each layer requires only local data apart from this signal. We also employ a rule capable of producing specific output spike trains; by setting the target spike time equal to the actual spike time with a slight negative offset for key high-value neurons, the actual spike time becomes as early as possible. In simulated MNIST handwritten digit classification, two-layer networks trained with this rule matched the performance of a comparable backpropagation-based non-spiking network.
Index Terms—Spiking Neural Networks, Backpropagation, Supervised Learning, Temporal Encoding, Reinforcement.
I. INTRODUCTION
Backpropagation is a commonly used optimization procedure for Artificial Neural Networks (ANNs) in data classification, and has demonstrated particular success in recent years as increased processing speed has become available through massively parallel computing architectures. Despite this, the energy requirements of computing as applied to traditional ANNs continue to rise with the complexity of input data, limiting the versatility of machine learning in real-world applications. Spiking Neural Networks (SNNs), which incorporate the actual firing times, or spikes, of simulated neurons as part of their operational model, have been identified as a means to provide more energy-efficient, event-based computing, while theoretically remaining capable of greatly increased computational power over ANNs [1]. Despite this, applying backpropagation to SNNs has proven challenging, since the spikes emitted by neurons have no smooth functional dependence on network parameters, making reliable gradient estimation an issue. Over the last decade there have been numerous attempts to address this spike-gradient challenge, using various approximations to produce gradients that can be differentiated with respect to network parameters.

One of the first examples of combining backpropagation with an SNN is described in [2], where a feedforward network containing a single hidden layer was trained to associate spatiotemporal input patterns with desired network responses. The network's error function was defined in terms of the squared difference between desired and actual firing times in the output layer, upon which gradient descent was performed to obtain weight gradients in each layer. To solve for the gradient of a discontinuous firing time, the authors assumed a linear dependence of this quantity on its activating input over a sufficiently small temporal region.
While this approximation displayed success on limited, small-sized datasets, some drawbacks were noted, such as the rule's reliance on small learning rates to ensure eventual convergence, as well as its limitation to learning only single-spike responses in each layer.

A later approach to combining backpropagation with SNNs was proposed in [3] for learning desired input-output spike pattern associations. A first step taken by the authors of this study was to relate a previous spike-based supervised learning method, called ReSuMe [4], to an error-minimization procedure based on gradient descent. Secondly, the activations of neurons in a multilayer network were formulated in terms of their instantaneous firing rates, which were assumed to have a linear functional dependence on their previous-layer inputs. Hence, by applying backpropagation to this approximately-equivalent rate-based network, and then substituting neuronal firing rates for their actual spike trains, ReSuMe could be applied to multilayer learning. As a proof of concept the authors demonstrated the success of this rule on a small benchmark data classification task. The main limitation of this approach relates to the assumption that neurons in the network can be treated as linear activations, which potentially diminishes the ability of the network to solve linearly non-separable classification tasks.

A further example of backpropagation in SNNs was shown in [5], which extended a maximum-likelihood learning scheme from single to multilayer network structures. Building on the work of [6], this study considered networks containing stochastic spiking neurons, such that spike-gradients could be approximated by their expectations over generated sequences of output spikes. This probabilistic approach has the advantage of still retaining the non-linear behavior of the network, since the actual firing times of hidden neurons are taken into account when computing weight gradients.
The capability of this rule was demonstrated quite extensively on a variety of associative learning tasks, although its performance on real-world datasets has not been tested.

A more recent examination of backpropagation in SNNs has taken advantage of the analytical tractability of non-leaky integrate-and-fire neurons constrained to single-spike responses [7]. Specifically, the author arrived at a closed-form solution for backpropagation in a multilayered network of such neurons, making it more easily extendable to SNNs containing more than one hidden layer. Furthermore, since just the first output spikes emitted by neurons were considered, the network was capable of classifying MNIST digits with minimal delay, with low test errors of around 3%. The main limitation of this study relates to its restriction to just single spikes per neuron in all layers.

One of the most recent approaches to formulating spike-based backpropagation has relied on the use of 'surrogate gradients' in order to establish smooth functional derivatives with respect to network parameters [8]. Termed SuperSpike, this learning rule works to minimize a spike train (dis)similarity measure called the van Rossum Distance (vRD) [9], which is a function of the distance between target and actual output spike trains in the final layer of a multilayer SNN. In estimating the gradient of a spike train, the author used the slope of a sigmoid as an auxiliary function. In terms of the rule's performance, good accuracy was demonstrated in its ability to precisely match arbitrary spatiotemporal spike patterns.
As is the limitation with most spike-based backpropagation rules, SuperSpike relies on the presence of hidden spikes in order to allow the computation of weight gradients.

The above studies highlight the typical approach that is taken to training multilayer SNNs: selecting an appropriate output objective function, and establishing a smooth functional dependence with respect to the network's parameters such that backpropagation can be applied. Although this procedure has displayed success on certain learning tasks, there remains the issue that computing weight gradients in such a way is usually computationally expensive, and in most cases difficult to scale when applied to deeper network architectures. Certain model simplifications, such as those described in the work of [7], help mitigate these issues, although this then trades off the increased computational capability afforded by SNNs when compared against ANNs.

Generally speaking, training SNNs using an unsupervised learning method such as Spike-Timing-Dependent Plasticity (STDP) [10] or a single-layer, supervised learning rule [4], [6], [11]–[14] is inherently advantageous in terms of computational efficiency over spike-based backpropagation when computing weight gradients. This follows from their reliance on immediately available learning factors, specifically pre- and postsynaptic activity variables, whereas backpropagation also explicitly depends on an error signal that requires a series of backpropagation steps to be computed, starting from the final layer. Furthermore, as described previously, backpropagation applied to SNNs also relies extensively on approximate spike-gradients, which for the most part is avoided with localized learning methods. Despite this, it is clear that backpropagation is better suited to training multilayer networks as applied to data classification tasks.
This is apparent when considering that individualised error signals are backpropagated through the network, informing each synapse of its specific contribution to the network's overall accuracy during training. Hence, it becomes reasonable to suppose that a hybrid approach to SNN training, combining localized layer-wise learning with generalised feedback signalling, is capable of network performance approaching that of backpropagation, but without the sacrifice in computational cost.

In this paper we propose a novel hybrid learning algorithm for multilayer SNNs that uses one learning process for all of its layers. The algorithm builds on the work of [5], [14], beginning with a supervised learning method that circumvents the discontinuous spike-gradient issue by taking a maximum-likelihood approach to learning, which is then taken in the limit of a deterministic system to allow for more precise network responses [14]. With respect to each layer, the algorithm computes weight gradients by combining a set of localized learning factors with a simple reinforcement signal that carries a summary of the behavior of downstream layers in the network. This signal is condensed into a set of analog values associated with each postsynaptic neuron. This allows the learning algorithm to make decisions while armed with some knowledge of the global impact of each synapse, without needing to backpropagate a complex error gradient function.

The rest of this paper is organized as follows. In Section II we describe the mathematical model of the neurons and input encoding, as well as the basic network structures employed. The learning algorithm is explained in Section III, and the simulation results given in Section IV. We conclude with a discussion in Section V.

II. NETWORK MODEL
A. Spiking Neuron Model
This work uses the simplified Spike Response Model (SRM_0) to describe the dynamics of simulated neurons [10]. Specifically, a postsynaptic neuron's membrane potential u_i at time t is described by

    u_i(t \,|\, \mathbf{x}, y_i) = \sum_j w_{ij} \sum_{t_j \in x_j} \epsilon(t - t_j) + \sum_{t_i \in y_i} \kappa(t - t_i),   (1)

where x_j \in \mathbf{x}, x_j = \{t_{j,1}, t_{j,2}, \ldots\}, is the spatiotemporal spike pattern of all presynaptic neurons indexed by j, and y_i = \{t_{i,1}, t_{i,2}, \ldots\} is the sequence of emitted spikes, or spike train, of the postsynaptic neuron with index i. On the right-hand side of the equation, w_{ij} is the synaptic weight between pre- and postsynaptic neurons j and i, respectively, \epsilon describes the form of the Postsynaptic Potential (PSP) evoked at the postsynaptic neuron in response to a single presynaptic spike, and \kappa is the reset kernel. The first double sum determines the net weighted PSP response due to all incoming spikes, while the second sum describes postsynaptic refractory effects due to the emission of output spikes. The \epsilon kernel is given by

    \epsilon(s) = \epsilon_0 \left[ \exp\!\left(-\frac{s}{\tau_m}\right) - \exp\!\left(-\frac{s}{\tau_s}\right) \right] \Theta(s),   (2)

where \tau_m = 10 ms and \tau_s = 5 ms are the membrane and synaptic time constants, respectively, and the coefficient \epsilon_0 = 4 mV. \Theta is the Heaviside step function. The reset kernel is given by

    \kappa(s) = (u_r - V_t) \exp\!\left(-\frac{s}{\tau_m}\right) \Theta(s),   (3)

where u_r = 0 mV and V_t = 15 mV are the reset and firing threshold potentials, respectively. In terms of spike generation, the postsynaptic neuron fires a spike when its membrane potential exceeds V_t, immediately after which its potential is reset to u_r according to Eq. (3).

B. Network Structure
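As a concrete reference for the network structures described in this section, the membrane dynamics of Eqs. (1)-(3) above can be sketched numerically as follows. This is an illustrative sketch only: the spike times and weights passed to the function are arbitrary examples, not values from the paper.

```python
import numpy as np

# Constants from Eqs. (2)-(3) of the text.
TAU_M, TAU_S = 10.0, 5.0   # membrane and synaptic time constants (ms)
EPS0 = 4.0                 # PSP coefficient (mV)
U_R, V_T = 0.0, 15.0       # reset and firing threshold potentials (mV)

def eps(s):
    """PSP kernel (Eq. 2): double exponential, zero for s < 0 (Heaviside)."""
    s = np.asarray(s, dtype=float)
    return EPS0 * (np.exp(-s / TAU_M) - np.exp(-s / TAU_S)) * (s >= 0)

def kappa(s):
    """Reset kernel (Eq. 3) applied after each output spike."""
    s = np.asarray(s, dtype=float)
    return (U_R - V_T) * np.exp(-s / TAU_M) * (s >= 0)

def membrane_potential(t, weights, input_spikes, output_spikes):
    """u_i(t) of Eq. (1): weighted PSPs from every presynaptic spike,
    plus refractory resets from the neuron's own past output spikes."""
    u = sum(w * eps(t - np.asarray(tj)).sum()
            for w, tj in zip(weights, input_spikes))
    if output_spikes:
        u += kappa(t - np.asarray(output_spikes)).sum()
    return float(u)
```

In practice the neuron would be stepped through time, emitting a spike whenever `membrane_potential` exceeds `V_T`.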
We will employ two different network structures in the experiments carried out in this paper. The first consists of a single SRM neuron with either one or two input channels, each of which contributes a single spike at a preset time. This layout will be used to study the firing responses of individual neurons subject to the training scheme that will be described in Section III. The second network will be designed to test the performance of the proposed hybrid learning algorithm as applied to the MNIST handwritten digit classification problem, consisting of an input layer, at least one hidden layer and an output layer. The input layer fulfills the role of encoding digits as precisely-timed spikes, which shall be described in more detail in the following subsection. The hidden and output layers consist of SRM neurons that perform computations on these inputs, which shall be trained using the rule proposed below. The hidden layers shall contain 100, 200 or 300 neurons, while the size of the output layer shall be held fixed at 10 neurons, each of which corresponds to one of the 10 handwritten digit classes. Whichever output neuron produces the first spike in response to a presented digit is used to decide the network's classification of the input. If no output neurons spike, no classification is made. The network is trained in 500 batches of 20 iterations each, as limited by available computing resources. Both the single-layer and multi-layer network layouts are simulated for 10 ms per iteration.

C. Temporal Encoding of Input
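The latency code defined in this subsection (Eq. (4) below) can be sketched as follows. Since the exact values of the breadth σ and the pixel threshold p_t did not survive extraction, the constants here are illustrative assumptions, as is the placement of the Gaussian's center at p = 1.

```python
import numpy as np

T = 10.0      # simulation window (ms), as in the text
SIGMA = 0.5   # breadth of the sensitivity function (assumed value)
P_T = 0.1     # pixel threshold below which no spike is emitted (assumed value)

def encode_pixel(p):
    """Map a normalized pixel p in [0, 1] to a spike latency in ms.

    Bright pixels (p near 1) spike near t = 0; dimmer pixels spike later;
    pixels below threshold produce no spike, represented as +inf.
    """
    if p < P_T:
        return np.inf
    return T * (1.0 - np.exp(-((p - 1.0) ** 2) / (2.0 * SIGMA ** 2)))
```

Applying `encode_pixel` to all 784 normalized pixels yields the input layer's spike pattern for one image.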
To solve the MNIST classification problem [15] we must first translate the images into input spikes. Specifically, the input layer, or layer 0, is set up to contain 784 channels, corresponding to the 784 pixels in the 28x28 image with one-to-one association. Each channel is then driven to produce up to one spike, where the spike timing depends on its associated pixel intensity. The timings of spikes are determined as follows. First, pixels are normalized such that their values fall within the range [0, 1]. We then choose a Gaussian sensitivity function for each timing, inspired by the population encoding model in [16]:

    t_j^0(p_j) = \begin{cases} T \left[ 1 - \exp\!\left( -\frac{(p_j - 1)^2}{2\sigma^2} \right) \right] & \text{for } p_j \geq p_t \\ \infty & \text{for } p_j < p_t \end{cases}   (4)

where t_j^0(p_j) is the time of the input spike produced by channel j in the 0th layer as a function of the pixel strength p_j, T = 10 ms is the duration of the simulated time, and \sigma determines the breadth of the sensitivity function. Any pixel value below the threshold p_t is considered not to produce a spike, as represented by an infinite response time. For a pixel value greater than the threshold, as the value increases the spike timing occurs closer to t_j^0(p_j) = 0, while a decrease in its value delays the spike, up to a maximum delay approaching T. From a biological perspective, this rapid encoding scheme is supported by observations of neural populations in the brain being capable of encoding visual information using spikes occurring within a very small time window, on the order of around
10 ms [17].

III. LEARNING THEORY
A. FILTered-error (FILT) Learning Rule
We start our analysis by first describing the FILT synaptic plasticity rule, a supervised method which was originally derived in [14] to impose a target output spike train on an SRM neuron. The distinction of this rule as compared with most other single-layer, spike-based learning rules is that, in the initial stage of its derivation, a stochastic firing-rate substitution was used to circumvent the non-differentiable spike-gradient problem. Thereafter, in taking the limit of a deterministic spiking neuron with a fixed firing threshold, the following \lambda learning window, as a function of the separation between target and actual output spike times, was derived:

    \lambda(s) = \begin{cases} \epsilon_0 \left[ C_m \exp\!\left(-\frac{s}{\tau_m}\right) - C_s \exp\!\left(-\frac{s}{\tau_s}\right) \right] & \text{for } s > 0 \\ \epsilon_0 (C_m - C_s) \exp\!\left(\frac{s}{\tau_q}\right) & \text{for } s \leq 0 \end{cases}   (5)

where \tau_q = 10 ms, C_m = \tau_m / (\tau_m + \tau_q) and C_s = \tau_s / (\tau_s + \tau_q). The \tau_q term is a spike-linkage variable, describing the coincidence between target and actual output spike times. Hence, given an input spike pattern \mathbf{x}, and single target and actual output spike times \tilde{t}_i and t_i, respectively, the learning rule is applied as follows with learning rate \eta:

    \Delta w_{ij} = \eta \left[ \sum_{t_j \in x_j} \lambda(\tilde{t}_i - t_j) - \sum_{t_j \in x_j} \lambda(t_i - t_j) \right].   (6)

This learning rule takes the difference between two spike-timing coincidence windows, driving a neuron to fire a single spike at its target timing via synaptic weight modification. Specifically, the first term on the right-hand side measures the ability of each input spike at t_j to induce an output spike at \tilde{t}_i, while the second term measures the responsibility of each input spike at t_j for causing an output spike at t_i. Using this information, the rule changes weights in such a way as to suppress any inputs which contribute to an output spike occurring far from \tilde{t}_i, and to reinforce any inputs which will contribute to the emission of an output spike near \tilde{t}_i. The rule also works to produce graded weight adjustments when t_i is close to \tilde{t}_i, resulting in smooth convergence towards a solution by gradually changing the moment at which the neuron's membrane potential crosses its firing threshold in response to input-evoked PSPs. In this work only the first output spike from each neuron is used for the learning window, although the FILT rule is capable of handling multiple spikes per neuron. In its original formulation the FILT rule was devised to train single-layer SNNs, and tested on randomly-generated input data.

B. Approximate Backpropagation using FILT
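Before extending FILT to multiple layers, the single-layer window and weight update of Eqs. (5)-(6) above can be sketched as follows; the function names are illustrative, and the constants are those given in the text.

```python
import numpy as np

# Time constants and PSP coefficient from the text (ms, mV).
TAU_M, TAU_S, TAU_Q = 10.0, 5.0, 10.0
EPS0 = 4.0
C_M = TAU_M / (TAU_M + TAU_Q)
C_S = TAU_S / (TAU_S + TAU_Q)

def filt_window(s):
    """Spike-coincidence window lambda(s) of Eq. (5); continuous at s = 0."""
    if s > 0:
        return EPS0 * (C_M * np.exp(-s / TAU_M) - C_S * np.exp(-s / TAU_S))
    return EPS0 * (C_M - C_S) * np.exp(s / TAU_Q)

def filt_update(eta, input_spikes, t_target, t_actual):
    """Delta w_ij of Eq. (6): reinforce inputs coincident with the target
    time, suppress inputs responsible for the actual output time."""
    return eta * sum(filt_window(t_target - tj) - filt_window(t_actual - tj)
                     for tj in input_spikes)
```

Note that when the actual spike already lands on the target, the two window terms cancel and the weight is left unchanged.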
This work uses an augmented version of the FILT rule in order to tackle the MNIST classification problem using a multilayer SNN. In [14] the goal was restricted to training single-layer SNNs to learn associations between arbitrary input and output spike patterns. The goal in this work is to correctly classify input images of handwritten digits via a first-to-spike decision process in the output layer of a network, as facilitated using precise spike time learning in hidden and output layers. With the ultimate purpose of inducing the correct output neuron to spike earliest in response to its associated class of input data, we use the target spike time \tilde{t}_j^l of each neuron j in a layer l as a training parameter. Rather than being static as in [14], each \tilde{t}_j^l is selected anew for each neuron in each training iteration according to two metrics. These metrics are: (1) the first actual output spike time t_j^l, if present, and (2) a figure of merit associated with each neuron which we term the "desirability" d_j^l. The desirability reflects a measure, relative to the other neurons in layer l, of how helpful an early spike from neuron j will be. The desirability of every neuron in each layer lies within the range [-1, 1] and is computed recursively, with the output layer l = L as the base case. For classification, the output layer neuron which is assigned to the correct class has a default desirability of 1, while all other outputs have -1. For neurons in any non-output layer, 1 \leq l < L, the desirability is calculated using the desirabilities of the neurons in layer l+1 and the corresponding weights w_{ij}^{l+1}. Specifically, the vector of pre-normalized desirabilities for all N_l neurons in layer l is \tilde{\mathbf{d}}^l \in \mathbb{R}^{N_l}, calculated as

    \tilde{\mathbf{d}}^l = (\mathbf{w}^{l+1})^T \cdot \mathbf{d}^{l+1},   (7)

where \mathbf{w}^{l+1} \in \mathbb{R}^{N_{l+1} \times N_l} is the matrix of weights connecting layer l to layer l+1, and \mathbf{d}^{l+1} \in \mathbb{R}^{N_{l+1}} is the vector of desirabilities in layer l+1. Bold symbols are used to indicate vector or matrix quantities.
The \tilde{\mathbf{d}}^l vector is then normalized:

    \mathbf{d}^l = 1 + 2\,\frac{\tilde{\mathbf{d}}^l - \max(\tilde{\mathbf{d}}^l)}{\max(\tilde{\mathbf{d}}^l) - \min(\tilde{\mathbf{d}}^l)}.   (8)

This shifts the desirabilities in layer l to fall within the range [-1, 1], preventing signal decay in a network with many hidden layers. The relative rankings of the neurons in layer l are preserved by this normalization. The lowest \tilde{d}_j^l in the layer becomes d_j^l = -1 and the highest becomes d_j^l = +1. The desirability of a given neuron thus gives a relative measure of how strongly that neuron will excite (suppress) neurons in layer l+1 that themselves have a high (low) desirability.

The desirability rule is somewhat analogous to a typical SNN error backpropagation algorithm in that it contains a measure of how beneficial each neuron is given the desired outcome of the final layer. However, unlike true backpropagation, the desirability rule does not explicitly compute the error gradient with respect to each weight. Instead a much simpler calculation is used, thereby saving significantly on computing power. This also equips the network to counteract the vanishing gradient problem for deep networks.

After computing \mathbf{d}^l, the neurons are all trained via the FILT rule (6) with

    \tilde{t}_j^l = \begin{cases} t_j^l - \delta t & \text{for } d_j^l \geq d_t \\ \infty & \text{for } d_j^l < d_t \end{cases}   (9)

where d_t is the desirability threshold. If t_j^l does not exist, it is substituted by a default time proportional to the layer index l (in ms). The purpose of the target shift term \delta t is to draw forward the spike times of the most desirable neurons, to distinguish them from the others. In turn, the spike times of the neurons in layer l+1 connected most strongly to the desirable neurons in layer l will be similarly drawn forward. Ultimately the ideal output neuron will be encouraged to spike earlier than its neighbors, thus correctly classifying the input. The impact of the target shift \delta t for simple systems will be examined in Sec. IV.

C. Additional Learning Constraints
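As a recap of the previous subsection, the desirability recursion, normalization, and target-time assignment of Eqs. (7)-(9) can be sketched as follows; function and variable names are illustrative.

```python
import numpy as np

def desirability(w_next, d_next):
    """Propagate desirability from layer l+1 back to layer l (Eqs. 7-8).

    w_next: weight matrix of shape (N_{l+1}, N_l); d_next: desirability
    vector of layer l+1. Returns values normalized to the range [-1, 1].
    """
    d_raw = w_next.T @ d_next                       # Eq. (7)
    lo, hi = d_raw.min(), d_raw.max()
    if hi == lo:                                    # degenerate case: all equal
        return np.zeros_like(d_raw)
    return 1.0 + 2.0 * (d_raw - hi) / (hi - lo)     # Eq. (8)

def target_times(t_actual, d, d_t, delta_t):
    """Eq. (9): neurons above the desirability threshold get a target
    slightly earlier than their actual first spike; the rest get none (inf)."""
    return np.where(d >= d_t, np.asarray(t_actual, dtype=float) - delta_t,
                    np.inf)
```

The resulting targets are then fed directly into the FILT update of Eq. (6) for each layer in turn, starting from the output layer's fixed +1/-1 desirabilities.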
To accompany the FILT rule we also apply RMSprop [18] for synapse-specific, adaptive learning rates, providing a means to modulate the magnitude of weight changes. RMSprop has successfully been applied to multilayer SNN training in [8], motivating our choice here. With a typical value of \beta = 0.9, we calculate the RMSprop scaling term R_b for each batch b as

    R_b = \beta \cdot R_{b-1} + (1 - \beta) \cdot \Delta w_b^2,   (10)

where \Delta w_b is the set of all basic weight changes computed according to (6) in batch b, and R_0 = 0. Then the weights w_b are updated:

    w_b = w_{b-1} + \frac{\eta \cdot \Delta w_b}{\sqrt{\epsilon + R_b}},   (11)

where \eta and \epsilon are the learning rate and RMSprop offset term, respectively. The offset \epsilon is used to avoid singularities when R_b has entries close to zero.

Synaptic scaling is also applied to the hidden layers to ensure that a reasonable level of spiking activity is present. Without this the hidden units may stop spiking entirely and thus present no activity to excite subsequent layers. The final weights for each batch w_b^* are computed with an additional scaling term:

    w_b^* = w_b + \gamma \cdot |w_{b-1}^*| (1 - S_b),   (12)

where \gamma is the scaling coefficient and S_b is a matrix containing the number of spikes produced by each neuron. The precise matrix manipulations required to match the dimensions of S_b and w_b are omitted for brevity. A dropout rate of 35% is also applied to the network during training. Dropout has been shown to be useful in preventing overfitting [19].

Fig. 1. (a) The predicted change in a neuron's membrane potential u_i at its actual firing time t_i when trained with the FILT rule for a single iteration with \delta t = 0.5 ms. The neuron is stimulated by two input spikes occurring at 0 and 2 ms, each of which is received through a different synaptic weight. The terms \Delta u_i^{(1)} and \Delta u_i^{(2)} indicate the individual contributions to the net change \Delta u_i, which result from training on the first and second input spikes, respectively. An equilibrium point exists at t = t_E when \Delta u_i is zero. (b) Two simulated spike rasters showing that the spike time of a single neuron trained on these inputs approaches t_E whether the initial spike time is before or after t_E, demonstrating that t_E is an attractor point.

Fig. 2. (a) Attractive equilibrium spike time t_E for a single-input neuron as a function of the applied training shift \delta t. Counterintuitively, t_E is positively correlated with \delta t. (b) t_E for a two-input neuron with input spikes at [0, \Delta t] as a function of \delta t, where \Delta t is a parameter. Provided that \Delta t is not too much larger than \delta t, so that there is some training window overlap, the resulting t_E is positively correlated with \Delta t. Overall, the greater the delay between inputs and the greater the training shift, the more delayed the equilibrium spike time will be.

IV. SIMULATION AND RESULTS
A. Single Neuron Response
We first study the behavior of a simple one-neuron network as described in Sec. II B. The neuron is presented with one or more arbitrary input spike trains injected through two channels, and its weights are updated according to the FILT rule as defined by (6), with a target shift term \delta t as in (9). Suppose each input channel contributes a single spike, with timings t_1 and t_2 for the first and second channels, respectively. The membrane potential of the neuron, excluding the reset kernel, is then described by

    u_i(t) = w_1 \cdot \epsilon(t - t_1) + w_2 \cdot \epsilon(t - t_2).   (13)

If the neuron spikes at t_i and its weights are subsequently updated according to the FILT rule, with its target firing time shifted forwards by \delta t with respect to t_i, then its membrane potential at t_i changes by

    \Delta u_i(t_i) = \eta \left( \lambda(t_i - \delta t - t_1) - \lambda(t_i - t_1) \right) \cdot \epsilon(t_i - t_1) + \eta \left( \lambda(t_i - \delta t - t_2) - \lambda(t_i - t_2) \right) \cdot \epsilon(t_i - t_2).   (14)

In Fig. 1(a) we show \Delta u_i(t_i) as a function of t_i for a sample selection of spike times: t_1 = 0 ms and t_2 = 2 ms, with \delta t = 0.5 ms. This includes the individual contributions from the two input channels. There is a clear equilibrium point t_E at which \Delta u_i(t_i) = 0, indicating that if t_i = t_E the membrane potential of the neuron will not change as a result of applying FILT. However, if t_i > t_E then \Delta u_i(t_i) will be positive, causing t_i to decrease in the next iteration. Alternatively, if t_i < t_E then \Delta u_i(t_i) becomes negative, leading to an increase in t_i. Thus we conclude that t_E is an attractive equilibrium for t_i under the FILT rule with positive \delta t.

Fig. 3. Spike raster for a neuron trained with two different spike trains, one of which is chosen randomly for each training iteration. The solid lines indicate the ideal equilibrium time t_E corresponding to the two spike trains.
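This equilibrium can be located numerically by scanning \Delta u_i(t_i) of Eq. (14) for a sign change. The sketch below uses the inputs of Fig. 1 (t_1 = 0 ms, t_2 = 2 ms, \delta t = 0.5 ms); the learning rate only scales \Delta u_i and does not move the zero crossing, so it is omitted.

```python
import numpy as np

# FILT window and PSP kernel constants from the text.
TAU_M, TAU_S, TAU_Q, EPS0 = 10.0, 5.0, 10.0, 4.0
C_M, C_S = TAU_M / (TAU_M + TAU_Q), TAU_S / (TAU_S + TAU_Q)

def lam(s):
    """Learning window lambda(s) of Eq. (5)."""
    if s > 0:
        return EPS0 * (C_M * np.exp(-s / TAU_M) - C_S * np.exp(-s / TAU_S))
    return EPS0 * (C_M - C_S) * np.exp(s / TAU_Q)

def eps(s):
    """PSP kernel of Eq. (2)."""
    return EPS0 * (np.exp(-s / TAU_M) - np.exp(-s / TAU_S)) if s > 0 else 0.0

def delta_u(t_i, spikes=(0.0, 2.0), dt=0.5):
    """Change in membrane potential at the actual firing time, Eq. (14)."""
    return sum((lam(t_i - dt - tj) - lam(t_i - tj)) * eps(t_i - tj)
               for tj in spikes)

def find_t_e(lo=0.1, hi=9.9, n=2000):
    """Scan the simulation window and return the first sign change of delta_u,
    i.e. a numerical estimate of the equilibrium spike time t_E."""
    ts = np.linspace(lo, hi, n)
    vals = [delta_u(t) for t in ts]
    for a, b, va, vb in zip(ts, ts[1:], vals, vals[1:]):
        if va <= 0.0 <= vb or vb <= 0.0 <= va:
            return 0.5 * (a + b)
    return None
```

Consistent with the discussion above, `delta_u` is negative for early firing times and positive for late ones, with the crossing at t_E.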
To test this conclusion we simulated several trials involving a single neuron with two input channels contributing spikes at [0, 2] ms and randomly initialized weights. Fig. 1(b) shows the spike rasters for the neurons in two such trials where the initial spike was either earlier or later than t_E. In order to further understand the dynamics of a neuron being trained with a target shift, we calculated the predicted t_E value for various \delta t, for both a single-input neuron and a two-input neuron with the second input spike occurring an arbitrary time \Delta t after the first input spike (see Fig. 2). We note that, counterintuitively, the larger the applied \delta t, the greater t_E becomes. Adding a second input spike has the effect of shifting the t_E curve upwards when \delta t is sufficiently large that the corresponding single-input t_E value occurs after the second input spike. Prior to this point, the second input has no influence. In Fig. 3 we show a neuron learning with two different input spike trains. Each input train has a unique t_E to which the output spike time would converge if trained solely on that input. For each training iteration, one of the two spike trains is randomly selected and presented to the network as an input. Despite having to match two different input sets, the neuron does remarkably well in converging toward each t_E in response to each input train, as indicated by the solid lines in the figure.

We believe this discussion serves to illustrate the potential usefulness of setting the target time \tilde{t}_i = t_i - \delta t within the FILT rule. In many supervised spiking neuron training algorithms, a specific, absolute target time must be chosen. In any reasonable neuron model there will be some delay between a presynaptic spike and the corresponding PSP peak.

Fig. 4. Spike rasters for the input, hidden and output layers arising from two input images presented at 0 and 10 ms. A histogram of the spike timings for each layer is included as well.
With multiple inputs and multiple layers, it can therefore be difficult to judge the optimal time for each neuron to spike in response to its particular inputs, even prior to considering proper network output. The target shift addition to the FILT rule allows each neuron to find its own optimal spike time in response to each different input pattern, similar to an unsupervised neuron. However, we are also able, in a reinforcement manner, to restrict the spiking of specific neurons by modifying their target time according to the desirability value, as discussed above.

Fig. 5. Classification accuracy of the SNNs with the number of hidden units as a parameter. For comparison the dashed lines, from top to bottom, represent the final test accuracy of the TensorFlow networks with 300, 200 and 100 hidden units, respectively. Two different values of the learning rate \eta were tested. The network with 100 hidden units appears to have the best performance.

Fig. 6. (a) Classification accuracy of the SNNs with 100 hidden units, d_t = -0.1 and varying \eta. The best performance is achieved with \eta = 0.03, although the difference is slight. (b) Classification accuracy with \eta = 0.03 and d_t as a parameter. The best performance results from d_t = -0.1.

B. MNIST Classification
We now study the classification of images containing handwritten digits provided by MNIST. An example of the MNIST classification network output is shown in Fig. 4. The images of a '7' and a '6' are provided to the network at 0 and 10 ms, respectively. The images are transformed into input spikes as described in Section II C, which mostly cluster near 0 and 10 ms. The hidden neurons encode this information more evenly across the timing windows. Despite learning rules which attempt to enforce one spike per output neuron, some images such as the ones tested in Fig. 4 result in multiple output spikes. Despite this, the two samples provided to the network are both correctly classified. The 7th and 6th neurons, respectively, are the first to spike, as well as being the most active during their respective windows.

We tested networks with multiple different hidden layer sizes, learning rates and d_t thresholds. Each network was trained for 500 batches with batch sizes of either 20 or 100 samples each. There was no significant difference in results noted between the two batch sizes. For comparison, a standard fully-connected non-spiking network of equivalent size was trained in TensorFlow using sigmoidal activation neurons, which we treat as roughly equivalent to spiking neurons. This TensorFlow network utilized full error-backpropagation training. The mean test accuracy using ensembles of 10 networks with 100, 200 and 300 hidden units, respectively, is plotted against the batch number in Fig. 5, alongside dashed lines that indicate the final test accuracy of the TensorFlow networks. In these simulations d_t = 0. The three trials very nearly approach the accuracy of the backpropagation networks. For 100 hidden units the optimal learning rate is shown to be 0.03, as indicated in Fig. 6(a), although the variation between nearby \eta values is slight. In additional trials that explored \eta values outside this range, the performance dropped off sharply. In Fig. 6(b) we study the effect of varying d_t for the 100-unit network. The best case appears to be d_t = -0.1, indicating that roughly the top 45th percentile of hidden neurons should be reinforced and the rest suppressed. Indeed, the best trial matches or outperforms even the larger 300-unit backpropagation network shown in Fig. 5.

V. DISCUSSION
We have devised and demonstrated a new supervised learning rule for classifying images with spiking neural networks. We would like to begin the discussion by highlighting and reiterating several of its important aspects.

First, this rule begins with FILT, a single-layer supervised learning algorithm, and modulates it using the backpropagated desirability as a fitness signal. Imprecise backpropagation, relying on approximated feedback signalling, has previously been shown to be an effective learning method [20], [21]. The FILT rule, among some others like it, is derived using a rate-substitution method which solves the discontinuous spike-gradient problem. However, unlike some other supervised spike-based learning rules such as [3], FILT preserves nonlinearity by assuming nonlinear spiking rate functions. In general, any suitable approximate backpropagated signal could be used in tandem with any functional single-layer learning rule for spiking neurons, leading to a new class of hybrid multilayer learning methods. This class possesses some of the advantages of both full SNN backpropagation methods and local learning methods.

Second, this hybrid class has significantly reduced computational complexity due to the lack of a true error function and accompanying gradient calculation. By using only implicit reference to network output, we produce a learning rule that would be especially helpful for deep networks and neuromorphic hardware.

Third, by beginning with a single-layer training rule that seeks to impose a specific output spike train on the neurons and using \tilde{t}_j^l = t_j^l - \delta t, we allow the chosen neurons to find their own natural spike time. This maintains some of the degrees of freedom in the network while still imposing a desired pattern, by selecting which neurons are trained according to a finite \tilde{t}_j^l via the d_j^l and d_t values. This bears some relation to STDP, a timing-based rule that many SNN implementations in hardware employ [22]–[26].
Hardware built to natively implement STDP should therefore be more easily adapted to use the reinforced, timing-based FILT rule as opposed to a full backpropagation method. An example is Heidelberg's Digital Learning System (DLS) platform, which not only implements local, STDP-like, pre- and postsynaptic correlation traces, but also includes an embedded microprocessor dedicated to plasticity processing that is capable of combining local traces with a third factor, such as an external reward signal [25]. DLS also operates at a large speed-up factor of 1000 compared to the biological time-scale it emulates, making it ideally suited to long-term SNN simulations.

Interestingly, we observe that our hybrid learning approach shares certain similarities with a reward-maximization procedure as studied in [6], [27]–[29]. In more detail, we first note that the FILT rule approximates the supervised, maximum-likelihood approach to learning for a probabilistic neuron model, but taken in the limit of a deterministic system [14]. We then consider that, as described in [6], [29], if the target spike train of a neuron trained by maximum-likelihood is substituted with its actual output spike train, then the learning rule becomes an unsupervised one. Hence, if this unsupervised rule is combined with an external 'success signal' to guide weight changes, then the rule works in such a way as to maximize the likelihood of a neuron generating a spike train which positively correlates with the receipt of a positive-valued success signal: a process that is otherwise referred to as reward-maximization [29]. Taking these points into account, we relate our hybrid learning method based on FILT to the reward-maximization method for the following reasons:

• In our implementation the target firing time of a neuron is not explicitly supervised: the prescribed target timing actually depends on the actual output firing time of the neuron, but shifted earlier by a small amount.
• When unsupervised, the rule works to progressively shift the actual firing time of a neuron earlier; but since the FILT rule is additionally modulated by an external 'desirability' signal that is linked to the overall success of the network, the firing time is instead adjusted to shift either forwards or backwards according to how this influences the network's goal.

• Since the goal of the network is to drive an early response in one of the output layer neurons according to its associated class label, the communicated desirability works to shift the strongest contributing hidden spikes earlier.

• Although the evaluation of the modulatory signals used here involves more computational steps than those considered in [29], they essentially share the same end result of providing task-specific feedback in order to guide desirable weight changes.

Hence, it is for the above reasons that we can interpret our hybrid training method as driving a form of reward-maximization during network training, by a combination of layer-wise local learning factors modulated by external signalling. It is also notable that reward-modulation of local synaptic activity factors is considered a biologically-plausible hypothesis for learning [30], where for example the neuromodulator dopamine is hypothesized to encode a reward-prediction error signal in the brain [31].

In conclusion, we note that although our experiments demonstrated reasonably high accuracy of our hybrid training method on MNIST, this did not reach the levels of some other, more refined spiking classifier implementations such as in [7], [32]. However, we believe this sufficiently demonstrates proof of concept for the hybrid learning rule we pioneer herein. The results achieved here are a lower bound on the possible performance of the rule, as different choices of reinforcement signal and layer-wise learning rule, and additional tuning of the parameters, can further enhance the network accuracy.

REFERENCES

[1] W. Maass, "Networks of spiking neurons: the third generation of neural network models," Neural Networks, vol. 10, no. 9, pp. 1659–1671, 1997.
[2] S. M. Bohte, J. N. Kok, and H. La Poutré, "Error-backpropagation in temporally encoded networks of spiking neurons," Neurocomputing, vol. 48, no. 1-4, pp. 17–37, 2002.
[3] I. Sporea and A. Grüning, "Supervised learning in multilayer spiking neural networks," Neural Computation, vol. 25, no. 2, pp. 473–509, 2013.
[4] F. Ponulak and A. Kasiński, "Supervised learning in spiking neural networks with ReSuMe: sequence learning, classification, and spike shifting," Neural Computation, vol. 22, no. 2, pp. 467–510, 2010.
[5] B. Gardner, I. Sporea, and A. Grüning, "Learning spatiotemporally encoded pattern transformations in structured spiking neural networks," Neural Computation, vol. 27, no. 12, pp. 2548–2586, 2015.
[6] J.-P. Pfister, T. Toyoizumi, D. Barber, and W. Gerstner, "Optimal spike-timing-dependent plasticity for precise action potential firing in supervised learning," Neural Computation, vol. 18, no. 6, pp. 1318–1348, 2006.
[7] H. Mostafa, "Supervised learning based on temporal coding in spiking neural networks," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 7, pp. 3227–3235, 2017.
[8] F. Zenke and S. Ganguli, "SuperSpike: Supervised learning in multilayer spiking neural networks," Neural Computation, vol. 30, no. 6, pp. 1514–1541, 2018.
[9] M. C. W. van Rossum, "A novel spike distance," Neural Computation, vol. 13, no. 4, pp. 751–763, 2001.
[10] W. Gerstner and W. M. Kistler, Spiking Neuron Models: Single Neurons, Populations, Plasticity. Cambridge University Press, 2002.
[11] R. V. Florian, "The chronotron: a neuron that learns to fire temporally precise spike patterns," PLoS ONE, vol. 7, no. 8, p. e40233, 2012.
[12] A. Mohemmed, S. Schliebs, S. Matsuda, and N. Kasabov, "SPAN: Spike pattern association neuron for learning spatio-temporal spike patterns," International Journal of Neural Systems, vol. 22, no. 04, p. 1250012, 2012.
[13] R.-M. Memmesheimer, R. Rubin, B. P. Ölveczky, and H. Sompolinsky, "Learning precisely timed spikes," Neuron, vol. 82, no. 4, pp. 925–938, 2014.
[14] B. Gardner and A. Grüning, "Supervised learning in spiking neural networks for precise temporal encoding," PLoS ONE, vol. 11, no. 8, p. e0161335, 2016.
[15] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[16] S. M. Bohte, H. La Poutré, and J. N. Kok, "Unsupervised clustering with spiking neurons by sparse temporal coding and multilayer RBF networks," IEEE Transactions on Neural Networks, vol. 13, no. 2, pp. 426–435, 2002.
[17] C. P. Hung, G. Kreiman, T. Poggio, and J. J. DiCarlo, "Fast readout of object identity from macaque inferior temporal cortex," Science, vol. 310, no. 5749, pp. 863–866, 2005.
[18] G. Hinton, N. Srivastava, and K. Swersky, "Neural networks for machine learning, lecture 6a: overview of mini-batch gradient descent," Coursera, vol. 14, p. 8, 2012.
[19] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
[20] B. Gardner, I. Sporea, and A. Grüning, "Learning spatiotemporally encoded pattern transformations in structured spiking neural networks," Neural Computation, vol. 27, no. 12, pp. 2548–2586, Dec 2015.
[21] T. P. Lillicrap, D. Cownden, D. B. Tweed, and C. J. Akerman, "Random synaptic feedback weights support error backpropagation for deep learning," Nature Communications, vol. 7, p. 13276, 2016.
[22] A. Sengupta and K. Roy, "Encoding neural and synaptic functionalities in electron spin: A pathway to efficient neuromorphic computing," Applied Physics Reviews, vol. 4, no. 4, p. 041105, 2017.
[23] Y. Kim, Y. Zhang, and P. Li, "A reconfigurable digital neuromorphic processor with memristive synaptic crossbar for cognitive computing," ACM Journal on Emerging Technologies in Computing Systems (JETC), vol. 11, no. 4, p. 38, 2015.
[24] S. B. Furber, F. Galluppi, S. Temple, and L. A. Plana, "The SpiNNaker project," Proceedings of the IEEE, vol. 102, no. 5, pp. 652–665, 2014.
[25] S. Friedmann, J. Schemmel, A. Grübl, A. Hartel, M. Hock, and K. Meier, "Demonstrating hybrid learning in a flexible neuromorphic hardware system," IEEE Transactions on Biomedical Circuits and Systems, vol. 11, no. 1, pp. 128–142, 2017.
[26] C.-K. Lin, A. Wild, G. N. Chinya, Y. Cao, M. Davies, D. M. Lavery, and H. Wang, "Programming spiking neural networks on Intel's Loihi," Computer, vol. 51, no. 3, pp. 52–61, 2018.
[27] X. Xie and H. S. Seung, "Learning in neural networks by reinforcement of irregular spiking," Physical Review E, vol. 69, no. 4, p. 041909, 2004.
[28] R. V. Florian, "Reinforcement learning through modulation of spike-timing-dependent synaptic plasticity," Neural Computation, vol. 19, no. 6, pp. 1468–1502, 2007.
[29] N. Frémaux, H. Sprekeler, and W. Gerstner, "Functional requirements for reward-modulated spike-timing-dependent plasticity," Journal of Neuroscience, vol. 30, no. 40, pp. 13326–13337, 2010.
[30] E. Vasilaki, N. Frémaux, R. Urbanczik, W. Senn, and W. Gerstner, "Spike-based reinforcement learning in continuous state and action space: when policy gradient methods fail," PLoS Computational Biology, vol. 5, no. 12, p. e1000586, 2009.
[31] W. Schultz, "Multiple reward signals in the brain," Nature Reviews Neuroscience, vol. 1, no. 3, p. 199, 2000.
[32] P. O'Connor, D. Neil, S.-C. Liu, T. Delbruck, and M. Pfeiffer, "Real-time classification and sensor fusion with a spiking deep belief network," Frontiers in Neuroscience, vol. 7, p. 178, 2013.