Scalable NoC-based Neuromorphic Hardware Learning and Inference
Haowen Fang, Amar Shrestha, De Ma, Qinru Qiu
Department of Engineering and Computer Science, Syracuse University, Syracuse, New York
College of Computer Science and Technology, Zhejiang University, China
Email: {hfang02, amshrest, qiqiu}@syr.edu, [email protected]

Abstract—Bio-inspired neuromorphic hardware is a research direction to approach the brain's computational power and energy efficiency. Spiking neural networks (SNN) encode information as sparsely distributed spike trains and employ the spike-timing-dependent plasticity (STDP) mechanism for learning. Existing hardware implementations of SNN are limited in scale or do not have in-hardware learning capability. In this work, we propose a low-cost scalable Network-on-Chip (NoC) based SNN hardware architecture with fully distributed in-hardware STDP learning capability. All hardware neurons work in parallel and communicate through the NoC. This enables the chip-level interconnection, scalability and reconfigurability necessary for deploying different applications. The hardware is applied to learn MNIST digits as an evaluation of its learning capability. We explore the design space to study the trade-offs between speed, area and energy, and discuss how to use this procedure to find the optimal architecture configuration.
Index Terms—Spiking neural network, Network on chip, STDP learning, Unsupervised learning
I. INTRODUCTION
In the field of deep learning, convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have been developed to perform a series of human-level cognitive applications [1] [2]. However, their tremendous computation and memory requirements seriously challenge the processing efficiency of deep learning systems [3] [4]. The limitations of the Von Neumann architecture, coupled with the increasing power demands due to the end of Dennard scaling and the approaching end of Moore's Law, have motivated multiple research efforts into low-power, highly parallel and distributed computing architectures [5] [6] [7] [8] and brain-inspired computing architectures [9] [10]. The brain as a source of inspiration is not surprising given its ability to process massive amounts of real-time information while consuming less than 20 W of power [11]. The goal of neuromorphic hardware design is to explore bio-inspired architectures to achieve cognitive functions in real time using lower power and a smaller footprint than traditional Von Neumann architectures.

The brain, in simplistic terms, is a collection of neurons interconnected in a vast network through links called synapses. The communication between neurons in this vast network gives the brain its processing abilities: pattern recognition, classification, associative memory, reasoning, etc. The basis of this communication is short asynchronous electrical pulses/action potentials called spikes. Spiking neural networks (SNNs), which use spikes as the basis for communication, are the third generation of neural networks [12]. As each neuron works asynchronously in an event-driven manner, SNNs have the potential to reach very low energy dissipation. When the spiking activity in SNNs is stochastic, i.e. spikes are generated as a stochastic process, the information is carried by the statistics of a group of spikes instead of by individual spikes.
This makes the SNN more biologically plausible and also improves its fault tolerance and noise (delay) resilience. The network of neurons in the brain can learn patterns by modifying the synapses linking the neurons based on their causal relative spike timings. This local and causal learning rule is called Spike-Timing Dependent Plasticity (STDP) [13]. As it is based only on local information of individual neurons, fully distributed learning [13] can be achieved on SNNs.

Several challenges exist in implementing STDP learning in hardware. Firstly, the STDP rule is typically an exponential function, which is expensive for hardware implementation. Secondly, since the final value of a synaptic weight is unknown during learning, the hardware implementation must consider the worst case and be ready to provide a wide range and high precision for every synapse. Hence, more memory is required for hardware neurons that have learning capability than for hardware neurons that perform inference only.

SNN hardware requires massive interconnection for parallelism and scalability. The Network-on-Chip (NoC) architecture has been used to provide on-chip communication for massively parallel systems. Traditional NoC design aims at minimizing communication latency and router congestion. As we will show later, due to the time-multiplexed nature of the neuron hardware implementation, and the asynchronous and stochastic neuron behavior, the latency of inter-neuron communication is not a performance bottleneck in a hardware SNN. This property enables us to significantly simplify the router design and reduce hardware cost.

In this work, we adopt the Q2PS approximation of the STDP rule [14] to simplify the hardware exponential function. Different synaptic weights are encoded differently to provide both a wide range and a high precision without increasing the storage. A very low-cost NoC is designed to provide just enough communication capability for an SNN.
Our main contributions include:
1) A low-cost time-multiplexed hardware spiking neuron design. Replacing multiplications and exponential functions with add and shift operations reduces hardware complexity as well as power consumption. The time-multiplexed physical neuron design improves resource utilization and neuron density.
2) A compact NoC design with a low-cost router, which is application specific and optimized for spiking neural networks.
3) An experimental demonstration of the in-hardware STDP learning capability by applying it to unsupervised learning of MNIST digits.
4) A design space exploration to study speed, area and energy trade-offs, with suggestions of design choices based on the analysis.

II. RELATED WORKS
There have been several existing research works on SNNs from the hardware perspective [6] [9] [10]. IBM's TrueNorth neurosynaptic processor [9] has achieved state-of-the-art performance with a minimal energy footprint on many tasks [15] [16], but it does not provide in-hardware learning capabilities. SpiNNaker [6] has also been popular in the research community as a testbed for SNN applications. The SpiNNaker Chip Multiprocessor integrates 18 ARM cores and is capable of massively parallel simulation of spiking neural networks. [17] presents an analogue device that implements an artificial synapse with high energy efficiency, showing 30 nJ energy consumption for an epoch of a classification task. [18] proposed a programmable CMOS neuromorphic chip. The architecture aims at implementing biologically plausible circuits and is limited in scalability. To address the scalability issue, there are works that combine NoC techniques with SNNs. EMBRACE is an FPGA-based flexible and reconfigurable SNN architecture [19]. It uses a NoC to handle inter-neuron communication. EMBRACE also features genetic-algorithm-based on-chip training: it randomly initializes neuron configurations and performs fitness evaluation, crossover and selection until the optimal SNN configuration is obtained. [20] presents H-NoC, an architecture for spiking neural networks. The goal of H-NoC is to reduce packet delay, and it assumes that each neuron in the SNN has a dedicated port to the router, although the details of the neuron are not given. As we will show later, the time-multiplexed neuron core design and the asynchronous nature of the neuron activities relax the latency constraint, so a more simplified NoC design suffices for the SNN application.

Hardware implementations of STDP learning [21] [22] focus more on circuit- and device-level analysis to achieve variable synaptic plasticity than on scalability.
[23] proposed a digital hardware neuron model for synaptic plasticity; it focuses on the design of individual neuron cores, and interconnection and scalability are not addressed. A few analog VLSI approaches to synaptic plasticity are proposed in [24] [25] [26], which focus on the design of individual synapses without addressing large-scale network implementation and architectural design. Emerging memristive devices have also been studied to realize artificial synapses and synaptic plasticity [27] [28] [29] [30] [31]. However, these studies are still at the proof-of-concept level, and the fabrication technology of memristive devices is not yet mature.

III. NEURON MODEL AND LEARNING RULE
We utilize a simplified version [32] of the neuron model proposed in [33]. Here the membrane potential u(t) of neuron Z is computed as

u(t) = w + u(t − 1) + Σ_{i=1}^{n} w_i · y_i(t)   (1)

where w_i is the weight of the synapse connecting Z to its i-th pre-synaptic neuron y_i, y_i(t) is 1 if y_i issues a spike at time t, and w models the intrinsic excitability (bias) of the neuron Z.

Fig. 1: Generic neuron model.

An integrate-and-fire neuron Z spikes when the membrane potential crosses the threshold, and then its membrane potential is reset to 0. When the threshold is set to be random over a specified range, the stochastic integrate-and-fire (SIF) neuron approximates the Bayesian neuron in [32].

In order to aggregate or relay spike activities, we also introduce the spiking Rectified Linear Unit (ReLU) neuron. A ReLU neuron accumulates every weighted input spike and discharges it over time, resembling a burst firing pattern. After a spike, the membrane potential of a ReLU neuron is computed as:

u(t) = u(t − 1) + Σ_{i=1}^{n} w_i · y_i(t) − U_th   (2)

STDP is the basis for learning in a spiking neuron model. Multiplicative STDP rules are stable but induce low competition, whereas additive rules are highly competitive but unstable. Both qualities, stability and competitiveness, are highly desirable. Most existing STDP rules utilize the exponential function, which is expensive for digital hardware implementation. Here, we utilize the low-cost Q2PS STDP rule proposed in [14] to approximate the exponential and multiplications using shifters, adders and a priority encoder. The analysis in [14] shows that the Q2PS rule is stable and highly competitive. The rule is given in Equation 3:

Δw_i = { 1 ≪ |Q̄|,  if Q̄ > 0
       { 1 ≫ |Q̄|,  if Q̄ < 0   (3)

where Q̄ is the quantization of Q through priority encoding, which is given as below:

If t_post − t_pre < τ_LTP, then Q = η′_LTP − w_i   (4)

If t_post − t_pre > τ_LTP or t_pre − t_post < τ_LTD, then Q = η′_LTD + w_i   (5)

where η′_LTP = log₂ η_LTP and η′_LTD = log₂ η_LTD, t_post and t_pre are the time steps at which the post- and pre-synaptic neurons spike, τ_LTP and τ_LTD are the LTP and LTD windows, and η_LTP and η_LTD are the LTP and LTD learning rates, respectively.

The base-2 exponential function in Q2PS can be implemented using a barrel shifter with very low hardware cost. The weight learned by this rule has a limited range [14], which will be exploited to reduce the storage requirement, as will be discussed in Section IV-E.

IV. HARDWARE ARCHITECTURE
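To make the arithmetic concrete, below is a minimal software sketch of the SIF neuron update (Equation 1) and the Q2PS weight update (Equations 3-5). It is an illustrative model, not the RTL: the function names, the floating-point representation and the sign handling of the quantized Q̄ are our assumptions, and the parameter values are arbitrary examples.

```python
import math
import random

def sif_step(u, w0, weights, spikes, theta_lo, theta_hi, rng=random):
    """One evaluation of a stochastic integrate-and-fire neuron (Eq. 1).
    u: membrane potential from the previous NEC; w0: intrinsic bias;
    spikes[i] is 1 if pre-synaptic neuron i fired. Returns (u, spike)."""
    u = w0 + u + sum(w * y for w, y in zip(weights, spikes))
    # A random threshold in [theta_lo, theta_hi] makes the firing stochastic.
    if u >= rng.uniform(theta_lo, theta_hi):
        return 0, 1          # spike, then reset the potential
    return u, 0

def priority_encode(q):
    """Q-bar: the most-significant-bit position of |Q| in fixed point,
    i.e. floor(log2(|Q|)). Hardware sign handling is not modeled here."""
    return 0 if q == 0 else math.floor(math.log2(abs(q)))

def delta_w(qbar):
    """Eq. 3: shift 1 left (or right) by |Q-bar|, i.e. compute 2**Q-bar."""
    return float(1 << qbar) if qbar >= 0 else 1.0 / (1 << -qbar)

def q2ps_update(w, t_post, t_pre, tau_ltp, tau_ltd,
                eta_ltp_log2, eta_ltd_log2):
    """Q2PS weight update. eta_*_log2 are the log2-domain learning rates
    (eta' in the text). LTP adds 2**Q-bar, LTD subtracts it (Eqs. 4-5)."""
    dt = t_post - t_pre
    if dt < tau_ltp:                       # Eq. 4: potentiation window
        q = eta_ltp_log2 - w
        return w + delta_w(priority_encode(q))
    if dt > tau_ltp or -dt < tau_ltd:      # Eq. 5: depression
        q = eta_ltd_log2 + w
        return w - delta_w(priority_encode(q))
    return w
```

Note how Δw is effectively |Q| rounded down to the nearest power of two, which is exactly what makes a barrel shifter sufficient in hardware.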
The proposed architecture consists of a grid of homogeneous neurons. Each individual neuron's behavior is programmable, and the detailed neuron configuration will be discussed in Section IV-D. We adopt a globally asynchronous, locally synchronous (GALS) approach and avoid using a global clock. Neurons and routers work asynchronously in different clock domains. A NoC is used as the global communication infrastructure to address the massive interconnection of the SNN. In this section, we will discuss the NoC design, the router architecture, the network interface and the hardware neuron design.
A. Network-on-Chip design
SNNs have massive numbers of interconnected neurons running simultaneously, with each neuron having a large fan-out [34]. Traditional on-chip communication solutions such as buses or point-to-point connections are limited in either scalability or flexibility [35]. NoC has been widely used to provide inter-core communication for massively parallel on-chip systems. A typical NoC architecture consists of three components: routers, channels and PEs (processing elements). Routers are interconnected by channels. Each PE is attached to a router, and PEs communicate with each other via multi-hop packet transmission. Based on the destination address provided by the packet, routers make routing decisions to forward it either to the next router or to the local PE. In this way, arbitrary network topologies can be implemented.

Traditional NoC design aims at minimizing communication latency and router congestion to ensure reliable communication. Large buffers, wide interconnects and a faster router clock (compared to the PE clock) are widely used techniques to achieve this goal. However, the proposed hardware SNN is highly resistant to latency. In a typical SNN, the spiking activities are sparse and sporadic [36]. The sparsity is even more visible in the hardware design due to the time-multiplexed nature of the neuron cores. As we will explain in Section IV-D, for a neuron core of M logical neurons with N axons, each neuron will be evaluated once every (M + 1)(N + 4) cycles. This interval is referred to as the neuron evaluation cycle (NEC). Assuming all neuron evaluations are randomly ordered, the average latency between spike generation and the required spike reception is T_NEC/2. Furthermore, because of the asynchronous behavior of neurons, it is not absolutely necessary for a spike generated in the current NEC to be received in the very next NEC. The STDP window is usually set to multiple NECs. A communication delay of 1 NEC will hardly affect learning and inference at all.
Finally, in-hardware learning automatically adjusts synaptic weights based on the hardware: links that consistently have long latency or dropped packets will eventually have low synaptic weight and hence become less important. Therefore, in this work, our application-specific NoC design aims at minimum silicon area and low overhead.

The router consists of five ports, dual-clock FIFOs, a crossbar switch and an arbiter, as shown in Figure 2. Every port is independent from the other ports and all of them work in parallel. Each port has a controller and routing logic. The routing logic implements the routing algorithm and determines the next hop of a packet. The controller detects the channel status and coordinates with the arbiter to make transmission decisions. The arbiter handles crossbar conflicts when an output is requested by multiple inputs. Each router is connected to its 4 neighbor nodes, to the north, south, east and west respectively. Each router also connects to its local PE. Each port is attached to a FIFO buffer that can hold one packet. We set the FIFO size to the minimum to reduce hardware cost. Routers work at the same frequency as the hardware neurons. All routers form a 2-D mesh.

The physical link width is a key factor in NoC performance and hardware cost. A link is realized by a number of parallel wires connecting two neighboring routers. A wider physical link can provide larger bandwidth and reduce transmission latency. However, the area overhead of the router increases quadratically as the link width increases [37]. Since the proposed hardware SNN is resistant to latency, we adopt the minimum-cost solution and set the link width to 4 bits.
Fig. 2: Router Structure
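The text only states that the arbiter grants one requesting input at a time while the other ports stall; the concrete policy is not specified. As an illustration, the sketch below models one plausible low-cost choice, a round-robin arbiter per output port (the class name and interface are ours, not the paper's):

```python
class RoundRobinArbiter:
    """Grants one requesting input port per cycle, rotating priority so
    that no port is starved. One plausible policy; the paper does not
    specify which arbitration scheme the router uses."""

    def __init__(self, n_ports):
        self.n = n_ports
        self.last = n_ports - 1   # start so that port 0 has priority first

    def grant(self, requests):
        """requests: one bool per input port. Returns the granted port
        index, or None if no port is requesting this output."""
        for offset in range(1, self.n + 1):
            port = (self.last + offset) % self.n
            if requests[port]:
                self.last = port   # rotate priority past the winner
                return port
        return None
```

When two ports contend for the same output, successive cycles alternate grants between them, while the stalled port simply waits until the transmission is done, as described above.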
B. Routing and flow control
We adopt X-Y routing for its low hardware cost and because it is deadlock free [38]. Each node holds its own coordinate (Xc, Yc) and each packet contains the destination node's coordinate (Xd, Yd). The router compares its own coordinate with the coordinate of the destination and decides where to forward the packet. The horizontal direction has higher priority than the vertical direction. If Xd > Xc, the packet is forwarded to the east neighbor, otherwise to the west neighbor. If Xd = Xc, the packet is routed vertically based on the comparison between Yc and Yd. If Xd = Xc and Yd = Yc, the packet is forwarded to the local port.

Each packet is an address event representation (AER) consisting of two fields: header and body. The header is the H-bit absolute coordinate of the destination, where H is determined by the NoC size. The body is again divided into two parts. The first part is an L-bit axon index, where L = log₂ N and N is the number of axons of a neuron core. The second part is reserved for debug and function extension.

The router employs wormhole switching as the flow control mechanism. A packet is split into a few flow control units (flits). Each flit has the same size as the physical link width, which is 4 bits. The H/4 header flits contain the routing information. As soon as the header is received, the router can forward the header flits to the next desired hop, and all subsequent payload flits will follow the same path. In this way, the asynchronous buffer does not have to store entire packets, and both the depth and the width of the buffer can be minimized, mitigating the high silicon area cost of the asynchronous buffers. A router stalls transmission when its neighbor is busy.

C. Network Interface
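The X-Y routing decision described above fits in a few lines. This is a behavioral sketch only; the port names and the assumption that increasing Y means "north" are ours, not RTL signal names:

```python
def xy_route(xc, yc, xd, yd):
    """Next hop for a packet at router (xc, yc) destined for (xd, yd).
    Horizontal direction is resolved first, then vertical (X-Y routing)."""
    if xd > xc:
        return "east"
    if xd < xc:
        return "west"
    # X coordinates match: route vertically.
    if yd > yc:
        return "north"    # assumption: +Y is the north direction
    if yd < yc:
        return "south"
    return "local"        # arrived: deliver to the local PE
```

Because every packet first exhausts its X offset and then its Y offset, no cyclic channel dependency can form, which is why X-Y routing is deadlock free.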
The network interface (NI) is the bridge between the router and the neuron. The NI is also responsible for decoding packets and buffering incoming spikes. The NI has a register array of length N; each bit corresponds to an input of a hardware neuron. Once a packet is received, the axon ID field is decoded into an address, and the specified bit of the register array is set to 1, indicating that a spike has been received. The NI also provides a local traffic bypass: all packets targeting the local neuron are directly decoded.

D. Neuron operations
Inspired by IBM TrueNorth [9], we designed the hardware neuron architecture aiming at low overhead, with in-hardware STDP learning capability added. The proposed hardware supports two major functions, inference and learning. The inference function integrates weights, updates the membrane potential and generates spikes. The learning function updates weights and biases based on the rules proposed in Section III.

In order to improve resource utilization and achieve a high density of neurons in the SNN, the hardware works in a time-multiplexed manner: the data path and control logic can be shared for multiple neurons' computation. We refer to the physical circuit that implements neuron behavior as a physical neuron. Each physical neuron can implement M logical neurons. As we will show in Section VI, the value of M affects the speed, cost and energy efficiency of the system. A physical neuron has N inputs called axons. The set of N axons is shared by all logical neurons in a neuron core. In this way, a logical neuron can connect to up to N logical neurons through a single spike packet, which reduces the NoC traffic. We refer to every connection between an axon and a logical neuron as a synapse. The connectivity between axons and logical neurons can be represented as a crossbar, as shown in Figure 3. Each dot in the figure is a synapse. Every synapse has a unique weight. If a neuron is not connected to an axon, the corresponding synaptic weight is 0.

Fig. 3: Logical neuron connectivity.

Each logical neuron performs inference followed by learning, if its learning function is enabled. Learning can only happen after inference, because it requires information such as the firing condition, pre-synaptic history and post-synaptic
history, which are updated during the inference. For performance efficiency, we parallelize the learning of the i-th neuron with the inference of the (i+1)-th neuron.

Fig. 4: Physical neuron structure.

A global synchronization signal coordinates the computation of all physical neurons, and the interval between two synchronization signals is one NEC. Each neuron is evaluated once every NEC. One NEC is partitioned into M + 1 slots: one for each logical neuron, with the last slot used to complete the learning operation of the last logical neuron. Each slot spans multiple clock cycles, and its duration is determined by the learning and inference latencies. As we can see, by evenly distributing the logical neurons over multiple physical neurons, fewer slots are required in a NEC, and the whole network can work at a higher frequency.

E. Hardware neuron design
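The slot schedule just described, in which the inference of logical neuron i overlaps with the learning of neuron i−1 so that M neurons occupy M+1 slots, can be sketched as follows (an illustrative model; the function name and tuple encoding are ours):

```python
def nec_schedule(m):
    """Slot plan for one NEC with m logical neurons: a list of
    (slot, inference_neuron, learning_neuron) tuples, None meaning idle.
    Slot i runs the inference of neuron i alongside the learning of
    neuron i-1; the extra final slot finishes the last neuron's learning."""
    return [(slot,
             slot if slot < m else None,        # inference engine
             slot - 1 if slot >= 1 else None)   # learning engine
            for slot in range(m + 1)]
```

For m = 3 this yields 4 slots, and each neuron appears exactly once in the inference column and once in the learning column per NEC.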
As shown in Figure 4, a physical neuron has 5 parts. The neuron controller is responsible for scheduling the computation of the logical neurons and for generating addresses and control signals for learning and inference. The data path is the key component implementing the neuron behaviors, including the inference and learning functions. The spike buffer is a register array of length N; each bit corresponds to an axon input. When the start signal is high, the content of the spike buffer is either cleared or overwritten by the output of the NI.

There are two memory banks in a hardware neuron. The configuration memory stores every logical neuron's behavior and learning parameters, such as the logical neuron type, learning mode, LTP learning rate (η_LTP), LTD learning rate (η_LTD), etc. The status memory stores the logical neurons' status parameters, including membrane potential, bias, threshold, axon weights, pre-synaptic history and post-synaptic history. Every physical neuron has its own memory, which is located next to the data path.

The overlapped learning and inference both require access to the weight memory at the same time. To solve this issue, a FIFO is used. When the i-th neuron is performing the inference function and accessing the weight memory, each weight is pushed into the FIFO. When the inference of the i-th logical neuron is done, the (i+1)-th logical neuron starts inference; at the same time, the learning function of the i-th logical neuron starts, and all required weights of the i-th neuron are fetched from the FIFO and sent to the data path.

A physical neuron with M logical neurons and N axons has M × N synapses, and each synapse has a unique weight. Therefore, weights consume the most memory resources. The Q2PS STDP rule limits the weights to a small range but requires high precision [14]. This enables reducing the integer bits without an accuracy penalty. However, some specialized networks such as winner-take-all circuits [14] require a wide weight range while precision is not important.
To mitigate this problem, each axon is associated with a scaling factor. By configuring the scaling factor, an axon can be switched between different precision and range levels, so that a wide weight range and high precision can both be satisfied.

Based on its inference function, a logical neuron can be configured in one of four modes: integrate and fire (IF), stochastic integrate and fire (SIF), spiking Rectified Linear Unit (ReLU) and learning mode. In IF mode, the neuron implements Equation 1; if the membrane potential exceeds its threshold, it forwards an AER packet to the router. In SIF mode, the threshold is a random number uniformly distributed in a given range, so that the neuron fires with a certain probability. In ReLU mode, the neuron implements Equation 2. If learning is enabled, the neuron performs the learning rules described in Section III. Separate hardware is used to implement the data paths for the inference and learning functions.

The inference data path is responsible for computing the membrane potential, issuing spikes and performing the stochastic behavior. It takes one clock cycle to accumulate the weight of each synapse. Adding the bias and the previous NEC's membrane potential takes another 2 clock cycles. Comparing the current membrane potential to the threshold and determining whether to spike takes one more cycle. At last, the new membrane potential is written back to the status memory. Assuming the axon number is configured as N, the inference evaluation of 1 logical neuron takes N + 4 clocks.

The learning data path implements the Q2PS rule, which uses an adder, a shifter, a priority encoder and a look-up table (LUT) to approximate the exponential STDP learning. The learning data path is pipelined and has four stages. In stage one, based on the spiking history and spiking condition, η_LTD or η_LTP is selected to perform LTP or LTD learning, and Q is computed as in Equation 4 or Equation 5. In stage two, Q̄ is obtained by priority encoding Q; then Equation 3 is computed, shifting 1 by Q̄ positions to get the change of synaptic weight. In stage three, the weight change is applied to the old weight. In stage four, the updated weight is written to the weight memory. The learning of the bias is also implemented in this stage. The difference between bias learning and weight learning is that bias learning is not a function of time: whether LTD or LTP is used depends only on the current spiking condition. Learning of a neuron with N axons also takes N + 4 clocks. Figure 5 shows the timing of the data path and pipeline.

Fig. 5: Data path pipeline (AW: accumulate weight, AB: accumulate bias, AP: accumulate potential, WB: write back, DC: delta compute, Q: quantization, WU: weight update).

In addition to the data paths for inference and learning, an STDP tracker is implemented to maintain the pre-synaptic and post-synaptic spike histories, which are critical to performing correct learning activity. The post-synaptic history tracker is a counter that is set to 0 in the NEC in which the logical neuron generates a spike, and incremented by 1 in every NEC otherwise. The pre-synaptic history tracker is also a counter that is set to 0 in the NEC in which a spike is received on that synapse, and incremented by 1 otherwise. The post-synaptic/pre-synaptic history valid flag is asserted if the tracker is less than the LTD/LTP window.

When learning mode is enabled, the STDP tracker determines when to expire the post-synaptic and pre-synaptic histories based on the firing condition, the valid flags and incoming spikes. When the logical neuron issues a spike and the pre-synaptic history is valid, the STDP tracker expires the pre-synaptic history. When the logical neuron receives a spike and the post-synaptic history is valid, the STDP tracker expires the post-synaptic history.

The proposed hardware also supports multicast. In multicast mode, the most significant bit of the extension field in a spike AER is a control bit. If the MSB is 1, the neuron controller increases the address pointer to read the next spike AER from the configuration memory and keeps sending write requests to the NI until a spike AER's MSB is 0. In this way, a logical neuron can have a flexible number of destinations, which allows the network to support more complex topologies, simplifies configuration and improves resource utilization.

V. RESULTS AND DISCUSSION

The proposed hardware design is implemented as a Verilog RTL-level model and synthesized on an Altera Arria 10 platform. In this section, we discuss the results in terms of learning and the performance of the NoC design, and explore the design space.
A. Unsupervised Learning of MNIST digits
The stochastic firing and STDP learning enable unsupervised feature learning in SNNs. To validate the functioning of the hardware design, we employ a simple pattern learning task. In this task, we utilize a simple winner-take-all (WTA) circuit to learn the handwritten digits 0 and 1 [32] from the MNIST data set. The network is trained using 100 samples, and each sample is exposed to the network for 100 NECs. In this experiment we set M and N to 128 and 256, respectively.

For convenience, given the size of the fan-in of a core (256 axons), we look to reduce the required number of inputs into any layer. As an MNIST image has 28x28 pixels, we employ an average-pooling-like mechanism over patches of 2x2 with a stride of 2 in the first layer. Thus the resultant input into the second layer is an average-pooled 14x14 MNIST image. The overall network consists of two layers. The first layer consists of 196 average-pooling neurons whose fan-in is 2x2, and the second layer consists of 4 SIF neurons with STDP learning enabled and 4 ReLU neurons to relay the spikes. These 8 neurons form a winner-take-all (WTA) circuit as described in [32]. The 196 neurons in the first layer are mapped to 14 cores, and all neurons in the second layer are mapped to 1 core. A 4x4 mesh network is configured to perform this experiment. The MNIST image is encoded into spike packets, and a dedicated router is used to inject external packets.

Fig. 6: Learnt weights. (a) Weight distribution; (b) weight visualization.

Figure 6(a) shows the distribution of the weights before and after learning. The stability of the Q2PS STDP rule can be seen from this figure. As all learnt weights are limited to the range [-1.41, 1.07], fewer integer bits are used to encode the weights, and hence memory usage is lower. The selectivity provided by the Q2PS STDP rule can also be observed. Before learning, the weights follow a uniform distribution in the range [-1, 1], while after learning the diverging weights of the network form a bimodal distribution, as expected from [14]. Figure 6(b) gives the weight maps of the 4 classification neurons. As we can see, each of them has learned a specific pattern.

Figure 7(a) gives the learning results. Initially, the spiking activity is random, but as time goes on, a spiking pattern emerges which represents the corresponding selectivity of the neurons. This selectivity is also visible in the firing rates of the learning neurons shown in Figure 7(b). Initially the spiking rate is high, as all the neurons fire randomly. As time goes on, the neurons start learning patterns and only fire selectively, drastically reducing the firing rate. Most neurons remain idle through the entire training, and the average firing probability is 11.472%. Because the computation is event-driven, the sparsity of neuron activation leads to low power consumption.
Fig. 7: Learning neuron spike activity. (a) Spike train pattern and corresponding input; (b) average firing rate of learning neurons.
B. Performance evaluation
The traffic pattern and spike pattern of the NoC are specific to the topology and neuron parameters of the SNNs mapped to it. In order to guarantee that the proposed architecture can satisfy the requirements of various applications, we perform a stress test to evaluate the performance of the NoC by increasing the firing rate, using the above MNIST recognition network as a baseline. We configured a 4x4 mesh network with random connectivity. Each physical neuron has M = 128 logical neurons and N = 256 input axons; 2048 logical neurons are distributed over 16 physical neurons. The router has a buffer depth of 2 packets. Table I shows the NoC traffic statistics under different firing rates. The first row is the above MNIST network used as the baseline. Each experiment runs 1000 NECs. At the operation frequency of 200 MHz, 200×10⁶/(257 × …) NECs can be executed in 1 second. Benefiting from the sparse activation of SNNs and the shared input axons, the traffic stays at a relatively low level. When the firing rate is 87.526%, the NoC achieves a throughput of 26 Mbyte/s. The fourth column in Table I shows the average latency of the network under different firing rates. The MNIST baseline has smaller latency because it has a shorter average packet route than the random topology. The MNIST baseline also has larger traffic because the external image input is injected into the network. The average latency remains stable under different traffic loads. All packets can be delivered within the current NEC, which is 33560 clock cycles in this case. We observe no packet drop due to congestion.

TABLE I: Traffic Statistics (firing rate, total spikes, traffic in bytes, average/max/min latency in cycles)

Fig. 8: Congestion rate.

The experiment shows that, although the small flit size causes large delay, there is still sufficient time to deliver packets. Two types of congestion are studied. The first type occurs inside the router and is caused by multiple incoming packets requesting the same destination port. In this case, only one port is granted to transmit while the other ports are stalled temporarily until the transmission is done. We refer to this as contention congestion. The second type is buffer congestion, which occurs when the destination router's input buffer is full; the router has to wait until the packets stored in the destination are transmitted. A packet will be dropped when buffer congestion occurs. We define the congestion rate as T_congestion/T_execution, where T_congestion is the total number of clock cycles in which congestion has occurred, and T_execution is the running time of the entire simulation. As shown in Figure 8, both types of congestion rate increase as the firing rate increases. However, due to the time-multiplexed design, a physical neuron can generate at most 1 spike every 260 clocks; the traffic is spread over a long duration, and most of the time the routers remain idle. The sparsity of SNN activation makes the traffic even sparser. As a result, even in the worst case, where the network has a firing rate of 87.562%, congestion is still rare.

In SNNs, neuron firing rates are about 10% [36]. No significant performance degradation is observed in the worst case; therefore, the proposed architecture is able to satisfy the requirements of various applications.

C. Design space exploration
The physical neuron capacity, NoC size and router buffer size all affect power consumption, parallelism and efficiency.

TABLE II: Impact of physical neuron capacity

Size | NEC (cycles) | Physical neurons | LUT | Memory (bytes) | Power (mW) | Energy (J)
32   | 8,580        | 64               | 18.97% | 1,394,688 | 5,759.05 | 0.247
64   | 16,900       | 32               | 9.56%  | 1,359,872 | 2,144.83 | 0.351
128  | 33,540       | 16               | 4.87%  | 1,342,464 | 1,839.52 | 0.498
256  | 66,820       | 8                | 2.46%  | 1,333,760 | 1,679.96 | 0.783
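The NEC lengths and physical-neuron counts in Table II follow directly from the time-multiplexed design. The short sketch below reproduces those two columns (a minimal model assuming, as in the experiments, a 260-cycle slot per logical neuron plus one slot of overhead, a 200 MHz clock, and a 2048-neuron SNN):

```python
# Minimal model of the physical-neuron-capacity trade-off, assumed
# from Table II: NEC = (capacity + 1) * 260 clock cycles.
CLOCK_HZ = 200e6        # operating frequency used in the experiments
SLOT_CYCLES = 260       # clock cycles per logical-neuron time slot
TOTAL_NEURONS = 2048    # logical neurons in the mapped SNN

def nec_cycles(capacity: int) -> int:
    """Clock cycles per network evaluation cycle (NEC)."""
    return (capacity + 1) * SLOT_CYCLES

def physical_neurons(capacity: int) -> int:
    """Physical neurons needed to host all logical neurons."""
    return TOTAL_NEURONS // capacity

for m in (32, 64, 128, 256):
    cycles = nec_cycles(m)
    print(f"capacity {m:3d}: {physical_neurons(m):2d} physical neurons, "
          f"NEC = {cycles} cycles, {CLOCK_HZ / cycles:7.0f} NECs/s")
```

With these assumptions the model reproduces the first three columns of Table II, and the 66,820 / 8,580 ≈ 7.8× ratio between the largest and smallest capacity is the roughly 8× speedup discussed in the text.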
Here we study the impact of these factors and provide guidelines for the design of hardware spiking neural networks. First, there is a trade-off between physical neuron capacity and parallelism. Assume an SNN has 2048 neurons and a physical neuron can contain 256 logical neurons; then 2048 / 256 = 8 physical neurons are required to map the SNN to hardware, and a NEC must last at least (256 + 1) × 260 = 66,820 clock cycles. If a physical neuron can contain only 32 logical neurons, 2048 / 32 = 64 physical neurons are required and a NEC takes (32 + 1) × 260 = 8,580 clock cycles. The second configuration therefore runs approximately 8 times faster than the first, as it requires fewer clock cycles per NEC. However, this improvement is obtained at the cost of silicon area and power consumption.

Table II shows the impact of physical neuron capacity on FPGA resource consumption, power and energy when mapping an SNN with 2048 neurons to hardware. The experiment is performed at an operating frequency of 200 MHz. LUT consumption has an approximately linear relation to the number of physical neurons, since the logic resource consumption of a cell is almost constant. Memory consumption increases slightly as the number of physical neurons grows; the extra memory is introduced by the routers' input buffers. The last column shows the energy consumed by executing 1000 NECs. Although power is significantly higher when more physical neurons are used, the NEC becomes shorter, so more NECs can be executed in a given running time. For an FPGA implementation, it is therefore preferable to use a small physical neuron capacity to increase both parallelism and energy efficiency.

Another factor that affects performance is the router's buffer size. A larger buffer reduces the congestion rate, but the dual-clock buffer is expensive in area, and it is desirable to reduce router area to improve neuron density. We therefore studied the impact of buffer size on network performance.
A 4x4 network consisting of 2048 neurons is configured, and two sets of experiments are performed. In the first set, the SNN's firing rate is approximately 10%, which is close to the firing rate of realistic applications. In the second set, a pressure test is performed with an SNN firing rate of approximately 100%. Network performance is shown in Table III. With a buffer depth of 8, buffer congestion decreases considerably, and contention congestion also decreases as a consequence of the reduced buffer congestion. Buffer congestion becomes rare when the buffer depth is 16; increasing the buffer further does not bring significant performance improvement.

TABLE III: Impact of router buffer size (columns: firing rate, buffer size (width × depth), contention congestion, buffer congestion; entries not recoverable)

VI. CONCLUSIONS

In this paper, we presented a comprehensive system-level spiking neural network hardware implementation, which features scalability, flexibility and in-hardware STDP learning capability. The proposed design is validated by unsupervised learning of MNIST digits.

REFERENCES

[1] Y. Li, Z. Li, and Q. Qiu, "Assisting fuzzy offline handwriting recognition using recurrent belief propagation," in Computational Intelligence (SSCI), 2016 IEEE Symposium Series on. IEEE, 2016, pp. 1–8.
[2] A. Ren, Z. Li, C. Ding, Q. Qiu, Y. Wang, J. Li, X. Qian, and B. Yuan, "SC-DCNN: highly-scalable deep convolutional neural network using stochastic computing," in Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, 2017, pp. 405–418.
[3] Y. Wang, C. Ding, Z. Li, G. Yuan, S. Liao, X. Ma, B. Yuan, X. Qian, J. Tang, Q. Qiu et al., "Towards ultra-high performance and energy efficiency of deep learning systems: an algorithm-hardware co-optimization framework," arXiv preprint arXiv:1802.06402.
[4] S. Lin, N. Liu, M. Nazemi, H. Li, C. Ding, Y. Wang, and M. Pedram, "FFT-based deep learning deployment in embedded systems," arXiv preprint arXiv:1712.04910, 2017.
[5] S. Wang, Z. Li, C. Ding, B. Yuan, Q. Qiu, Y. Wang, and Y. Liang, "C-LSTM: enabling efficient LSTM using structured compression techniques on FPGAs," in Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. ACM, 2018, pp. 11–20.
[6] S. B. Furber, D. R. Lester, L. A. Plana, J. D. Garside, E. Painkras, S. Temple, and A. D. Brown, "Overview of the SpiNNaker system architecture," IEEE Transactions on Computers, vol. 62, no. 12, pp. 2454–2467, 2013.
[7] C. Ding, S. Liao, Y. Wang, Z. Li, N. Liu, Y. Zhuo, C. Wang, X. Qian, Y. Bai, G. Yuan et al., "CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices," in Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. ACM, 2017, pp. 395–408.
[8] J. Li, Z. Yuan, Z. Li, C. Ding, A. Ren, Q. Qiu, J. Draper, and Y. Wang, "Hardware-driven nonlinear activation for stochastic computing based deep convolutional neural networks," in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 1230–1236.
[9] P. A. Merolla, J. V. Arthur, R. Alvarez-Icaza, A. S. Cassidy, J. Sawada, F. Akopyan, B. L. Jackson, N. Imam, C. Guo, Y. Nakamura et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, vol. 345, no. 6197, pp. 668–673, 2014.
[10] Z. Li, Y. Wang, and Q. Qiu, "Probabilistic inference in neuromorphic architecture: applications and implementations."
[11] R. Ananthanarayanan, S. K. Esser, H. D. Simon, and D. S. Modha, "The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses," in High Performance Computing Networking, Storage and Analysis, Proceedings of the Conference on. IEEE, 2009, pp. 1–12.
[12] W. Maass, "Networks of spiking neurons: the third generation of neural network models," Neural Networks, vol. 10, no. 9, pp. 1659–1671, 1997.
[13] J. Sjöström and W. Gerstner, "Spike-timing dependent plasticity," Spike-timing dependent plasticity, vol. 35, 2010.
[14] A. Shrestha, K. Ahmed, Y. Wang, and Q. Qiu, "Stable spike-timing dependent plasticity rule for multilayer unsupervised and supervised learning," in Neural Networks (IJCNN), 2017 International Joint Conference on. IEEE, 2017, pp. 1999–2006.
[15] A. Shrestha, K. Ahmed, Y. Wang, D. P. Widemann, A. T. Moody, B. C. Van Essen, and Q. Qiu, "A spike-based long short-term memory on a neurosynaptic processor," in Computer-Aided Design (ICCAD), 2017 IEEE/ACM International Conference on. IEEE, 2017, pp. 631–637.
[16] A. G. Andreou, A. A. Dykman, K. D. Fischl, G. Garreau, D. R. Mendat, G. Orchard, A. S. Cassidy, P. Merolla, J. V. Arthur, R. Alvarez-Icaza et al., "Real-time sensory information processing using the TrueNorth neurosynaptic system," in ISCAS, 2016, p. 2911.
[17] P. Yao, H. Wu, B. Gao, S. B. Eryilmaz, X. Huang, W. Zhang, Q. Zhang, N. Deng, L. Shi, H.-S. P. Wong et al., "Face classification using electronic synapses," Nature Communications, vol. 8, 2017.
[18] M. R. Azghadi, S. Moradi, and G. Indiveri, "Programmable neuromorphic circuits for spike-based neural dynamics," in New Circuits and Systems Conference (NEWCAS), 2013 IEEE 11th International. IEEE, 2013, pp. 1–4.
[19] S. Cawley, F. Morgan, B. McGinley, S. Pande, L. McDaid, S. Carrillo, and J. Harkin, "Hardware spiking neural network prototyping and application," Genetic Programming and Evolvable Machines, vol. 12, no. 3, pp. 257–280, 2011.
[20] S. Carrillo, J. Harkin, L. J. McDaid, F. Morgan, S. Pande, S. Cawley, and B. McGinley, "Scalable hierarchical network-on-chip architecture for spiking neural network hardware implementations," IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 12, pp. 2451–2461, 2013.
[21] S. Park, A. Sheri, J. Kim, J. Noh, J. Jang, M. Jeon, B. Lee, B. Lee, B. Lee, and H. Hwang, "Neuromorphic speech systems using advanced ReRAM-based synapse," in Electron Devices Meeting (IEDM), 2013 IEEE International. IEEE, 2013, pp. 25–6.
[22] J. M. Cruz-Albrecht, M. W. Yung, and N. Srinivasa, "Energy-efficient neuron, synapse and STDP integrated circuits," IEEE Transactions on Biomedical Circuits and Systems, vol. 6, no. 3, pp. 246–256, 2012.
[23] K. Ahmed, A. Shrestha, Y. Wang, and Q. Qiu, "System design for in-hardware STDP learning and spiking based probabilistic inference," in VLSI (ISVLSI), 2016 IEEE Computer Society Annual Symposium on. IEEE, 2016, pp. 272–277.
[24] J. Schemmel, A. Grubl, K. Meier, and E. Mueller, "Implementing synaptic plasticity in a VLSI spiking neural network model," in Neural Networks, 2006. IJCNN'06. International Joint Conference on. IEEE, 2006, pp. 1–6.
[25] F. L. M. Huayaney and E. Chicca, "A VLSI implementation of a calcium-based plasticity learning model," in Circuits and Systems (ISCAS), 2016 IEEE International Symposium on. IEEE, 2016, pp. 373–376.
[26] G. Indiveri, E. Chicca, and R. Douglas, "A VLSI array of low-power spiking neurons and bistable synapses with spike-timing dependent plasticity," IEEE Transactions on Neural Networks, vol. 17, no. 1, pp. 211–221, 2006.
[27] G. Yuan, C. Ding, R. Cai, X. Ma, Z. Zhao, A. Ren, B. Yuan, and Y. Wang, "Memristor crossbar-based ultra-efficient next-generation baseband processors," in Circuits and Systems (MWSCAS), 2017 IEEE 60th International Midwest Symposium on. IEEE, 2017, pp. 1121–1124.
[28] L. Deng, G. Li, N. Deng, D. Wang, Z. Zhang, W. He, H. Li, J. Pei, and L. Shi, "Complex learning in bio-plausible memristive networks," Scientific Reports, vol. 5, 2015.
[29] E. Covi, S. Brivio, A. Serb, T. Prodromakis, M. Fanciulli, and S. Spiga, "Analog memristive synapse in spiking networks implementing unsupervised learning," Frontiers in Neuroscience, vol. 10, 2016.
[30] M. Prezioso, F. M. Bayat, B. Hoskins, K. Likharev, and D. Strukov, "Self-adaptive spike-time-dependent plasticity of metal-oxide memristors," Scientific Reports, vol. 6, p. 21331, 2016.
[31] S. Saïghi, C. G. Mayr, T. Serrano-Gotarredona, H. Schmidt, G. Lecerf, J. Tomas, J. Grollier, S. Boyn, A. F. Vincent, D. Querlioz et al., "Plasticity in memristive devices for spiking neural networks," Frontiers in Neuroscience, vol. 9, 2015.
[32] K. Ahmed, A. Shrestha, Q. Qiu, and Q. Wu, "Probabilistic inference using stochastic spiking neural networks on a neurosynaptic processor," in Neural Networks (IJCNN), 2016 International Joint Conference on. IEEE, 2016, pp. 4286–4293.
[33] B. Nessler, M. Pfeiffer, L. Buesing, and W. Maass, "Bayesian computation emerges in generic cortical microcircuits through spike-timing-dependent plasticity," PLoS Computational Biology, vol. 9, no. 4, p. e1003037, 2013.
[34] B. Pakkenberg, D. Pelvig, L. Marner, M. J. Bundgaard, H. J. G. Gundersen, J. R. Nyengaard, and L. Regeur, "Aging and the human neocortex," Experimental Gerontology, vol. 38, no. 1, pp. 95–99, 2003.
[35] A. Jantsch, H. Tenhunen et al., Networks on Chip. Springer, 2003, vol. 396.
[36] J. A. Cardin, M. Carlén, K. Meletis, U. Knoblich, F. Zhang, K. Deisseroth, L.-H. Tsai, and C. I. Moore, "Driving fast-spiking cells induces gamma rhythm and controls sensory responses," Nature, vol. 459, no. 7247, pp. 663–667, 2009.
[37] J. Lee, C. Nicopoulos, S. J. Park, M. Swaminathan, and J. Kim, "Do we need wide flits in networks-on-chip?" in VLSI (ISVLSI), 2013 IEEE Computer Society Annual Symposium on. IEEE, 2013, pp. 2–7.
[38] C. J. Glass and L. M. Ni, "The turn model for adaptive routing,"