A 28-nm Convolutional Neuromorphic Processor Enabling Online Learning with Spike-Based Retinas
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. This document is the paper as accepted for presentation at the IEEE International Symposium on Circuits and Systems (ISCAS) 2020.
Charlotte Frenkel, Jean-Didier Legat and David Bol
ICTEAM Institute, Université catholique de Louvain, Louvain-la-Neuve, Belgium
Abstract—In an attempt to follow biological information representation and organization principles, the field of neuromorphic engineering is usually approached bottom-up, from biophysical models to large-scale integration in silico. While ideal as experimentation platforms for cognitive computing and neuroscience, bottom-up neuromorphic processors have yet to demonstrate an efficiency advantage over specialized neural network accelerators for real-world problems. Top-down approaches aim at answering this difficulty by (i) starting from the applicative problem and (ii) investigating how to make the associated algorithms hardware-efficient and biologically plausible. In order to leverage the data sparsity of spike-based neuromorphic retinas for adaptive edge computing and vision applications, we follow a top-down approach and propose SPOON, a 28-nm event-driven CNN (eCNN). It embeds online learning with only 16.8% power and 11.8% area overheads with the biologically-plausible direct random target projection (DRTP) algorithm. With an energy per classification of 313 nJ at 0.6 V and a 0.32-mm² area for accuracies of 95.3% (on-chip training) and 97.5% (off-chip training) on MNIST, we demonstrate that SPOON reaches the efficiency of conventional machine learning accelerators while embedding on-chip learning and being compatible with event-based sensors, a point that we further emphasize with N-MNIST benchmarking.

I. INTRODUCTION
The field of neuromorphic engineering takes its roots in the discovery that the MOS transistor operated in subthreshold could directly emulate the ion channel dynamics of the brain [1]. This led to a long tradition of bottom-up design since the late 1980s, going from neuroscience observation and biophysical neuron and synapse models to analog and digital small-scale [2]–[7] and large-scale [8]–[12] integrations in silico. Bottom-up neuromorphic processors are thus ideal as experimentation platforms for cognitive computing and neuroscience [13], [14], for which they even help to reverse-engineer the brain with analysis by synthesis (Fig. 1). However, the key challenge lies in applying them to real-world scenarios, which is yet to be demonstrated with an efficiency advantage compared to conventional frame-based artificial and convolutional neural network (ANN, CNN) hardware accelerators [14], [15].

In order to address the difficulty of bottom-up neuromorphic designs in tackling real-world problems efficiently, a few top-down designs have recently been proposed for adaptive edge computing (e.g., [15]–[18]), ensuring both robustness to uncontrolled environments and low-cost deployment for applications that are power- and resource-constrained during the training phase. Starting from this applicative problem, top-down designs investigate how to embed online learning with a focus on hardware efficiency and biological plausibility (Fig. 1). However, top-down designs currently appear to be either spiking neural networks (SNNs) with event-driven processing at the expense of accuracy [16], [17] or binary neural networks (BNNs)
This work was supported by the fonds européen de développement régional (FEDER), the Wallonia within the "Wallonie-2020.EU" program, the Plan Marshall and the FRS-FNRS of Belgium. E-mail: [email protected]
Fig. 1: Summary of bottom-up and top-down neuromorphic design approaches.
Fig. 2: Block diagram of the SPOON event-driven CNN (eCNN) processor.

with high accuracy at the expense of conventional frame-based processing [15]. The chip from Chen et al. [18] allows exploring both sides with an SNN embedding STDP that can also be programmed as a BNN using offline-trained weights. Therefore, in this work, we propose SPOON (standing for spiking online-learning convolutional neuromorphic processor), an event-driven CNN (eCNN) for adaptive edge computing. Event-driven convolutions with time-to-first-spike coding leverage sparsity from event-based neuromorphic retinas [19], an idea now also explored for conventional machine learning accelerators [20], while a combination with frame-based processing ensures maximum data reuse and parallelism in fully-connected layers. SPOON embeds online learning at low power and area overheads with the biologically-plausible direct random target projection (DRTP) algorithm [21], which we introduced recently to release the key issues of the successful error backpropagation (BP) algorithm [22] precluding hardware efficiency and biological plausibility. We demonstrate that, to the best of our knowledge, only SPOON allows reaching the efficiency of conventional machine learning accelerators while embedding on-chip learning and being compatible with event-based sensors. The architecture of SPOON is described in Section II; implementation details and benchmarking results on MNIST and N-MNIST are given in Section III.
Fig. 3: Circuit architecture of (a) the CONV core and (b) the FC core. Detailed control signals, data gating and overflow protection mechanisms are not shown for clarity. Event-driven and frame-based processing are highlighted following the conventions of Fig. 2.
II. ARCHITECTURE
A block diagram of SPOON is shown in Fig. 2. Four-phase-handshake address-event representation (AER) buses [23] are used for event-driven handling of input sensor spikes and of output inferences. All weights and parameters can be programmed and read back with an SPI bus. As neuromorphic vision sensors send spikes encoding temporal contrast [24], pixels with the highest luminosity change spike first with ON (positive delta) or OFF (negative delta) events, conveying useful data for edge detection. In order to efficiently extract this information, we use time-to-first-spike encoding (i.e., timing code) [25] in the convolutional layers, which are handled in the CONV core (Section II-A). In order to match the time constant of SPOON with the given application, time ticks can be retrieved either from an external reference pin TICK_EXT or generated internally by a configurable synchronous on-chip tick generator. Fully-connected layers are handled in the FC core (Section II-B), which uses a combination of frame-based and event-driven processing for maximum data reuse and efficient handling of DRTP updates (Section II-C).
A. Convolutional (CONV) core
The CONV core consists of a convolutional layer with 10 5×5 kernels; input events consist of {x,y} coordinates for a 32×32 pixel array and an ON/OFF polarity bit. Based on the TICK time reference and the DATA_SYNC pin that signals the start of an input sample, input events are concatenated with an 8-bit timestamp before being pushed into a 32-stage FIFO.

Event-driven convolutions follow the timing diagram of Fig. 4(a), where the 10 kernels are processed sequentially. The 9-bit timestamp, including the input event polarity bit, is first multiplied with all values in the current kernel i in a 5×5 multiplier array. The partial sums associated to kernel i and input pixel {x,y} coordinates, stored in a 16-kB SRAM, are then updated. Due to SRAM aspect ratio constraints, this update is split into four 256-bit read/write accesses whose locations are given by an address decoder. An overflow protection mechanism emulates a hardtanh activation function. Maxpooling is automatically carried out after the FIFO has been emptied and the 8-bit timestamp counter falls to zero. It is followed by a quantization to 6 bits with configurable rescaling. The CONV core then outputs 490 6-bit activations (CONV_OUT) and a CONV_DONE trigger to enable the FC core, which is clock-gated otherwise.

Fig. 4: Timing diagrams for (a) event-driven convolution in the CONV core and (b) combination of hidden and output layer processing in the FC core. Black: cycles required for inference. Orange: optional DRTP update cycles.

Depending on the event-driven sensor use case, two features can be used to adjust the accuracy-energy tradeoff. First, the first-spike gating block can be enabled to keep only the most-informative first spike of each pixel, thus dropping subsequent events. Second, the INFER_REQ pin can be used to request inference at any time by triggering maxpooling before the 8-bit timestamp counter falls to zero, thus ignoring all subsequent less-informative events.

Fig. 5: (a) Backpropagation of error (BP) algorithm [22]. (b) Direct random target projection (DRTP) algorithm [21]. Adapted from [21] for the case of a single hidden layer, matching the conventions of Figs. 3 and 6.
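The event-driven accumulation principle can be sketched in a few lines of Python (ours; the boundary handling and the 'same'-style indexing are simplifications, not the RTL behavior): each incoming event scatters (timestamp × kernel weight) contributions into the partial sums of the output pixels whose receptive field covers it.

```python
# Hedged sketch of event-driven convolution: one event at (x, y) updates
# the psum map of one kernel with timestamp-weighted contributions.
def accumulate_event(psums, kernel, x, y, t_val):
    """Update a 2-D psum map in place for one event at (x, y).

    psums : H x W list of lists of accumulated partial sums (one kernel).
    kernel: 5 x 5 list of lists of signed weights.
    t_val : signed timestamp word (polarity folded into the sign).
    """
    H, W = len(psums), len(psums[0])
    for dy in range(-2, 3):          # 5x5 neighborhood around the event
        for dx in range(-2, 3):
            px, py = x + dx, y + dy
            if 0 <= px < W and 0 <= py < H:
                psums[py][px] += t_val * kernel[dy + 2][dx + 2]

# With an identity kernel, a single event only marks its own position.
psums = [[0] * 28 for _ in range(28)]
identity_kernel = [[1 if (r, c) == (2, 2) else 0 for c in range(5)]
                   for r in range(5)]
accumulate_event(psums, identity_kernel, 3, 3, t_val=200)
```

On silicon, the same scatter is realized by the 5×5 multiplier array and four 256-bit SRAM read/write accesses per event, with overflow protection emulating the hardtanh saturation.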
B. Fully-connected (FC) core
The FC core consists of a 128-neuron fully-connected hidden layer followed by a 10-neuron output layer, both with 8-bit programmable weights that are automatically initialized to zero for online learning (Section II-C). The circuit architecture and the associated timing diagram are shown in Figs. 3(b) and 4(b), respectively. As highlighted in Fig. 2, the hidden layer output y_hid = f_hid(W_hid x) is computed with a conventional frame-based approach as all the inputs are immediately available when receiving the CONV_DONE trigger, where W_hid represents the hidden layer weights, f_hid is the hidden layer activation function and x is the input from the CONV core (Fig. 5). The hidden neurons are evaluated sequentially and inputs are processed by batches of 64. It requires 8 cycles to retrieve the 500 weights associated to a hidden neuron, including both the 490 weights W_hid,i connecting to the inputs and the 10 weights W_out,i connecting to the output layer neurons, where the index i denotes hidden neuron i. Once the weighted sum of inputs W_hid,i x of hidden neuron i has been computed, output layer processing is triggered in an event-driven fashion to ensure maximum data reuse:

– the activation y_hid,i is obtained by quantizing W_hid,i x to 3 bits with a hardtanh function, whose binary derivative is one in the linear range and zero elsewhere (HID_ACT and HID_GRAD in Fig. 3(b)),
– if the derivative is non-zero, DRTP updates can be directly applied to W_hid,i (Fig. 4, orange), otherwise they are skipped (Section II-C),
– W_out,i y_hid,i is added to the 10 output partial sums,
– as the final activation and derivative of the output neurons are not yet available for the current sample, a DRTP update of W_out,i is triggered based on buffered previous-sample data (Section II-C).

Finally, when all the hidden neurons have been processed, the output layer activation y_out = f_out(W_out y_hid) is obtained by quantizing the output partial sums to 3 bits with a hardsigmoid function, whose binary derivative is one in the linear range and zero elsewhere (OUT_ACTS and OUT_GRADS in Fig. 3(b)).
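The frame-based hidden layer pass with its 3-bit hardtanh activation and binary derivative can be sketched as follows (ours; the linear-range scale and level mapping are assumptions, the chip's exact fixed-point rescaling is configurable and not specified here):

```python
def hardtanh_3b(z, scale=64):
    """3-bit hardtanh activation and its binary derivative.

    z is the integer weighted sum; `scale` (illustrative) sets the linear
    range [-scale, +scale). Returns (quantized level in -4..3, derivative).
    """
    if z >= scale:
        return 3, 0      # saturated high: derivative gates off DRTP updates
    if z < -scale:
        return -4, 0     # saturated low
    level = (z * 4) // scale  # linear range mapped to 8 signed levels
    return level, 1           # derivative is 1 inside the linear range

def fc_hidden_layer(x, W_hid):
    """Frame-based hidden layer: all inputs are available at once."""
    acts, grads = [], []
    for w_row in W_hid:  # hidden neurons evaluated sequentially
        z = sum(wi * xi for wi, xi in zip(w_row, x))
        a, g = hardtanh_3b(z)
        acts.append(a)
        grads.append(g)
    return acts, grads

acts, grads = fc_hidden_layer([1, 2, 3], [[1, 1, 1], [40, 0, 0]])
```

The returned binary derivatives play the same role as HID_GRAD in Fig. 3(b): they gate the hidden layer DRTP updates described next.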
C. On-chip online training with direct random target projection (DRTP)
Building on feedback alignment techniques [26], [27], which were proposed to solve the weight transport problem of BP [22] (i.e., the requirement for weight symmetry in the forward and backward pathways), we proposed in [21] the direct random target projection (DRTP) algorithm to release not only the weight transport problem, but also update locking (i.e., the requirement for full forward and backward passes before the weights can be updated). As these are the two key issues that preclude BP from being hardware-efficient and biologically plausible [21], DRTP is a low-cost algorithm suitable for deployment at the edge. It relies only on feedforward and local computation (Fig. 5) and estimates the hidden layer loss gradient δy_hid as a projection of the target vector t* (i.e., one-hot-encoded labels) with a fixed random matrix B_hid. This operation corresponds to a simple label-dependent random vector selection, which can be quantized down to binary resolution with only a negligible impact on DRTP performance. The hidden layer weight updates ΔW_hid can then be computed as

    ΔW_hid = −η_hid (δy_hid ⊙ f′_hid(W_hid x)) xᵀ,    (1)

where η_hid is the hidden layer learning rate and ⊙ denotes element-wise multiplication. As opposed to BP, δy_hid is always non-zero in DRTP; the weights can thus be initialized to zero without precluding training convergence. The DRTP output layer update is identical to the BP update, i.e.,

    ΔW_out = −η_out (e ⊙ f′_out(W_out y_hid)) y_hidᵀ,    (2)

where η_out is the output layer learning rate.

Fig. 6: Circuit architecture for the DRTP weight update module.

The circuit architecture for the DRTP weight update module of SPOON is shown in Fig. 6. According to Eq. (1), the derivative f′_hid (HID_GRAD) of hidden neuron i, taking values 0 or 1, can be used as an enable signal for the hidden layer update module (Section II-B). The fixed random binary matrix B_hid is stored in a register file. A specific bit is selected based on the current hidden layer neuron index (HID_IDX) and the training sample label (LABEL). Therefore, the only required computation is a label-dependent sign inversion of the inputs from the CONV core (HID_IN), processed by batches of 64. The obtained values are then used as probabilities conditioning random increments/decrements to the hidden layer weights W_hid,i (W_HID), depending on the values generated by a linear feedback shift register (LFSR) and a configurable learning rate. In order to parallelize the generation of 64 12-bit seeds with a single LFSR, we applied the unfolding technique [28], similarly to the stochastic update mechanism that we proposed for the MorphIC SNN in [7].

The output layer update follows Eq. (2), whose terms are buffered so that updates from the previous sample to output layer weights W_out,i (W_OUT) can be applied concurrently with the hidden layer updates of the current sample. The error e is computed based on the previous label and output activations (OUT_ACTS), where the binary output derivatives (OUT_GRADS) act as a gating signal. If the previous hidden neuron activation is zero, updates are skipped; otherwise, it is multiplied with the error and used in the stochastic update array, which operates as in the hidden layer update module.

Fig. 7: SPOON layout view obtained under Cadence Innovus.

TABLE I: Specifications and pre-silicon performance metrics of SPOON.

    Technology                  28-nm FDSOI CMOS
    Implementation              Digital
    Area                        0.32 mm² (0.26 mm² excl. rails)
    Topology                    C5 (10 kernels) – stride-4 maxpool – 128 FC – 10 FC
    Leakage power               61 µW at 0.6 V
    Energy for CONV core        1.7 nJ/event at 0.6 V
    Energy for FC core          55 nJ/inference at 0.6 V
    Online learning overhead    16.8% in power, 11.8% in area
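The DRTP updates of Eqs. (1) and (2) can be condensed into a minimal floating-point sketch (ours; it uses linear activations so all derivatives are 1, and it omits SPOON's derivative gating, 8-bit quantization and LFSR-based stochastic updates):

```python
def drtp_step(x, label, W_hid, W_out, B_hid, lr=0.01):
    """One DRTP training step for a single hidden layer (Eqs. (1)-(2)).

    B_hid is a fixed random (n_classes x n_hid) sign matrix: the hidden
    'gradient' is simply its label-indexed row, so no error is ever
    backpropagated (no weight transport, no update locking).
    """
    # Forward pass (linear activations for brevity, so f' = 1 everywhere)
    y_hid = [sum(w * xi for w, xi in zip(row, x)) for row in W_hid]
    y_out = [sum(w * h for w, h in zip(row, y_hid)) for row in W_out]
    # Output layer, Eq. (2): standard delta-rule update with the error e
    target = [1.0 if c == label else 0.0 for c in range(len(y_out))]
    err = [o - t for o, t in zip(y_out, target)]
    for k, row in enumerate(W_out):
        for j in range(len(row)):
            row[j] -= lr * err[k] * y_hid[j]
    # Hidden layer, Eq. (1): a label-dependent random row replaces the
    # backpropagated gradient, so the update is purely local
    delta_hid = B_hid[label]
    for i, row in enumerate(W_hid):
        for j in range(len(row)):
            row[j] -= lr * delta_hid[i] * x[j]

# Zero-initialized weights still train: the hidden update is non-zero.
W_hid = [[0.0, 0.0], [0.0, 0.0]]
W_out = [[0.0, 0.0], [0.0, 0.0]]
drtp_step([1.0, 0.0], 0, W_hid, W_out, B_hid=[[1, -1], [-1, 1]])
```

Note how the first step already moves W_hid away from zero even though W_out is still zero, which is exactly why zero weight initialization does not preclude convergence under DRTP.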
III. IMPLEMENTATION AND BENCHMARKING RESULTS
SPOON has been taped out in a 28-nm FDSOI CMOS process; the layout is presented in Fig. 7, while the specifications and pre-silicon performance metrics are reported in Table I. It occupies an area of only 0.32 mm². At 0.6 V, SPOON has a leakage power of 61 µW and the convolutional layer update consumes 1.7 nJ per event in the CONV core. When convolution is over, the FC core requires 55 nJ to update the fully-connected layers and to send the inferred label with the output AER bus. The high suitability of DRTP for adaptive edge computing is highlighted by its low power and area overheads of 16.8% and 11.8%, respectively, compared to a design without online learning. In order to accelerate benchmarking, the accuracy results in the subsequent text were retrieved from an FPGA implementation of SPOON.

The accuracy-area-energy tradeoff on the MNIST dataset of handwritten digits [29] is shown in Fig. 8 for conventional ANN and CNN machine learning accelerators [30]–[32], the BNN from Park et al. [15], the SNN offline-trained as a BNN from Chen et al. [18], SNNs [7], [11], [16], [17] and SPOON, which requires only 313 nJ and 117 µs per inference for an area of 0.32 mm² using a time-to-first-spike encoding. Training SPOON for MNIST using an off-chip optimizer based
on PyTorch [35] with quantization-aware training [36], we reach a test-set accuracy of 97.5%. When enabling on-chip DRTP-based online learning, where the convolution kernels are initialized and fixed to random values and plastic fully-connected weights are initialized to zero, SPOON reaches a test-set accuracy of 92.8% after one epoch and 95.3% after 100 epochs on the 60k-sample training set. It appears from Fig. 8 that only SPOON reaches the efficiency of conventional machine learning accelerators while being compatible with event-based sensors and embedding on-chip online learning.

Fig. 8: Accuracy-area-energy tradeoff normalized to 28 nm for SNN, ANN, CNN and BNN accelerators on MNIST: (a) accuracy vs. normalized active silicon area [mm²], (b) accuracy vs. normalized energy per inference [nJ]. Normalization has been carried out using the node factor, except for the 10-nm FinFET node from Chen et al. [18], where data from [33] was used. Being a mixed-signal design, the chip from Buhler et al. [17] was not normalized. The non-preprocessed MNIST experiments reported for Chen et al. [18] rely on offline BP-based BNN training. MNIST results for TrueNorth [11] are reported in [34].

The neuromorphic MNIST (N-MNIST) dataset [37] is a spiking version of MNIST generated by an ATIS silicon retina [38] mounted on a pan-tilt unit and moved in three saccades. As each active pixel spikes on average 4.8 times per saccade, leading to redundant information in time-to-first-spike coding, we use the single-spike-per-pixel mode of SPOON. Using only the first saccade of each sample, SPOON reaches a test-set accuracy of 93.8% with offline-trained weights and of 90.2% (one epoch) or 93.0% (100 epochs) with on-chip online training, while consuming 665 nJ per inference.
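As a back-of-envelope consistency check (our own estimate, assuming the total energy splits additively between the two cores and ignoring leakage during the inference window), the 313 nJ MNIST figure together with the per-core numbers of Table I implies on the order of 150 CONV events processed per digit:

```python
# Rough estimate only: assumes E_total = N_events * E_conv + E_fc,
# with all values taken from Table I and the MNIST results above.
E_CONV_PER_EVENT_NJ = 1.7     # CONV core energy per input event
E_FC_PER_INFERENCE_NJ = 55.0  # FC core energy per inference
E_TOTAL_NJ = 313.0            # reported energy per MNIST classification

events_per_digit = (E_TOTAL_NJ - E_FC_PER_INFERENCE_NJ) / E_CONV_PER_EVENT_NJ
print(round(events_per_digit))  # about 152 events per digit
```

Such a low event count per digit is consistent with the data sparsity argument of Section I and with the first-spike gating feature of the CONV core.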
IV. CONCLUSION
In this paper, we presented the SPOON event-driven CNN, following a top-down neuromorphic design approach. We demonstrated that combining event-driven and frame-based processing with weight-transport-free, update-unlocked training supports low-cost adaptive edge computing. Indeed, SPOON has an accuracy-area-energy tradeoff superior to SNNs and comparable to conventional machine learning accelerators, while enabling online learning with spike-based sensors.
REFERENCES
[1] C. Mead, Analog VLSI and Neural Systems. Reading, MA: Addison-Wesley, 1989.
[2] N. Qiao et al., "A reconfigurable on-line learning spiking neuromorphic processor comprising 256 neurons and 128K synapses," Frontiers in Neuroscience, vol. 9, no. 141, 2015.
[3] S. Moradi et al., "A scalable multicore architecture with heterogeneous memory structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs)," IEEE Transactions on Biomedical Circuits and Systems, vol. 12, no. 1, pp. 106-122, 2018.
[4] C. Mayr et al., "A biological-realtime neuromorphic system in 28 nm CMOS using low-leakage switched capacitor circuits," IEEE Transactions on Biomedical Circuits and Systems, vol. 10, no. 1, pp. 243-254, 2016.
[5] J.-S. Seo et al., "A 45nm CMOS neuromorphic chip with a scalable architecture for learning in networks of spiking neurons," Proc. of IEEE Custom Integrated Circuits Conference (CICC), 2011.
[6] C. Frenkel et al., "A 0.086-mm² 12.7-pJ/SOP 64k-synapse 256-neuron online-learning digital spiking neuromorphic processor in 28-nm CMOS," IEEE Transactions on Biomedical Circuits and Systems, vol. 13, no. 1, pp. 145-158, 2019.
[7] C. Frenkel, J.-D. Legat and D. Bol, "MorphIC: A 65-nm 738k-synapse/mm² quad-core binary-weight digital neuromorphic processor with stochastic spike-driven online learning," IEEE Transactions on Biomedical Circuits and Systems, vol. 13, no. 5, pp. 999-1010, 2019.
[8] J. Schemmel et al., "A wafer-scale neuromorphic hardware system for large-scale neural modeling," Proc. of IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1947-1950, 2010.
[9] B. V. Benjamin et al., "Neurogrid: A mixed-analog-digital multichip system for large-scale neural simulations," Proceedings of the IEEE, vol. 102, no. 5, pp. 699-716, 2014.
[10] E. Painkras et al., "SpiNNaker: A 1-W 18-core system-on-chip for massively-parallel neural network simulation," IEEE Journal of Solid-State Circuits, vol. 48, no. 8, pp. 1943-1953, 2013.
[11] F. Akopyan et al., "TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537-1557, 2015.
[12] M. Davies et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, vol. 38, no. 1, pp. 82-99, 2018.
[13] G. Cauwenberghs, "Reverse engineering the cognitive brain," Proceedings of the National Academy of Sciences, vol. 110, no. 39, pp. 15512-15513, 2013.
[14] G. Indiveri and S.-C. Liu, "Memory and information processing in neuromorphic systems," Proceedings of the IEEE, vol. 103, no. 8, pp. 1379-1397, 2015.
[15] J. Park, J. Lee and D. Jeon, "A 65nm 236.5 nJ/classification neuromorphic processor with 7.5% energy overhead on-chip learning using direct spike-only feedback," Proc. of IEEE International Solid-State Circuits Conference (ISSCC), pp. 140-142, 2019.
[16] J. K. Kim et al., "A 640M pixel/s 3.65 mW sparse event-driven neuromorphic object recognition processor with on-chip learning," IEEE Symposium on VLSI Circuits (VLSI-C), pp. C50-C51, 2015.
[17] F. N. Buhler et al., "A 3.43 TOPS/W 48.9 pJ/pixel 50.1 nJ/classification 512 analog neuron sparse coding neural network with on-chip learning and classification in 40nm CMOS," IEEE Symposium on VLSI Circuits (VLSI-C), pp. C30-C31, 2017.
[18] G. K. Chen et al., "A 4096-neuron 1M-synapse 3.8pJ/SOP spiking neural network with on-chip STDP learning and sparse weights in 10nm FinFET CMOS," Proc. of IEEE Symposium on VLSI Circuits (VLSI-C), 2018.
[19] G. Orchard et al., "HFirst: A temporal approach to object recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 10, pp. 2028-2040, 2015.
[20] K. Goetschalckx and M. Verhelst, "Breaking high resolution CNN bandwidth barriers with enhanced depth-first execution," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 9, no. 2, pp. 323-331, 2019.
[21] C. Frenkel, M. Lefebvre and D. Bol, "Learning without feedback: Direct random target projection as a feedback-alignment algorithm with layerwise feedforward training," arXiv preprint arXiv:1909.01311, 2019.
[22] D. Rumelhart, G. Hinton and R. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, 1986.
[23] K. A. Boahen, "Point-to-point connectivity between neuromorphic chips using address events," IEEE Transactions on Circuits and Systems II, vol. 47, no. 5, pp. 416-434, 2000.
[24] A. Vanarse, A. Osseiran and A. Rassau, "A review of current neuromorphic approaches for vision, auditory, and olfactory sensors," Frontiers in Neuroscience, no. 10, p. 115, 2016.
[25] S. Thorpe, A. Delorme and R. Van Rullen, "Spike-based strategies for rapid processing," Neural Networks, vol. 14, no. 6-7, pp. 715-725, 2001.
[26] T. P. Lillicrap et al., "Random synaptic feedback weights support error backpropagation for deep learning," Nature Communications, vol. 7, no. 13276, 2016.
[27] A. Nøkland, "Direct feedback alignment provides learning in deep neural networks," Proc. of Advances in Neural Information Processing Systems (NIPS), pp. 1037-1045, 2016.
[28] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation, John Wiley & Sons, 1999.
[29] Y. LeCun and C. Cortes, "The MNIST database of handwritten digits," 1998 [Online]. Available: http://yann.lecun.com/exdb/mnist/.
[30] P. N. Whatmough et al., "A 28nm SoC with a 1.2 GHz 568nJ/prediction sparse deep-neural-network engine with >0.1 timing error rate tolerance for IoT applications," Proc. of IEEE International Solid-State Circuits Conference (ISSCC), 2017.
[31] B. Moons et al., "BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS," Proc. of IEEE Custom Integrated Circuits Conference (CICC), 2018.
[32] Y. Chen et al., "A 2.86-TOPS/W current mirror cross-bar-based machine-learning and physical unclonable function engine for Internet-of-Things applications," IEEE Transactions on Circuits and Systems I, vol. 66, no. 6, pp. 2240-2252, 2019.
[33] K. Mistry, "10nm technology leadership," Leading at the Edge: Intel Technology and Manufacturing Day, 2017.
[34] S. K. Esser et al., "Backpropagation for energy-efficient neuromorphic computing," Proc. of Advances in Neural Information Processing Systems (NIPS), pp. 1117-1125, 2015.
[35] A. Paszke et al., "Automatic differentiation in PyTorch," Proc. of Annual Conf. on Neural Information Processing Systems (NIPS) Workshop, 2017.
[36] I. Hubara et al., "Quantized neural networks: Training neural networks with low precision weights and activations," The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6869-6898, 2017.
[37] G. Orchard et al., "Converting static image datasets to spiking neuromorphic datasets using saccades," Frontiers in Neuroscience, no. 9, p. 437, 2015.
[38] C. Posch, D. Matolin and R. Wohlgenannt, "A QVGA 143 dB dynamic range frame-free PWM image sensor with lossless pixel-level video compression and time-domain CDS," IEEE Journal of Solid-State Circuits, vol. 46, no. 1, pp. 259-275, 2011.