SPARE: Spiking Neural Network Acceleration Using ROM-Embedded RAMs as In-Memory-Computation Primitives
Amogh Agrawal*, Aayush Ankit*, and Kaushik Roy, Fellow, IEEE
Abstract—From the little we know about the human brain, the inherent cognitive mechanism is very different from the de facto state-of-the-art computing platforms. The human brain uses distributed, yet integrated, memory and computation units, unlike the physically separate memory and computation cores in typical von Neumann architectures. Despite the huge success of artificial intelligence, hardware systems running these algorithms consume orders of magnitude more energy than the human brain, mainly due to heavy data movement between the memory unit and the computation cores. Spiking neural networks (SNNs), built using bio-plausible neuron and synaptic models, have emerged as a power-efficient choice for designing cognitive applications. These algorithms involve several lookup-table (LUT) based function evaluations, such as high-order polynomials and transcendental functions for solving complex neuro-synaptic models, that typically require additional storage and thus bigger memories. To that effect, we propose 'SPARE', an in-memory, distributed processing architecture built on ROM-embedded RAM technology, for accelerating SNNs. ROM-embedded RAMs allow storage of LUTs (for neuro-synaptic models) embedded within a typical memory array, without additional area overhead. Our proposed architecture consists of a 2-D array of Processing Elements (PEs), wherein each PE has its own ROM-embedded RAM structure and executes part of the SNN computation. Since most of the computations (including multiple math-table evaluations) are done locally within each PE, unnecessary data transfers are restricted, thereby alleviating the problems arising from a physically separate remote memory unit and computation core. SPARE thus leverages both the hardware benefits of distributed, in-memory processing and the algorithmic benefits of SNNs. We evaluate SPARE for two different ROM-embedded RAM structures: CMOS-based ROM-embedded SRAMs (R-SRAMs) and STT-MRAM-based ROM-embedded MRAMs (R-MRAMs). Moreover, we analyze trade-offs in terms of energy, area and performance for using the two technologies on a range of image classification benchmarks. Furthermore, we leverage the additional storage density to implement complex neuro-synaptic functionalities. This enhances the utility of the proposed architecture by provisioning the implementation of any neuron/synaptic behavior as necessitated by the application. Our results show up to ∼ . ×, ∼ . × and ∼ . × improvement in energy, iso-storage area, and iso-area performance, respectively, by using neural network accelerators built on ROM-embedded RAM primitives.

Index Terms—Spiking neural network (SNN), ROM-embedded RAM, STT-MRAM, in-memory computing.
I. INTRODUCTION
Deep Neural Networks (DNNs) are inspired from the hi-erarchical learning behavior in the human brain and havetremendously enhanced the learning capabilities in machines[1], [2]. They have been credited to achieve high performance
All authors are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN, USA. Email: {agrawa64, aankit, kaushik}@purdue.edu (*these authors contributed equally to this work).

In doing so, however, DNNs tend to consume orders of magnitude more energy than the human brain. To bridge this energy gap, there have been proposals from both the algorithm and hardware perspectives. Spiking neural networks (SNNs), or third-generation neural networks, have evolved and have been shown to achieve classification accuracies comparable to their non-spiking counterparts [4]. SNNs rely on the transfer of neuron spikes from one layer to the next, resembling the information transfer in the human brain. These spikes are encoded as binary data, thereby drastically simplifying the computations and thus reducing the energy consumption.

On the other hand, hardware systems running DNN algorithms are inefficient, since DNN executions are memory- as well as compute-intensive. For instance, AlexNet, which won the ImageNet 2012 challenge, consists of 61 million parameters and involves 2-4 GOPS per classification [5], [6]. Consequently, their execution on von-Neumann machines consumes more energy for data movement than for computation [6]. This can be attributed to the fact that DNN computation is inherently different from the conventional von-Neumann computing model. Frequent data movement between a physically separate memory storage unit and a compute core forms the well-known von-Neumann bottleneck. To overcome this bottleneck, there has been intense research on reducing data movements [7]. Moreover, there have been proposals for in-memory computing [8], [9], where the underlying principle is to perform the computations as close to the memory as possible, or better, within the memory array itself [10], [11].

Typical DNNs (with artificial neurons) involve multiple transcendental function evaluations (for instance, sigmoid, tanh, logarithms, etc.). In addition, SNNs involve several bio-realistic neuron and synaptic differential equations, each having multiple transcendental function and high-order polynomial evaluations. The most efficient way to implement such functions is by storing look-up tables (LUTs) and math tables in read-only memories (ROMs). However, large dedicated ROMs incur significant area and power overheads. To that effect, [12] proposed embedding ROMs in standard CMOS SRAM caches (R-SRAM). R-SRAM allows placing a ROM within the conventional SRAM array (with corresponding architectural modifications), without degrading the area and performance benefits of the SRAM [12]. Such compute primitives provide significantly higher storage densities (bits/area), which can be leveraged for DNN and SNN computations by storing useful data (LUTs) without affecting the RAM storage, thereby avoiding the longer latencies and higher access energies associated with larger (or external) memory structures.
Fig. 1. R-SRAM schematic: Standard 6T-SRAM embedded with ROM. The only difference is the addition of an extra word-line (WL1 and WL2) to embed the ROM functionality.

In this work, we take R-SRAMs and R-MRAMs a step further and propose "SPARE", a generalized architecture for SNN acceleration using ROM-embedded RAMs as in-memory-compute primitives. SPARE consists of a 2-D array of Processing Elements (PEs) that spatially map a deep SNN, where each PE performs part of the SNN computations. Each PE contains its own R-SRAM/R-MRAM, which locally stores only the relevant synaptic data and the LUTs required for solving the neuron and synaptic differential equations. This localized processing leads to energy benefits, since only the neuron data (spikes) need to be transferred between PEs. Furthermore, since a PE operates only on the occurrence of an input spiking event, unnecessary computations and memory accesses are avoided. It is also worth noting that the R-SRAM/R-MRAM primitive can store several different neuron and synapse models, thereby providing the necessary flexibility. A PE thus synergistically combines the hardware benefits of R-SRAMs/R-MRAMs and the algorithmic benefits of SNNs. In summary, we make three key contributions.
1) We design an energy-efficient PE that leverages the "in-memory processing" abilities of ROM-embedded RAM structures and "event-driven computing" in SNNs. We evaluate the pros and cons of using both CMOS-based R-SRAMs and STT-MRAM-based R-MRAMs as memory units in the PE.
2) We design an efficient architecture (SPARE) using a 2-D mesh of PEs, to provide a platform for cognitive application deployment. We show the implementation of spiking neural networks (fully connected and convolutional) on SPARE.
3) We investigate the energy, performance and area benefits for typical image classification benchmarks to underscore the system scalability and utility, both for the training and inference phases.
II. BACKGROUND
A. ROM-Embedded RAMs
Previous studies on ROM-embedded RAMs were limited to logic testing and fast mathematical function evaluations [12], [13]. We explore the utility of R-SRAM and R-MRAM based memory structures towards designing efficient compute primitives for neuromorphic computing (SNN acceleration). Further, as discussed before, such memory units enable in-memory data processing, which can be of immense utility in DNN executions that are typically limited by the cost of data movements.
Fig. 2. Operation of R-SRAM in a) Normal RAM Mode and b) ROM Mode.
a) R-SRAM:
R-SRAM is a memory structure that consists of a hardware ROM embedded into a conventional CMOS SRAM array, with corresponding modifications at the architectural level to support ROM accesses [12]. Fig. 1 shows the structure of the R-SRAM cell array [12]. Unlike conventional 6T-SRAMs, R-SRAM bit-cells have an extra word-line (WL). The gate of the access transistor connects to WL1 or WL2, depending on the data to be embedded as ROM. Thus, if the bit-cell stores '0' ('1') as ROM data, the left access transistor (AXL) is connected to WL2 (WL1). The right access transistor (AXR) of the bit-cell follows the connectivity of the AXL of the neighboring bit-cell to the right. For completeness, we describe the R-SRAM operation, both for the RAM mode and the ROM mode of operation.
1) RAM mode: During the normal RAM mode, both word-lines, WL1 and WL2, are connected together. They are turned ON/OFF at the same time, so as to operate as a conventional 6T-SRAM for memory reads/writes. Note that there is no performance penalty on RAM operations compared to standard 6T-SRAM bit-cells.
2) ROM mode: To retrieve the ROM data in the ROM mode of operation, a sequence of steps is performed, summarized in Fig. 2. First, '1' is written to all bit-cells by turning both WL1 and WL2 ON. Thus, the whole row stores "1111...". Next, WL1 is turned OFF and '0' is written to all the cells while WL2 remains ON. Now only the bit-cells connected to WL2 store '0', while the others retain '1'.
Fig. 3. R-MRAM schematic: Standard STT-MRAM array with two bit-lines (BL1 and BL0) to embed ROM functionality. The peripheral circuitry for the RAM and ROM modes of operation is highlighted.
Fig. 4. Typical SNN dynamics. The input spikes are modulated by the synaptic weights, and the accumulated synaptic current is fed to the neuron. The neuron integrates the current and outputs a spike (fires) once its membrane potential exceeds a threshold.

However, if two consecutive bit-cells have different ROM data, this step performs a 5T write operation on the SRAM cell, since only one access transistor is ON. This may lead to a "write stability" problem in the bit-cells, which can be resolved using write-boost techniques [12]. The ROM data can now be read using the conventional RAM read operation. Note that the ROM data retrieval process destroys the initial RAM content. Hence, before ROM data retrieval, the RAM data of the corresponding block is written into a buffer, as shown in Fig. 2. After the ROM data has been retrieved, the RAM data of the block is restored.

It has been shown that R-SRAM incurs insignificant area (∼ ) and power (∼ ) overheads [12] to incorporate the additional word-line. Moreover, we will show later in our simulations that, despite the penalty of buffering RAM data for each ROM access, we obtain improvements in energy consumption at the system level.
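For concreteness, the ROM-retrieval sequence can be summarized as a small behavioral model. The Python sketch below is illustrative only (word-line connectivity is modeled as a list of bits, not as the actual circuit) and is not part of the R-SRAM design of [12].

```python
# Behavioral sketch (not circuit-accurate) of the R-SRAM ROM-retrieval sequence.
# 'rom_bits' encodes the hard-wired connectivity: a cell with ROM data '0' has
# its access transistor on WL2, a cell with ROM data '1' on WL1.

def read_rom_row(ram_row, rom_bits):
    """Return the embedded ROM data of one row, preserving the RAM data."""
    saved = list(ram_row)                  # 1) buffer the RAM contents

    # 2) WL1 = ON, WL2 = ON: write '1' into every cell of the row
    row = [1] * len(ram_row)

    # 3) WL1 = OFF, WL2 = ON: write '0'; only cells wired to WL2
    #    (ROM data '0') are overwritten, the rest keep the '1'
    row = [0 if rom == 0 else cell for cell, rom in zip(row, rom_bits)]

    rom_data = list(row)                   # 4) a normal RAM read now yields the ROM data

    ram_row[:] = saved                     # 5) restore the buffered RAM data
    return rom_data

ram = [1, 0, 1, 1]
print(read_rom_row(ram, rom_bits=[0, 1, 1, 0]))  # -> [0, 1, 1, 0]
print(ram)                                        # RAM contents restored
```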
b) R-MRAM: An R-MRAM is a memory structure made from a conventional STT-MRAM array by embedding a hardware ROM. This allows it to operate in both ROM and RAM modes [13]. As shown in Fig. 3, R-MRAM bit-cells have an additional bit-line (BL) compared to STT-MRAM.
Fig. 5. Storage of LUTs for various functions within the same ROM-embedded RAM array. The starting address for each type of LUT is predefined. An offset address (calculated from the input) is added to the starting address to perform the table lookup from the R-SRAM/R-MRAM. The number of memory rows required by each LUT type is predefined based on the desired precision of the transcendental function to be stored.
The physical connection of the bit-cell (fixed at design time) stores the ROM data. Bit-cells connected to BL0 store ROM data '0', whereas those connected to BL1 store ROM data '1'. Every bit-cell can be written/read for RAM operation by electrically connecting BL0 and BL1. However, ROM access and RAM access cannot occur simultaneously. Next, we describe the R-MRAM operation for the RAM and ROM modes of operation.
1) RAM mode: During a RAM-mode read operation, current from the read-bias generator flows through the pass transistors and the selected bit-cell to the Select Line (SL), as shown in Fig. 3. Consequently, a voltage appears on the positive input of the sense amplifier. The sense amplifier compares this voltage (dependent on the resistance of the selected bit-cell) to a reference voltage to output a '1' or '0'. For a write operation, EnRAM is asserted to turn ON the pass transistors, and the write driver drives both BLs and the SL.
2) ROM mode: For a ROM read operation, EnRAM is deasserted to turn OFF the pass transistors and the latch is turned ON. If the selected bit-cell is connected to BL1 (BL0), BL1 (BL0) gets discharged and ROMOut outputs a '1' ('0'). Contrary to R-SRAM, the non-volatility of STT-MRAM prevents the RAM data from being lost in R-MRAM during a ROM read operation. It has been shown in prior studies that the R-MRAM design with an extra BL has no area overhead at the array level and does not impact the performance of the memory [13]. Note that during the ROM mode, the RAM data is not disturbed due to the non-volatility of R-MRAM, thereby simplifying the ROM retrieval process. This results in higher energy benefits of using R-MRAM in SPARE, as we will show later in our simulations.
B. Spiking Neural Networks (SNNs)
SNNs have emerged as a power-efficient choice for cognitive applications. SNNs are built using bio-plausible neurons and synapses. All information flow is converted into a train of spikes, similar to the information flow in the human brain.
Fig. 6. Block-level diagram of SPARE. (a) Figure shows how a deep neural network is mapped to a 2-D array of PEs connected together. The global memory stores the spiking events at every layer output, and broadcasts them to the input of the next layer. (b) Figure zooms into the logical diagram of the PE. It consists of a ROM-embedded RAM to store the state variables along with LUTs of the synapse, neuron and synaptic plasticity models, a computation core to generate output spikes, input buffers to store the incoming spike broadcast, an event controller to schedule memory transactions, a state updater to update the entries in the memory, and an output buffer to store the output spikes generated.

Refer to Fig. 4. The input spikes V_i are modulated by the synapse weights W_i. At every time-step, V_i is either '1' (spiking event) or '0' (no spike), whereas W_i is a number between -1 and 1, signifying the strength of the connection. The output from all synapses is summed up and fed to the next neuron. The neuron keeps track of its membrane potential (V_mem), which gets updated based on the synaptic current. Subsequently, V_mem accumulates/decays over several time-steps until it reaches a certain threshold V_th, when the neuron emits an output spike ('1'). This spike is then transmitted to the neurons in the next layer. Depending on the neuron model, the V_mem dynamics differ in behavior and complexity. During the training phase, the synaptic weights W_i undergo changes to learn the input patterns. Many spike-based learning rules have been proposed, for example, Spike Timing Dependent Plasticity (STDP) [14], Long-Term Potentiation [15], etc. The basic idea is to determine the correlation between the input and output neuron spiking activities, and from it the corresponding synaptic weight updates. Once the weights are trained, however, the synaptic strengths remain unchanged during the inference phase. These plasticity rules are the basis for unsupervised learning in SNNs.
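As an illustration of these dynamics, the sketch below implements a discrete-time leaky-integrate-and-fire update; the leak factor and threshold are arbitrary placeholder values, not the parameters used in the evaluation later in this paper.

```python
import numpy as np

# Illustrative leaky-integrate-and-fire update over discrete time-steps.
# Parameter values (leak, v_th) are arbitrary placeholders.
def lif_step(v_mem, spikes_in, weights, leak=0.95, v_th=1.0):
    """One SNN time-step: accumulate weighted input spikes, leak, fire, reset."""
    i_syn = np.dot(weights, spikes_in)            # synaptic current from binary spikes
    v_mem = leak * v_mem + i_syn                  # integrate with leak
    spikes_out = (v_mem >= v_th).astype(np.uint8)
    v_mem = np.where(spikes_out == 1, 0.0, v_mem) # reset the neurons that fired
    return v_mem, spikes_out

rng = np.random.default_rng(0)
w = rng.uniform(-1, 1, size=(4, 8))               # 8 inputs, 4 output neurons
v = np.zeros(4)
for t in range(10):
    v, out = lif_step(v, rng.integers(0, 2, size=8), w)
```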
C. LUT-based storage in R-SRAMs and R-MRAMs

The computations required in the SNN described above rely heavily on transcendental functions and polynomial evaluations. The dynamics of V_mem, the synaptic current flow, STDP learning, etc., all require solving differential equations with mostly exponential and high-order polynomial evaluations. The only efficient way to compute these functions in hardware is by the use of math tables or LUTs [16]. Taking the example of a typical STDP evaluation in SNNs, we show how the LUTs are structured in the R-SRAM/R-MRAM.

1) STDP involves a synaptic weight update based on the time difference between the post- and pre-neuron spikes (t = t_post - t_pre). According to this empirical rule, the change in the synaptic weight is proportional to e^t.
2) Range reduction: t can have an arbitrary value. Thus, t is broken into t = N(log 2)/2^K + r, where K is a designer's choice that determines the size of the LUT and N = ⌊t · 2^K / log 2⌋. The remainder r = t − N(log 2)/2^K is therefore confined to the range 0 ≤ r < (log 2)/2^K. The exponential e^t is thus reduced to 2^(N/2^K) · e^r.
3) Approximation: Due to the limited range of r, e^r can be approximated with a low-order polynomial (since e^r = 1 + r + r^2/2 + ...).
4) Reconstruction: To evaluate 2^(N/2^K), let N = M·2^K + d, where M = ⌊N/2^K⌋ and d = 0, 1, ..., 2^K − 1. Thus 2^(N/2^K) = 2^M · 2^(d/2^K). Using d as a memory address to the R-SRAM/R-MRAM, the corresponding ROM data (LUT) is fetched, which stores 2^(d/2^K). The exponential reduces to e^t = 2^M × LUT(d) × e^r. The multiplication by 2^M is a simple shift operation in hardware.
5) The exponential e^t is thus used to evaluate the weight update, completing one STDP evaluation.

Other transcendental functions and polynomials can be similarly mapped to LUTs, as described in detail in [16]. Various LUTs are stored within the same array, as shown in Fig. 5. The starting address of each LUT is pre-defined and is used to perform table lookups. In the example above, when a 'Fetch LUT' command is issued, two inputs are provided: the type of LUT (exponential) and the offset (d). The memory address from which the lookup needs to be made is calculated by adding the offset to the LUT index corresponding to the exponential LUT.
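A minimal software sketch of this range-reduction scheme is given below. The value of K, the LUT precision and the polynomial order are illustrative choices; in hardware the table lives in the ROM portion of the R-SRAM/R-MRAM and the final multiplication by 2^M is realized as a shift.

```python
import math

K = 5                                                    # designer's choice: 2^K LUT entries
LUT_EXP2 = [2 ** (d / 2 ** K) for d in range(2 ** K)]    # ROM contents: 2^(d/2^K)

def exp_lut(t):
    """Evaluate e^t via range reduction and a 2^K-entry ROM table."""
    n = math.floor(t * 2 ** K / math.log(2))      # t = n*(ln 2)/2^K + r
    r = t - n * math.log(2) / 2 ** K              # 0 <= r < (ln 2)/2^K
    m, d = divmod(n, 2 ** K)                      # n = m*2^K + d
    poly = 1 + r + r * r / 2 + r ** 3 / 6         # low-order polynomial for e^r
    return (2.0 ** m) * LUT_EXP2[d] * poly        # 2^m is a shift in hardware

for t in (-3.2, 0.1, 4.7):
    print(t, exp_lut(t), math.exp(t))
```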
Fig. 7. Mapping of CNNs in SPARE: The input map is window-split based on the kernel size of the particular layer. These windows are then broadcast to all PEs mapped to that layer. Each PE stores different kernels, and the PEs process the data in parallel as they receive the inputs in a window-split manner. Each PE computes part of the output feature map, highlighted through the color coding of PEs in the figure. The output is rearranged and stored back to the global memory unit.
III. SPARE: SNN ACCELERATOR USING ROM-EMBEDDED RAMS

A. SPARE Organization
We propose SPARE, a many-core architecture designed for efficient acceleration of SNNs. As shown in Fig. 6(a), it consists of a 2-dimensional PE array coupled with a global memory and a central control unit. A PE can perform all synapse and neuron functionalities required by different types of SNNs. This flexibility is essential as SNN computations typically differ at various levels - neurons, synapses and synaptic weight updates - depending on the application. Layers of an SNN are spatially partitioned across different PEs depending on the network size. The number of neuron state variables and synaptic weights each PE can store is limited by the memory contained within each PE. Based on the network size, the number of PEs mapped to each layer is specified.

The SNN computation occurs in time-steps. At each time-step, the neuron firing data is transferred from one layer to the next. Input data spikes (for a given time-step) stored in the global memory are broadcast over the shared bus. Subsequently, the PEs mapping the first layer of the SNN start buffering the data and execute their SNN partition. Once the spikes for the first layer have been transmitted, the spikes for the next layer are broadcast, and the PEs mapped to the second layer start their computations, and so on. All synaptic data is stored locally within each PE. Once the layer-1 PEs finish their execution, their output data (spikes) are written back to the global memory. Subsequently, data from all PEs is written back into the global memory, layer by layer. Consequently, this successive data transfer (neuron data) between the global memory and the PEs realizes one time-step of the SNN computation. It is worth noting that only neuron data movements occur between PEs and global memory, whereas the synapse data is locally read from the PE's RAM. This reduces the data movements in SPARE compared to a von-Neumann machine, which would involve moving both neuron and synapse data between the global memory and the computation core. This reduction is extremely significant, as typical SNNs have 1000x more synapses than neurons [17].

We extend this approach to map convolutional neural networks (CNNs) using SPARE. CNNs have been shown effective for image classification tasks, achieving state-of-the-art accuracies, occasionally surpassing human performance [18], [19].
Fig. 8. Logical flow diagram of the event controller, describing the SNN computations performed in the PE. The computation is subdivided into three main blocks. 1) Synapse model block: computes the output synaptic current. 2) Neuron model block: keeps track of the membrane potential of the output neurons. 3) Plasticity model block: updates the synaptic weights during the training phase. This block is skipped during the inference phase.

The standard architecture consists of alternate convolutional (c-) and spatial-pooling (s-) layers, followed by a final fully-connected (fc-) layer. Each convolutional layer hierarchically extracts complex features from the input image. This is done by using shared weight kernels that perform a convolution operation on the input image. The output of one convolutional layer becomes the input of the next. Thus, the kernels in the first convolutional layer learn low-level features, for example edges and corners, while in deeper layers they learn high-level features, using these low-level features as inputs. A spatial-pooling layer is added in between two convolutional layers to reduce the dimensions of the convolutional feature maps. This layer maintains the depth of the input map but reduces the spatial dimensions. Finally, a fully-connected layer is used to determine the output class of the input image. Fig. 7 shows how a convolutional layer can be mapped to SPARE. The input map is split using a small window that strides throughout the image. The window size is governed by the kernel size of that layer. Input spikes are broadcast to the PEs in this window-split manner (instead of a pixel-by-pixel manner), where each PE stores a different kernel of that layer. Thus, each PE computes part of the output feature map, which is then merged and stored in the global memory unit, as shown in the figure. Layer parameters (for example, stride, kernel size and number of output maps) are programmed into the global control unit to implement this 'window-split input broadcast' and 'output merge' in the global memory. Note that the s- and fc- layers can be configured as c- layers with appropriate parameters. For an s- layer, the parameters are: stride = 2, kernel size = 2x2, number of output maps = number of input maps.
Fig. 9. Timing diagram illustrating the inter-layer pipelining in SPARE. As soon as the PEs receive and buffer the input data, they start processing. Meanwhile, data for PEs mapped to subsequent layers is transmitted. Since the data transfer time is small compared to the computation time within the PE, all PEs process data in parallel.
Whereas for an fc- layer: stride = 0, kernel size = input feature size, number of output maps = number of output neurons. Thus, the proposed architecture is a generalized programmable architecture that maps convolutional, spatial-pooling, as well as fully-connected layers.
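The window-split broadcast and output merge just described can be illustrated with the following sketch; the dimensions and the dense arithmetic are placeholders for the spike-based, PE-parallel computation performed in hardware.

```python
import numpy as np

# Illustrative window-split of a binary spike map for one convolutional layer.
def window_split(spike_map, kernel, stride):
    """Yield (row, col, patch) windows that are broadcast to the PEs of a layer."""
    h, w, _ = spike_map.shape
    for r in range(0, h - kernel + 1, stride):
        for c in range(0, w - kernel + 1, stride):
            yield r, c, spike_map[r:r + kernel, c:c + kernel, :]

rng = np.random.default_rng(1)
spikes = rng.integers(0, 2, size=(32, 32, 3))      # input spike map at one time-step

# c- layer example: 5x5 kernels, stride 1; each PE holds a different kernel and
# accumulates its own slice of the output feature map ("output merge").
kernels = rng.uniform(-1, 1, size=(24, 5, 5, 3))   # 24 kernels mapped across PEs
out = np.zeros((28, 28, 24))
for r, c, patch in window_split(spikes, kernel=5, stride=1):
    out[r, c, :] = np.tensordot(kernels, patch, axes=([1, 2, 3], [1, 2, 3]))

# An s- layer would reuse the same mechanism with stride = 2 and a 2x2 kernel,
# and an fc- layer with a window covering the whole input feature map.
```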
B. Inter-layer pipelining
SPARE enables a pipelined execution of the layers in an SNN to exploit the available inter-layer data parallelism. Layers of the SNN are mapped across the 2-dimensional PE array. Hence, while layer-2 PEs are computing on the n-th input image, layer-1 PEs compute on the (n+1)-th input image, and so on. Data communication between layers of the SNN is achieved by scatter and gather operations initiated by the SPARE control unit (see Fig. 6(a)) to move data between the global memory and the PE array. The SPARE control unit stores the mapping information between SNN layers and PEs. A gather operation for a layer collects the output data computed by the PEs mapped to that layer and stores it in the global memory. A scatter operation for a layer sends the input data to the required PEs, as sketched below. It is important to note that data communication in SNNs is of a feed-forward nature, where PEs mapped to layer n only send data to PEs mapped to the subsequent layer n+1, and so on. Hence, we do not support a dedicated on-chip network for all-to-all PE communication, due to the associated area and power overheads. The "in-memory" nature of our computing results in PEs spending more time in computation (within the PE) than in sending and receiving data from the global memory. Hence, our inter-layer communication based on a shared resource (global memory) does not lead to performance issues, due to the natural pipelining obtained as shown in Fig. 9.
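The sketch below gives a schematic (not cycle-accurate) view of this scatter/gather schedule; the PE behavior is a stub and the global-memory layout is assumed purely for illustration. In hardware, the iterations for successive layers overlap in time, which yields the inter-layer pipelining of Fig. 9.

```python
# Schematic view of the control unit's layer-by-layer scatter/gather schedule.
class StubPE:
    def __init__(self, neuron_ids):
        self.neuron_ids = neuron_ids              # output neurons mapped to this PE
    def process(self, spikes_in):
        # Placeholder for the Synapse/Neuron/Plasticity blocks of Fig. 8.
        return [(n, 1) for n in self.neuron_ids if sum(spikes_in) > 0]

def run_timestep(global_memory, pe_groups):
    for layer, pes in enumerate(pe_groups):
        spikes_in = global_memory[layer]                              # scatter: broadcast input
        outputs = [s for pe in pes for s in pe.process(spikes_in)]
        global_memory[layer + 1] = [v for _, v in sorted(outputs)]    # gather + merge

pe_groups = [[StubPE([0, 1]), StubPE([2, 3])], [StubPE([0])]]         # two layers of PEs
global_memory = {0: [1, 0, 1, 0], 1: None, 2: None}
run_timestep(global_memory, pe_groups)
print(global_memory[2])
```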
C. Processing Element (PE)

As shown in Fig. 6(b), a PE contains a computing core to perform the SNN computations and a memory unit to store the neuron-synapse models, state variables and LUTs. The memory unit (RAM and ROM) and the computation core within the PE localize most of the data movements required for computing the output neurons (mapped to the PE), thereby enabling "in-memory processing".
Fig. 10. Differential equations describing the dynamics of neurons and an LUT-based approach to implement them in SPARE. (a) Leaky-integrate-and-fire neuron, (b) Izhikevich neuron, (c) Hodgkin-Huxley neuron.

While the RAM houses all the synaptic weights and state variables required for the output neuron computations, the ROM stores the LUTs required for modeling the synapse, neuron and synaptic weight update computations. Consequently, the higher storage density enabled by ROM-embedded RAMs (smaller memory size) and the resulting reduction in data movements increase the computation efficiency and reduce the overall energy consumption. The computational flow in a PE and a step-by-step procedure for a typical SNN computation are illustrated in Fig. 8. It consists of three main blocks: 1) Synapse model, 2) Neuron model and 3) Plasticity model. The event controller checks the head of the input spike buffer, and both the Synapse and Neuron blocks are skipped if the input is '0', thereby leveraging the benefits of event-driven computing in SNNs to achieve energy efficiency. Similarly, the Plasticity block is skipped if V_mem is less than the threshold (no synaptic weight update). PEs are modeled as extended finite-state machines. As soon as a PE receives the broadcast of spikes corresponding to its layer tag, it starts computing. Thus, effectively all PEs run in parallel, exploiting data parallelism and inter-layer pipelining. Since the input spikes are broadcast to all PEs, each PE performs the SNN computation corresponding to the neurons and synapses it is mapped to. Since SPARE localizes data movement through in-memory computing, the same memory storage unit also contains the LUTs used in the SNN computations, allowing a simple and compact PE design.
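A simplified software view of this event-driven flow is sketched below; the leak factor, threshold and STDP step are illustrative placeholders standing in for the LUT-based model evaluations held in the ROM-embedded RAM, not the actual PE microarchitecture.

```python
import numpy as np

# Simplified software view of the event-controller flow of Fig. 8.
def pe_timestep(spike_vector, weights, v_mem, v_th=1.0, leak=0.9, training=False):
    spikes_out = []
    for pre, spike in enumerate(spike_vector):
        if spike == 0:
            continue                        # event-driven: Synapse/Neuron blocks skipped
        v_mem += weights[:, pre]            # Synapse block: accumulate synaptic current

    v_mem *= leak                           # Neuron block: leak (an LUT value in hardware)
    for post in range(v_mem.size):
        if v_mem[post] < v_th:
            continue                        # Plasticity block skipped below threshold
        spikes_out.append(post)
        v_mem[post] = 0.0                   # reset the fired neuron
        if training:                        # STDP-like update (exponential LUT in hardware)
            weights[post, :] += 0.01 * np.asarray(spike_vector)
    return spikes_out

rng = np.random.default_rng(4)
w = rng.uniform(0, 1, size=(4, 8))
v = np.zeros(4)
out = pe_timestep([1, 0, 1, 1, 0, 0, 1, 0], w, v, training=True)
```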
Design    | Energy (pJ): RAM Read / RAM Write / ROM Read | Latency (ns): RAM Read / RAM Write / ROM Read | Leakage (mW)
SRAM      | 38.42 / 33.39 / 38.42                        | 0.53 / 0.53 / 0.53                            | 74.89
R-SRAM    | 16.99 / 14.48 / 62.94*                       | 0.418 / 0.418 / 1.254*                        | 40.92
STT-MRAM  | 35.48 / 146.31 / 35.48                       | 1.18 / 10.34 / 1.18                           | 0.72
R-MRAM    | 17.93 / 73.37 / 17.93                        | 1.16 / 10.32 / 1.16                           | 0.48

Fig. 11. Energy and latency for read/write accesses for all designs considered in this work: SRAM, R-SRAM, STT-MRAM, and R-MRAM. (*ROM read for R-SRAM includes the additional overhead of buffering the RAM, retrieving the ROM data and storing back the buffered RAM data, as described in Section II-A.)
Parameter                | Value
Frequency                | 1 GHz
Technology node          | 45 nm
Memory unit              | 32 KB ROM-embedded RAM
Buffer depth             | 32
Synapse weight precision | 8-bit
Neuron V_mem precision   | 8-bit
Data width               | 32-bit
ALU registers            | 32-bit fixed point
PE core static power     | 1.311 mW
Fig. 12. Microarchitecture design parameters used for the simulations.
D. Modeling complex neuro-synaptic functionality
Most neuron-synapse models involve complicated differential equations, and heavily use high-order polynomials and transcendental functions. This makes them highly suitable for LUT-based storage in ROM-embedded RAMs. Thus, our PE can incorporate any model needed by the SNN application without much area overhead. To illustrate this, the dynamics of three different neuron models - LIF [20], Izhikevich [21], and Hodgkin-Huxley (HH) [22] - are shown in Fig. 10. Each model can be implemented in SPARE by modifying the 'neuron model' block in the state diagram (see Fig. 8), with corresponding alterations. As shown in Fig. 10(c), the HH model is described by 7 differential equations with 4 state variables: V_mem, m, n and h. Firstly, we need 4 RAM fetches to read the state variables. To update m, n, and h, we need a total of 6 ROM fetches, one each for α_m,n,h and β_m,n,h. Next, we need to evaluate high-order polynomials (also stored in LUTs) in order to calculate i_Na, i_K and i_l. Thus, a total of 9 ROM fetches, 4 RAM fetches and 4 RAM updates per spike per time-step are required for the HH model, in contrast to 1 ROM fetch, 1 RAM fetch and 1 RAM update in the case of a simple LIF model. However, the overall data-flow diagram remains unchanged. A similar approach can be used to implement various synapse and plasticity models, by modifying the synapse and plasticity blocks, respectively. As the models become more complex, more LUTs are required to store multiple polynomial functions and math tables. Moreover, the number of ROM and RAM fetches also increases. SPARE addresses both these issues, since the ROM-embedded RAM primitive provides extra ROM for LUT storage, thereby allowing a compact memory unit. Data localization in SPARE, along with the compact memory storage unit, enables a lower energy/latency per ROM/RAM access.
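As a rough illustration (not a result reported in this paper), the access counts just quoted can be combined with the per-access R-MRAM numbers of Fig. 11 to compare the memory-access energy of the three neuron models per spike per time-step:

```python
# Back-of-the-envelope sketch combining the access counts quoted above with
# the R-MRAM per-access energies of Fig. 11. Purely illustrative.

ACCESS_ENERGY_PJ = {"ram_read": 17.93, "ram_write": 73.37, "rom_read": 17.93}  # R-MRAM

ACCESS_COUNTS = {                     # (ROM reads, RAM reads, RAM writes) per spike/time-step
    "LIF":            (1, 1, 1),
    "Izhikevich":     (2, 2, 2),
    "Hodgkin-Huxley": (9, 4, 4),
}

for model, (rom, rd, wr) in ACCESS_COUNTS.items():
    energy = (rom * ACCESS_ENERGY_PJ["rom_read"]
              + rd * ACCESS_ENERGY_PJ["ram_read"]
              + wr * ACCESS_ENERGY_PJ["ram_write"])
    print(f"{model}: {energy:.1f} pJ of memory accesses")
```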
IV. EXPERIMENTAL METHODOLOGY
The PE was modeled at the Register Transfer Level (RTL) in Verilog and synthesized to the IBM 45 nm technology library using Synopsys Design Compiler to estimate the power and area consumption. The R-SRAM and R-MRAM (memory units) were modeled using Cacti [24] and NVSim [25], respectively, for the corresponding RAM sizes at the 45 nm technology node. Subsequently, we account for the modified ROM access cycles and peripheral circuits described in Sec. II-A. Fig. 11 summarizes the RAM/ROM read-write energy and latency obtained from simulations for SRAM, R-SRAM, STT-MRAM and R-MRAM. Cycle-accurate RTL simulations were performed to get estimates of the memory (RAM, ROM) access traces and, subsequently, the overall energy consumption per classification. Fig. 12 summarizes the various micro-architecture parameters used for the simulations.

We analyze the energy, performance and area benefits of SPARE on the MNIST dataset [26] and the CIFAR-10 dataset [27]. For an apples-to-apples comparison, we use a similar architecture built with PEs comprised of typical SRAM and STT-MRAM as our baselines (without the ROM-embedded RAM capability). Additionally, to demonstrate system scalability, we benchmark SPARE with network sizes of different scales, varying from 1184 to 36602 neurons. Fig. 13 tabulates the benchmarks chosen [4], [23], [28], and the number of PEs required in each case. Note that benchmarks 'MNIST-1,2,3' are typical two-layer SNNs that can be trained using STDP learning [23]. The input layer has 784 neurons, each corresponding to an input pixel in the image. The output layer has 400, 1600 and 6400 neurons for benchmarks 'MNIST-1, 2 and 3', respectively. For deep spiking networks beyond two layers, there has not been any successful attempt to generalize a training algorithm in the spiking domain. However, [4], [28] show that off-line training of the network using DNN techniques (standard back-propagation algorithm) and converting the trained network to an SNN does not incur significant performance degradation. Thus, to evaluate SPARE on deep networks, we use benchmarks 'MNIST-4' and 'CIFAR10' in the inference phase.
Benchmark | Network configuration       | Number of PEs | Memory requirement
MNIST-1   | 784 x 400                   | 16            | 0.5
MNIST-2   | 784 x 1600                  | 50            | 1.56
MNIST-3   | 784 x 6400                  | 200           | 6.25
MNIST-4   | 784 x 1200 x 1200 x 10      | 100           | 3.125
CIFAR10   | 32x32x3-24c5-2s-80c5-2s-10o | 220           | 6.875

RAM content: synaptic/kernel weights, state variables (V_mem), spike times. ROM content: LUTs for the synapse model (as a function of the weights), the neuron models (LIF and HH, as functions of V_mem and the gating variables x = m, n, h), and the exponential LUT for STDP learning in the plasticity block.
Fig. 13. SNN benchmarks used in the SPARE evaluation [4], [23]. The figure tabulates the number of PEs required, the memory requirement, and the RAM/ROM content for each benchmark and neuron model.

'MNIST-4' is a deep, multi-layered, fully-connected SNN converted from a trained DNN [4]. It consists of an input layer of 784 neurons, followed by two hidden layers with 1200 neurons each, and finally an output layer of 10 neurons. Benchmark 'CIFAR10', on the other hand, is a deep convolutional neural network converted from a trained CNN (32x32x3-24c5-2s-80c5-2s-10o). The CNN has two convolutional (c-) and two spatial-pooling (s-) layers arranged alternately, followed by a fully-connected (fc-) output layer. The dimension of the input image is 32x32x3. The first c- layer consists of 24 kernels of size 5x5x3. The following s- layer has kernels of size 2x2. The second c- layer has 80 kernels of size 5x5x24, followed by another s- layer with kernel size 2x2. The final layer has 10 neurons, fully connected to the previous layer. In all our simulations, we use the LIF neuron model along with the exponential STDP-based plasticity for the training phase.

Input spike trains were generated from the input image pixels based on the rate-coding approach used in [29]. Each image is split up into several time-steps, each conveying the input firing activity. We analyze the benefits of SPARE towards leveraging SNN data sparsity (event-drivenness) by analyzing each SNN at different maximum input firing rates, fp = 0.4 and fp = 1 [28]. Kindly note that in our work, we use the SNN size and dataset statistics only for exploring system scalability and the benefits of in-memory computing for training and testing SNNs. Mapping of these networks to the proposed architecture does not lead to any degradation in the classification accuracy. Readers are referred to [4], [23], [28] to explore the classification accuracies achieved with the benchmarks used.
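A typical rate-coding scheme of this kind can be sketched as follows; the exact encoding of [29] may differ, so the parameter values here are illustrative.

```python
import numpy as np

# Illustrative rate coding of input pixels into spike trains: each pixel fires
# with a probability proportional to its intensity, capped at the maximum
# firing rate fp. Parameter values are examples, not the exact setup of [29].

def rate_code(image, time_steps, fp, rng):
    """Return a (time_steps, num_pixels) binary spike train for one image."""
    p_fire = fp * image.reshape(-1) / image.max()        # per-pixel spike probability
    return (rng.random((time_steps, p_fire.size)) < p_fire).astype(np.uint8)

rng = np.random.default_rng(3)
img = rng.integers(0, 256, size=(28, 28)).astype(float)  # stand-in for an MNIST digit
spikes = rate_code(img, time_steps=100, fp=0.4, rng=rng)
print(spikes.shape, spikes.mean())
```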
V. RESULTS

A. Energy
A common base reference was used to normalize all energy numbers obtained through simulations, such that the minimum energy consumption bar (inference of MNIST-1 for R-MRAM with fp = 0.4) represents 1. All other energy bars in Fig. 14, Fig. 15, and Fig. 17 are normalized to this value. Fig. 14 shows the energy consumption for benchmarks 'MNIST-1,2,3', both for the training and inference phases. Each bar shows the total energy, which is further split into three sub-components: 1. RAM (access + leakage), 2. ROM (access + leakage) and 3. Rest (core - buffer, control, compute). The following observations can be drawn from Fig. 14. 1) An increase in the maximum firing rate (fp) results in increased overall energy consumption across all datasets. This is because a higher firing rate results in an increased number of spikes. Consequently, this increases the number of RAM/ROM accesses, thereby decreasing the benefits from event-driven computing in SPARE. It also increases the overall computations, as more synapses will be accumulated over the output neurons. This underscores the effectiveness of SPARE in drawing benefits from the data sparsity in SNNs. 2) The total energy consumption in the inference phase is lower compared to the training phase because the Plasticity block (refer to Fig. 8) is skipped during the inference phase, as described in Sec. III-C. 3) The energy consumption with STT-MRAM technology is more than ∼ × lower compared to CMOS-based memory technology. This was expected since STT-MRAM is an NVM, and thus the leakage due to memory is close to 0. Although writing into STT-MRAM is expensive compared to CMOS, the near-zero leakage is the dominant factor in reducing the energy consumption. 4) Using R-SRAM and R-MRAM as the memory units in the PE, we obtain . × and . × reduction in energy consumption on average, compared to CMOS SRAM and STT-MRAM, respectively. This is a direct consequence of the increased storage density (or, smaller area for iso-bytes) provided by ROM-embedded RAMs. A smaller memory reduces the access energy and the leakage, thereby leading to energy benefits. However, note that for iso-area, the higher storage density (through ROM-embedded RAMs) allows bigger on-PE storage, eliminating data movements required from external memory (in the case of a typical RAM). This also leads to energy benefits. Note that we have used iso-storage PEs in our simulations to evaluate the energy benefits. 5) The energy improvement for the STT-MRAM technology is greater compared to CMOS, due to the simpler ROM retrieval process in R-MRAMs compared to R-SRAMs (refer to Sec. II-A). R-SRAMs require additional steps to buffer the RAM data for each ROM access, which is not required in R-MRAMs.

Moving to deeper networks, Fig. 15 shows the energy consumption for deep networks, illustrating the scalability of SPARE towards executing SNN workloads. A few additional observations can be inferred: 1) Benchmark 'MNIST4', being a deeper extension of 'MNIST1-3', obtains similar improvements of . × and . × reduction in energy consumption for the CMOS and STT-MRAM technologies, respectively. 2) For a deep convolutional network ('CIFAR10'), the improvements are . × and . ×, for CMOS and STT-MRAM, respectively. CNNs are more compute-intensive compared to memory-intensive fully-connected networks [30].
Fig. 14. Normalized energy consumption for a) the training phase and b) the inference phase, for benchmarks 'MNIST1-3'. The simulations are performed for maximum firing rates fp = 0.4 and 1. The energy bars are further split into RAM (read/write energy + leakage), ROM (read energy + leakage) and Rest (core energy). The energy values are normalized to the common base reference.
Fig. 15. Normalized energy consumption for benchmarks 'MNIST4' and 'CIFAR10'. The simulations are performed for maximum firing rates fp = 0.4 and 1. The energy bars are further split into RAM (read/write energy + leakage), ROM (read energy + leakage) and Rest (core energy). The energy values are normalized to the common base reference.

Thus, more energy is spent in computations than in memory transactions, and the energy consumed by the core and the memory leakage energy are significant. For the CMOS case in benchmark 'CIFAR10', the memory leakage overwhelms the core energy consumption (see Fig. 15), whereas in STT-MRAM, the core energy consumption overwhelms the memory energy (no leakage!). For this reason, the improvement from using R-MRAMs is suppressed in CNNs. Comparing only the memory energy consumption (RAM+ROM), we still obtain ∼ × improvement for R-MRAMs; however, the core energy being dominant reduces the overall benefits. Note that using the STT-MRAM technology itself decreases the energy consumption by an order of magnitude compared to CMOS. Thus, we conclude that using R-SRAMs over typical SRAMs as compute units leads to ∼ . × improvement in energy for both fully-connected networks and convolutional networks, whereas using R-MRAMs over typical STT-MRAMs leads to ∼ . × improvement for fully-connected networks, and ∼ . × for CNNs.
Fig. 16. Per-PE area with SRAM, R-SRAM, STT-MRAM and R-MRAM as memory units (for iso-storage).
B. Area
By using R-SRAMs and R-MRAMs in the PEs, . × and . × area benefits are achieved on a per-PE basis, for R-SRAM and R-MRAM, respectively, as shown in Fig. 16. This is because the ROM-embedded RAM effectively provides extra ROM with no area overhead. Moreover, the PE area is dominated by the memory unit, since the core (buffers, controller and computation core) consumes a small portion of the total area. The area consumption of the shared bus and the global memory is insignificant with respect to the total PE area (hundreds of PEs are used in the benchmarks - see Fig. 13). This translates to SPARE being more area-efficient compared to a normal-RAM based system.
C. Performance

In the previous section, we observed a reduction in PE area by a factor of . - . × by using R-SRAMs/R-MRAMs, due to the higher storage density provided by ROM-embedded RAMs. For a given chip area (iso-area), we can fit about twice as many PEs that use R-SRAM and R-MRAM compared to typical SRAMs and STT-MRAMs, respectively, by using smaller memory sizes.
Fig. 17. Normalized energy consumption for using the Hodgkin-Huxley neuron model on the SNN benchmarks. The simulations are performed for a maximum firing rate of fp = 0.4. The energy bars are further split into RAM (read/write energy + leakage), ROM (read energy + leakage) and Rest (core energy). The energy values are normalized to the common base reference.

Computations (neurons in a layer) can therefore be split between more PEs, translating to . - . × performance benefits. Note that we assume a ROM:RAM ratio of 1:1. This is reasonable for SNN computations due to the extensive LUT demands arising from the various math function requirements. However, if the designer wishes to decrease this ratio (at the cost of lower precision of the LUTs and lower flexibility with respect to neuro-synaptic functionalities), the performance improvement would be smaller as the ratio decreases.

D. Complex neuro-synaptic models
We expect to achieve higher benefits in mapping more complicated neuro-synaptic models, due to the increased LUT storage demands and ROM accesses per classification. In the literature, the usage of complicated models, such as the Hodgkin-Huxley (HH) and Izhikevich neuron models, is limited to biological experiments, and no references report a decent classification accuracy using such models in SNN classification tasks. However, in order to evaluate SPARE for more complex models, we estimate the energy benefits of using the HH neuron model for the same benchmarks used before. Note that the level of complexity of the differential equations increases from LIF to Izhikevich to HH, as described earlier in Section III-D. For LIF, 1 ROM fetch, 1 RAM write, and 1 RAM read are required per computation. For Izhikevich, 2 ROM fetches, 2 RAM writes, and 2 RAM reads are required, while for HH, 9 ROM fetches, 4 RAM writes, and 4 RAM reads are needed. Thus, it is clear that the energy consumption increases as we go from LIF to Izhikevich to HH. To avoid clutter, we only compare the two extreme cases (LIF and HH) to evaluate SPARE with complex neuron models. Fig. 17 shows the normalized energy consumption for the inference phase for the MNIST and CIFAR10 datasets, using the HH neuron model at a maximum firing rate of fp = 0.4. Note that these are only projected values showing the energy profiles of using HH neurons for SNN workloads. The following observations can be inferred: 1) The energy consumption is higher, compared to the LIF neuron implementation, across all datasets. Moreover, the energy spent in ROM accesses is higher than that spent in RAM accesses. This is a direct consequence of the additional RAM and ROM fetches (9 ROM fetches, 4 RAM fetches, 4 RAM updates per spike per time-step) involved in solving the complex differential equations of HH neurons. 2) For fully-connected networks, we obtain . × and . × reduction in energy consumption on average for R-SRAMs and R-MRAMs compared to CMOS SRAM and STT-MRAM, respectively, while for convolutional networks, we obtain . × and . × reduction. Note that the corresponding improvements in energy are higher for the R-MRAM technology, but lower for the R-SRAM technology, compared to the LIF neuron case (Sec. V-A). This is due to the fact that R-SRAMs have an additional overhead in the ROM retrieval process, as described earlier. Since HH neurons involve many ROM accesses, this overhead leads to a reduction in the energy improvements, while for R-MRAMs, since the overhead is minimal, the increased ROM accesses lead to higher energy benefits.
VI. CONCLUSION

In this paper, we presented SPARE, an architecture utilizing the 'in-memory processing' abilities of ROM-embedded RAMs to enable efficient acceleration of SNNs. Each processing unit in SPARE performs event-driven processing in order to leverage the benefits of input data sparsity in SNNs. We analyzed the trade-offs of using CMOS-based R-SRAMs and STT-MRAM-based R-MRAMs as memory units in SPARE for different types of networks. Our experiments on various SNN benchmarks for image classification applications reveal that R-MRAMs are suitable for mapping fully-connected networks, with ∼ . × lower energy compared to typical STT-MRAM arrays, while R-SRAMs are suitable for mapping CNNs, with ∼ . × lower energy compared to typical SRAM arrays. R-SRAM and R-MRAM achieve ∼ . × reduction in area for iso-storage. Moreover, for iso-area, R-SRAMs and R-MRAMs can achieve ∼ . × improvement in performance, given that the required data parallelism (neurons in a layer) is available. SPARE also provides the necessary programmability to execute a variety of synapse, neuron and plasticity models, thereby enabling designers to deploy SNNs based on the application requirements. SPARE thus underscores the applicability of ROM-embedded RAM based in-memory hardware primitives in efficient cognitive computing.

REFERENCES
[1] Y. Bengio et al., "Learning deep architectures for AI," Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[2] N. Jones, "The learning machines," Nature, vol. 505, no. 7482, p. 146, 2014.
[3] D. Silver et al., "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[4] P. U. Diehl et al., "Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing," in Neural Networks (IJCNN), 2015 International Joint Conference on, pp. 1–8, IEEE, 2015.
[5] A. Krizhevsky et al., "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, pp. 1097–1105, 2012.
[6] S. Han et al., "Learning both weights and connections for efficient neural network," in Advances in Neural Information Processing Systems, pp. 1135–1143, 2015.
[7] Y.-H. Chen et al., "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks," IEEE Journal of Solid-State Circuits, vol. 52, no. 1, pp. 127–138, 2017.
[8] M. Kang et al., "An energy-efficient VLSI architecture for pattern recognition via deep embedding of computation in SRAM," IEEE, May 2014.
[9] W. M. Snelgrove, M. Stumm, D. Elliott, R. McKenzie, and C. Cojocaru, "Computational RAM: Implementing processors in memory," IEEE Design & Test of Computers, vol. 16, pp. 32–41, 1999.
[10] J. Borghetti et al., "Memristive switches enable stateful logic operations via material implication," Nature, vol. 464, no. 7290, pp. 873–876, 2010.
[11] A. Ankit et al., "RESPARC," in Proceedings of the 54th Annual Design Automation Conference (DAC '17), ACM Press, 2017.
[12] D. Lee et al., "Area efficient ROM-embedded SRAM cache," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 9, pp. 1583–1595, 2013.
[13] X. Fong et al., "Embedding read-only memory in spin-transfer torque MRAM-based on-chip caches," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 24, no. 3, pp. 992–1002, 2016.
[14] C. Clopath et al., "Voltage and spike timing interact in STDP - a unified model," Spike-timing dependent plasticity, p. 294, July 2010.
[15] T. V. Bliss et al., "A synaptic model of memory: long-term potentiation in the hippocampus," Nature, vol. 361, no. 6407, p. 31, 1993.
[16] J. Harrison, T. Kubaska, and S. Story, "The computation of transcendental functions on the IA-64 architecture," Intel Technology Journal, vol. 4, pp. 234–251, 1999.
[17] F. Akopyan et al., "TrueNorth: Design and tool flow of a 65 mW 1 million neuron programmable neurosynaptic chip," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 10, pp. 1537–1557, 2015.
[18] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," IEEE, June 2016.
[19] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D. Hassabis, "Mastering the game of Go with deep neural networks and tree search," Nature, vol. 529, pp. 484–489, Jan. 2016.
[20] P. Dayan et al., Theoretical Neuroscience, vol. 806. Cambridge, MA: MIT Press, 2001.
[21] E. Izhikevich, "Simple model of spiking neurons," IEEE Transactions on Neural Networks, vol. 14, pp. 1569–1572, Nov. 2003.
[22] A. L. Hodgkin et al., "A quantitative description of membrane current and its application to conduction and excitation in nerve," The Journal of Physiology, vol. 117, pp. 500–544, Aug. 1952.
[23] P. U. Diehl and M. Cook, "Unsupervised learning of digit recognition using spike-timing-dependent plasticity," Frontiers in Computational Neuroscience, vol. 9, Aug. 2015.
[24] "Cacti 6.0: A tool to understand large caches."
[25] X. Dong, C. Xu, Y. Xie, and N. P. Jouppi, "NVSim: A circuit-level performance, energy, and area model for emerging nonvolatile memory," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, pp. 994–1007, July 2012.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[27] A. Krizhevsky, "Learning multiple layers of features from tiny images," tech. rep., 2009.
[28] B. Han, A. Ankit, A. Sengupta, and K. Roy, "Cross-layer design exploration for energy-quality tradeoffs in spiking and non-spiking deep artificial neural networks," IEEE Transactions on Multi-Scale Computing Systems, vol. PP, no. 99, pp. 1–1, 2017.
[29] D. Goodman, "Brian: a simulator for spiking neural networks in Python," Frontiers in Neuroinformatics, vol. 2, 2008.
[30] F. Sun, C. Wang, L. Gong, C. Xu, Y. Zhang, Y. Lu, X. Li, and X. Zhou, "A power-efficient accelerator for convolutional neural networks," pp. 631–632, Sept. 2017.
Amogh Agrawal received the B.Tech. degree in electrical engineering from the Indian Institute of Technology, Ropar, India, in 2016. He was a research intern at the University of Ulm, Germany, in 2015, under the DAAD (German Academic Exchange Service) Fellowship. He joined the Nanoelectronics Research Lab in 2016, and is currently pursuing his PhD degree at Purdue University under the guidance of Prof. Kaushik Roy. His primary research interests include the modeling and simulation of spin devices for applications in logic, memories and neuromorphic computing. He is also looking at digital and analog circuits for in-memory computing techniques using CMOS and beyond-CMOS memories. He was awarded the Director's Gold Medal for his all-round performance, and the University Silver Medal for his academic achievements at IIT Ropar. He is a recipient of the Andrews Fellowship from Purdue University.
Aayush Ankit received the B.Tech. degree from the Indian Institute of Technology (BHU), Varanasi, in 2015. He was a summer intern and Mitacs Globalink Fellow at the University of Alberta, Canada, in 2014. He has also worked as an intern at HPE Labs, Palo Alto, CA, and Intel Corporation, Hillsboro, OR, in 2017. Currently, he is pursuing a PhD degree in Electrical and Computer Engineering at Purdue University and has been a Research Assistant to Prof. Kaushik Roy since Fall 2015. His primary research interests lie in architecture and algorithm design for neuromorphic computing.

Kaushik Roy