SHE-MTJ Circuits for Convolutional Neural Networks
Andrew W. Stephan and Steven J. Koester, Fellow, IEEE
Abstract — We report the performance characteristics of a notional convolutional neural network based on the previously-proposed Multiply-Accumulate-Activate-Pool (MAAP) set, an MTJ-based spintronic circuit made to compute multiple neural functionalities in parallel. A study of image classification on the MNIST handwritten digits dataset using this network is provided via simulation. The effects of the weight representation precision, the severity of device process variation within the MAAP sets, and the computational redundancy are examined. The emulated network achieves between 90 and 95% image classification accuracy at a cost on the order of 100 nJ per image.
Index Terms — Neuromorphic Computing, Convolutional Neural Network, Spintronics, Spin Hall, Magnetic Tunnel Junction.
I. INTRODUCTION
Convolutional neural networks (CNNs) are a powerful tool for beyond-Boolean computing tasks such as data classification, whether the data be text, audio or visual [1], [2]. Their complexity makes in-hardware implementations difficult and costly beyond that of the simpler fully-connected network. We propose to use the Multiply-Accumulate-Activate-Pool (MAAP) sets described in [3] to reduce the complexity of convolutional neural network implementation. The MAAP sets limit the number of unique operations required by the CNN by condensing the convolution, activation and pooling operations into one circuit. In so doing, the number of required peripheral operations, such as memory accesses, is also reduced compared to other hardware-based implementations that incorporate all CNN functions individually [2].
II. BACKGROUND
A. Convolutional Neural Networks
Typical CNNs consist of one or more sequences of convolution, activation and pooling layers followed by one or more fully-connected (FC) layers, as shown in Fig. 1. Each neuron in the convolutional layers applies a certain weight template to a subset of values in the input space, with neighboring neurons applying the same template to neighboring, possibly overlapping, subsets. Since multiple values in the input contribute to a single value in the convolution layer, this results in down-sampling of the image size. Each template corresponds to one full set of neurons, or one convolutional image map. Activation layers pass each value contained in the convolutional layers through some nonlinear activation function in a 1-to-1 fashion. The rectified linear unit (ReLU)

R(x) = { 0 for x < 0
       { x for x ≥ 0        (1)

is commonly used for this purpose. In max-pooling, each neuron chooses the maximum value from its unique subset of the input space. This further down-samples the data and also introduces some translation-invariance. The fully-connected layer consists of a one-dimensional vector of neurons, each of which takes a weighted sum of all values in the previous layer. The convolution, activation and pooling layers prior to the final fully-connected layer comprise a significant portion of the computational cost of a CNN.
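As an illustration of these three layer types, the following minimal NumPy sketch chains one convolution, the ReLU of Eq. (1), and non-overlapping max-pooling. The 28×28 input, 5×5 kernel and 2×2 pool sizes are illustrative assumptions, not the exact hyperparameters of the network in Fig. 1.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Stride-1, no-padding convolution: each output value is the weighted
    sum of one input subset, so the output map is smaller than the input."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def relu(x):
    """Eq. (1): R(x) = 0 for x < 0 and x for x >= 0."""
    return np.maximum(x, 0.0)

def max_pool(x, p=2):
    """Non-overlapping p-by-p max-pooling: keep the maximum of each subset."""
    H, W = x.shape
    return x[:H - H % p, :W - W % p].reshape(H // p, p, W // p, p).max(axis=(1, 3))

image = np.random.rand(28, 28)    # stand-in for one MNIST digit
kernel = np.random.randn(5, 5)    # one trained weight template
feature_map = max_pool(relu(conv2d_valid(image, kernel)))
print(feature_map.shape)          # (12, 12): down-sampled twice
```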
B. MAAP Sets

The spin-torque-controlled magnetic tunnel junction (MTJ) is a well-known basic element of spintronic computing [4]-[8]. Depending on the circuit layout, geometry and specific application of spin torque, these versatile spin-MTJs can be used as analog or digital programmable synapse memristors, spiking neurons or artificial neurons. In [3], a useful application of MTJ cells manipulated via the spin-Hall effect (SHE) is proposed. Utilizing in-plane fieldlike spin-torque along the hard axis of the free layer (FL), a linear hysteresis loop is produced [9]. With the appropriate choice of circuit parameters, a voltage divider composed of one SHE-MTJ cell and a reference resistor, with an inverter to read the output, can produce a linear output, with saturation, as a function of the charge current passing through the SHE layer. This structure is effectively a three-terminal device with input, output and constant terminals, in which the potential across the output and constant terminals depends on the charge current injected from the input terminal to the constant terminal. Crossbar arrays are commonly used in neuromorphics to perform the multiply-and-accumulate operation by summing up parallel currents, each of which represents a single product of a voltage and a conductance value.
Fig. 1. The CNN structure used for training in TensorFlow. The network consists of two convolutional layers with their subsequent activation and pooling layers, followed by one final fully-connected layer. Each convolution contains four kernels.

Fig. 2. The decomposition of the CNN in Fig. 1 to a MAAP-CNN. Each convolution-activation-pooling layer sequence has been replaced by a matching group of MAAP sets which perform all three functions.

In order for the sum to be correct, leakage between the parallel lines must be minimized by holding the bottom potential constant. In order to inject this current sum as an input to a device with low error, the device must have very low input impedance, so that the floating potential on the input terminal is very close to the value on the constant terminal regardless of the actual input. The SHE-MTJ voltage divider stack uniquely accomplishes this by using a low-resistance SHE layer to read the charge current and transform it into a spin signal without significantly disturbing the input potential. An equivalent circuit based entirely on charge signals would require additional amplifiers, at greater cost, to maintain the input terminal at a constant potential. A ReLU activation pair consists of two concatenated SHE-MTJ cells and readout inverters with additional bias current sources. Several such activation pairs may be organized into a winner-take-all circuit that simultaneously selects the maximum of a set of input values and computes the ReLU activation on that maximum. This circuit was shown to compute efficiently while also being robust to both thermal and process variation, including MTJ resistance state variation, critical-current variation and transistor threshold variation [3]. This makes it a good candidate for a spintronic CNN accelerator.
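The following is a functional sketch, not a device-level model, of the two behaviors just described: a linear-with-saturation transfer curve standing in for the SHE-MTJ voltage divider and readout inverter, and a winner-take-all stage that outputs the ReLU of the maximum of its inputs. The gain and saturation current are arbitrary illustrative values, not parameters from [3].

```python
import numpy as np

I_SAT = 50e-6   # assumed saturation current (A), illustrative only
GAIN = 1e4      # assumed output volts per ampere of injected current

def she_mtj_transfer(i_in):
    """Linear output, with saturation, vs. the injected charge current."""
    return GAIN * np.clip(i_in, -I_SAT, I_SAT)

def wta_relu(currents):
    """Winner-take-all over one pooling window combined with ReLU:
    select the maximum input, rectify it, and read it out."""
    winner = max(currents)
    return she_mtj_transfer(max(winner, 0.0))

window = [12e-6, -3e-6, 30e-6, 7e-6]   # weighted-sum currents into one pool
print(wta_relu(window))                 # ~0.3 V: ReLU(max) through the transfer
```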
III. MAAP-CNNS

We applied the MAAP set concept to the problem of classifying the MNIST handwritten digits dataset. The HSPICE/Matlab simulator used in [3], [10] was used to generate a simplified input-to-output simulation module, vastly decreasing the required compute time. This module is used in Matlab to emulate the CNN in Fig. 1, as shown in Fig. 2, using weight templates trained in TensorFlow.
Fig. 3. Block diagram of MAAP set and memory. The weights and inputs are digitally stored in SRAM. Each MAAP output is converted via ADC and similarly stored. Weight multiplication is accomplished digitally before converting back to analog current via a crossbar-like array of OTAs.

To store intermediate results in the course of CNN processing, the MAAP data is sent to analog-to-digital converters (ADCs) and stored in static RAM (SRAM). The convolutional templates are also digitally stored in SRAM. These quantities are digitally multiplied, requiring approximately B² gates for B bits, and supplied to B operational transconductance amplifier (OTA) based current sources that provide the weighted input to the MAAP sets, in a fashion similar to the OTA usage in [3], [11], [12]. The low input impedance of the MAAP set circuit makes summing up parallel inputs in a crossbar-like manner quite accurate [3]. A block diagram of this layout is shown in Fig. 3. We note that with sufficiently large memory, only one MAAP set is needed to emulate an entire CNN. However, since the processing time for a single MAAP operation is on the order of 1 ns [3], it is much more time-efficient to complete the ∼3904 MAAP operations per image using multiple sets in parallel.
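A minimal sketch of this data path, assuming simple uniform B-bit quantization: weights and inputs stored in SRAM are digitally multiplied, summed as parallel analog currents through the OTA array (idealized here as an exact sum), activated, and re-digitized by the ADC for storage. The full-scale range and the ReLU-like activation are stand-in assumptions, not extracted circuit parameters.

```python
import numpy as np

B = 4                      # bits of representation precision
FULL_SCALE = 1.0           # assumed full-scale signal magnitude
STEPS = 2**(B - 1) - 1     # quantization steps per polarity

def quantize(x):
    """Uniform B-bit quantization, as for SRAM-stored values; values beyond
    full scale clip, mimicking saturation."""
    return np.round(np.clip(x, -FULL_SCALE, FULL_SCALE) * STEPS) / STEPS

def maap_op(weights, inputs):
    """One emulated MAAP operation on quantized operands."""
    products = quantize(weights) * quantize(inputs)  # digital multiplies
    i_sum = products.sum()                           # parallel current summation
    activated = max(i_sum, 0.0)                      # ReLU-like MAAP output
    return quantize(activated)                       # ADC -> B bits -> SRAM

w = np.random.uniform(-1, 1, 25)   # one 5x5 weight template, flattened
x = np.random.uniform(0, 1, 25)    # one input subset, flattened
print(maap_op(w, x))
```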
IV. RESULTS AND DISCUSSION
In this section we report the relationships between the image classification accuracy, the redundancy factor R, the bit representation precision B, the energy and the process variation. In the figures referenced herein, the voltage deviation term refers to transistor threshold deviation. The other process variation terms, related to the MTJs, are assumed to follow constant Gaussian distributions, with values drawn from those distributions for each device in each iteration. The width of the Gaussian threshold distribution is a variable in some figures.
Fig. 4. (a) Energy consumed in the MAAP processing and memory operations vs. the number of bits used for each value. This data assumes ideal devices. (b) Total energy for non-ideal devices with different levels of transistor threshold deviation and constant FM parameter deviation. The dashed line indicates ideal device energy. The change is very small.

Fig. 5. (a) Total energy vs. the weight precision for different levels of threshold deviation. (b) Total energy vs. MAAP set redundancy for different levels of threshold deviation.
A. Energy
Apart from peripheral circuitry, the energy dissipation is largely independent of the number of physical MAAP circuits, depending instead on the number of MAAP and memory operations to be computed. The total energy dissipated during the processing of one image is

E = N_M·E_M + N_MEM·E_MEM + N_A·E_A + N_MUL·E_MUL,    (2)

where N_M, N_MEM, N_A and N_MUL are the number of MAAP, digital memory, ADC and digital multiply operations, respectively.
TABLE I
OPERATION COUNT
Operation   Energy/Op       Iterations
MAAP        0.9–7.2 pJ      3904 · R
Memory      41.8 fJ         85328 · B
ADC         1 pJ            3904 · B
Multiply    125.4 fJ        3088 · B

Quantity of each operation needed to classify an image. B and R are the number of bits used for representation and the MAAP operation redundancy, respectively.

Fig. 6. (a) Classification accuracy on MNIST handwritten digit images vs. the weight representation precision. This plot assumes non-ideal devices with 15 mV of transistor threshold deviation and R = 10. (b) Classification accuracy vs. MAAP set redundancy using non-ideal devices, with weight representation precision as a parameter. We note that at maximum redundancy the deleterious effect of process variation is negligible.

Fig. 7. Classification accuracy vs. transistor threshold deviation with redundancy R as a parameter. All devices incorporate MTJ variation at a constant level consistent with the deviations in [3]. Only the transistor threshold deviation severity is varied. We note that the accuracy appears not to vary with the particular amount of threshold deviation once any notable level of device variation is introduced.

Fig. 8. Accuracy vs. energy for 4-bit, 8-bit and 32-bit networks. Negligible incremental benefit is granted for moving from 4 to 8 bits, but at 32 bits the final accuracy is increased by about 5%.

These quantities are given in Table I, with the values ultimately deriving from the dimensions of the MAAP set arrays shown in Fig. 2. The energy terms E_M, E_MEM, E_A and E_MUL are the costs of a single instance of each operation. The MAAP operation cost E_M accounts for leakage across the MTJ stacks, as calculated in [3] assuming a TMR of 1.5 and x nm MTJs, as well as the static and dynamic dissipation of the OTAs [12]. E_MEM is based on 16 nm node transistor data for 6T SRAM [13]. E_MUL is estimated with the same transistor data, assuming B² gates per B-bit multiplication. The ADC energy is taken from the Stanford ADC Survey, with a power to Nyquist sampling frequency ratio of 1 pJ [14]. We note that the majority of the total energy is taken up by the digital processing (see Fig. 4). The energy usage depends upon the level of precision used to encode the weights and store the MAAP outputs. The assumed level of process variation that needs to be corrected and the MAAP circuit runtime have some effect as well [3]. Fig. 4(a) compares the MAAP and memory operational costs for 4-bit, 8-bit and 32-bit networks, assuming 0.6 ns runtime per MAAP set. The memory and digital operations dominate the MAAP processing cost. Fig. 4(b) shows the rise in total energy required to correct the error in the mean MAAP output due to process variation, assuming a 4-bit network. The dashed line assumes no variation, while the data points all correspond to a set level of MTJ variation with standard deviations matching those reported in [3], [15]-[17]. The standard deviation of the transistor threshold potentials is varied across multiple values. Although the process-variation-induced mean error is easily dealt with, the process variation introduces a significant deviation in the error as well. In order to reduce this effect, redundant sampling is used. Each MAAP set output is measured R times with the same inputs and the final stored value is the average of the measurements. The variation in the expected mean of R redundant measurements is less than the variation of a single measurement by a factor of √R.
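The √R claim is easy to verify numerically. The Monte Carlo sketch below averages R independent noisy readings per stored value; the noise width sigma is an arbitrary stand-in for the variation-induced output deviation, not a fitted device parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value, sigma, trials = 1.0, 0.1, 100_000

for R in (1, 5, 10):
    # each trial: average R redundant measurements from distinct MAAP sets
    samples = true_value + sigma * rng.standard_normal((trials, R))
    stored = samples.mean(axis=1)
    print(f"R={R:2d}: std of stored value = {stored.std():.4f} "
          f"(predicted sigma/sqrt(R) = {sigma / np.sqrt(R):.4f})")
```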
Multiplying the number of MAAP operations by the redundancy factor R comes at the cost of additional energy for MAAP sampling, and requires at least R physically distinct MAAP sets, as the different measurements must be performed on different devices to ensure a unique random sampling each time. Fig. 5 shows the increase in total energy caused by operation redundancy.
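As a rough worked example, Eq. (2) can be evaluated directly from the operation counts and per-operation costs in Table I. The per-MAAP energy below is an assumed mid-range 2 pJ from the quoted 0.9–7.2 pJ span, so the totals are indicative rather than exact reproductions of the figures.

```python
# Per-operation energies from Table I (J/op); E_MAAP is an assumed midpoint.
E_MAAP, E_MEM, E_ADC, E_MUL = 2e-12, 41.8e-15, 1e-12, 125.4e-15

def energy_per_image(B, R):
    """Total energy (J) to classify one image, per Eq. (2) and Table I."""
    return (3904 * R * E_MAAP       # MAAP operations
            + 85328 * B * E_MEM     # SRAM accesses
            + 3904 * B * E_ADC      # ADC conversions
            + 3088 * B * E_MUL)     # digital multiplies

for B, R in [(4, 1), (4, 5), (8, 5)]:
    print(f"B={B}, R={R}: {energy_per_image(B, R) * 1e9:6.1f} nJ/image")
```

For B = 4 and R = 5 this lands near the ~80 nJ per image reported below, with the exact total depending on where in the 0.9–7.2 pJ range the MAAP cost falls.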
B. Accuracy

In Fig. 6 we display the results pertaining to image classification accuracy. The values shown are statistical estimates. Each simulation included 100 test images, and the values shown in the figure are the mean result of five or more simulations, with error bars included. Fig. 6(a) indicates the accuracy vs. B using non-ideal simulated devices with redundancy R = 10 to indicate the upper bound on performance. As the representation precision grows, the accuracy increases from an average of about 90% to 95%. Fig. 6(b) shows the accuracy vs. R. With R = 1, the high error deviation introduced by process variation significantly lowers accuracy; however, with a modest R = 5 the accuracy is almost entirely recovered. At R = 10 the results are indistinguishable from those of ideal devices. Surprisingly, the particular level of transistor threshold deviation appears insignificant to the accuracy, as indicated in Fig. 7. The results do not significantly differ between devices with 0 mV or 20 mV of threshold deviation. The existence of any significant process variation, in the MTJs if not the transistors, is sufficient to necessitate some redundancy. However, increasing the level of variation has little to no incremental effect, although we surmise that extreme amounts beyond what was tested would cause further noticeable deterioration of the accuracy.

Finally, Fig. 8 indicates the classification accuracy vs. the energy cost for three different bit precision levels. Using a 4-bit network with R = 5 is sufficient to reach nearly 90% accuracy at a cost of about 80 nJ per image. The incremental cost of increasing accuracy to 95% is quite large, requiring 32-bit precision and R = 10 at a cost of about 1000 nJ per image.

C. Discussion
This work demonstrates that spintronic circuits based on the increasingly well-understood MTJ neuron model can be used to effectively implement the complex functions involved in CNNs and accurately classify images. The MAAP architecture condenses the many different CNN layers into fewer, more cohesive sets of calculations. The modularity of the MAAP set model also detaches the physical circuitry from the number of conceptual neurons involved in the CNN model. Depending on R and B, the energy required to process an image varies between about 80 and 1000 nJ, comparable to the networks reviewed in [2]. At a maximum accuracy of between 90 and 95%, the performance is also comparable. We also note that, assuming sufficient devices to compute an entire sequence of layer operations in parallel, the fully charge-based network in [2] requires between 101.5 and 139 ns to calculate the output of a single convolution-ReLU-pooling sequence. By reducing the number of times the network must pass its intermediary outputs through the ADC to memory and back, we achieve a significant speed-up; each MAAP set operation takes on the order of 1 ns, saving a great deal of time if we assume identical delays in the ADC and digital processing peripherals between the two networks. Finally, comparing the number of MAAP operations, N_M = 3904, to the number of CeNN operations N_C in an equivalent network in [2], [18] shows a significant reduction in complexity, even with redundancy. We also note that condensing the neuromorphic layer operations reduces the number of ADC and memory operations required to store and access intermediate data, compared to a system which explicitly computes each operation, especially one which computes the operations via multiple sub-steps that themselves may require digital processing. This should yield additional savings. We hope this work will help spark continued interest in and study of spintronic materials and circuits to unlock their great potential.

REFERENCES
[1] Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, "Gradient-Based Learning Applied to Document Recognition," Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998, DOI: 10.1109/5.726791.
[2] Q. Lou, C. Pan, J. McGuinness, A. Horvath, A. Naeemi and X. S. Hu, "A Mixed Signal Architecture for Convolutional Neural Networks," ACM J. Emerg. Technol. Comput. Syst., vol. 15, no. 2, pp. 19:1–26, Mar. 2019, DOI: 10.1145/3304110.
[3] A. W. Stephan and S. J. Koester, "Spin Hall MTJ Devices for Advanced Neuromorphic Functions," IEEE Trans. Electron Devices, vol. 67, no. 2, pp. 487–492, Jan. 2020, DOI: 10.1109/TED.2019.2959732.
[4] W. Kang, Z. Wang, Y. Zhang, J.-O. Klein, W. Lv and W. Zhao, "Spintronic Logic Design Methodology Based on Spin Hall Effect-Driven Magnetic Tunnel Junctions," J. Phys. D: Appl. Phys., vol. 49, pp. 065008-1–11, Jan. 2016, DOI: 10.1088/0022-3727/49/6/065008.
[5] A. Sengupta and K. Roy, "Encoding Neural and Synaptic Functionalities in Electron Spin: A Pathway to Efficient Neuromorphic Computing," Appl. Phys. Rev., vol. 4, no. 4, pp. 041105-1–25, Dec. 2017, DOI: 10.1063/1.5012763.
[6] D. Morris, D. Bromberg, J.-G. Zhu and L. Pileggi, "mLogic: Ultra-Low Voltage Non-Volatile Logic Circuits Using STT-MTJ Devices," in Proc. 49th DAC, pp. 486–491, Jun. 2012, DOI: 10.1145/2228360.2228446.
[7] C. Pan and A. Naeemi, "A Proposal for Energy-Efficient Cellular Neural Network Based on Spintronic Devices," IEEE Trans. Nanotechnol., vol. 15, pp. 820–827, Sept. 2016, DOI: 10.1109/TNANO.2016.2598147.
[8] C. Pan and A. Naeemi, "Non-Boolean Computing Benchmarking for Beyond-CMOS Devices Based on Cellular Neural Network," IEEE J. Explor. Solid-State Computat. Devices Circuits, vol. 2, pp. 36–43, Nov. 2016, DOI: 10.1109/JXCDC.2016.2633251.
[9] H. B. M. Saidaoui and A. Manchon, "Spin-Swapping Transport and Torques in Ultrathin Magnetic Bilayers," Phys. Rev. Lett., vol. 117, no. 3, pp. 036601-1–5, Jul. 2016, DOI: 10.1103/PhysRevLett.117.036601.
[10] Predictive Technology Model. Accessed: Jul. 25, 2017. [Online]. Available: http://ptm.asu.edu
[11] Q. Lou, C. Pan, J. McGuinness, A. Horvath, A. Naeemi, M. Niemier and X. S. Hu, "A Mixed Signal Architecture for Convolutional Neural Networks," ACM J. Emerg. Technol. Comput. Syst., vol. 15, no. 2, pp. 19:1–26, Mar. 2019, DOI: 10.1145/3304110.
[12] Q. Lou, I. Palit, A. Horvath, X. S. Hu, M. Niemier and J. Nahas, "TFET-based Operational Transconductance Amplifier Design for CNN Systems," in Proc. 25th Great Lakes Symp. VLSI, May 2015, pp. 277–282, DOI: 10.1145/2742060.2742089.
[13] A. Pushkarna, S. Raghavan and H. Mahmoodi, "Comparison of Performance Parameters of SRAM Designs in 16nm CMOS and CNTFET Technologies," in Proc. IEEE Int. SOC Conf., pp. 339–342, Sept. 2010, DOI: 10.1109/SOCC.2010.5784690.
[14] "Stanford ADC Survey," [Online]. Available: web.stanford.edu/~murmann/publications/ADCsurvey_rev20190802.xls. [Accessed: 19 Dec. 2019].
[15] W. Kang, L. Zhang, J.-O. Klein, Y. Zhang, D. Ravelosona and W. Zhao, "Reconfigurable Codesign of STT-MRAM Under Process Variations in Deeply Scaled Technology," IEEE Trans. Electron Devices, vol. 62, no. 6, pp. 1769–1777, Jun. 2015, DOI: 10.1109/TED.2015.2412960.
[16] P. Wang, E. Eken, W. Zhang, R. Joshi, R. Kanj and Y. Chen, "A Thermal and Process Variation Aware MTJ Switching Model and its Applications in Soft Error Analysis," in More than Moore Technologies for Next Generation Computer Design, Springer New York, 2015, ch. 5, pp. 101–125.
[17] M. D. Giles, N. Arkali Radhakrishna, D. Becher, A. Kornfeld, K. Maurice, S. Mudanai, S. Natarajan, P. Newman, P. Packan and T. Rakshit, "High Sigma Measurement of Random Threshold Voltage Variation in 14nm Logic FinFET Technology," in Proc. Symp. VLSI Technology, pp. T150–T151, Aug. 2015, DOI: 10.1109/VLSIT.2015.7223657.
[18] A. W. Stephan, Q. Lou, M. T. Niemier, X. S. Hu and S. J. Koester, "Nonvolatile Spintronic Memory Cells for Neural Networks,"