A Case for Lifetime Reliability-Aware Neuromorphic Computing
Shihao Song and Anup Das
Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
Email: {shihao.song, anup.das}@drexel.edu

Abstract—Neuromorphic computing with non-volatile memory (NVM) can significantly improve performance and lower energy consumption of machine learning tasks implemented using spike-based computations and bio-inspired learning algorithms. High voltages required to operate certain NVMs such as phase-change memory (PCM) can accelerate aging in a neuron's CMOS circuit, thereby reducing the lifetime of neuromorphic hardware. In this work, we evaluate the long-term, i.e., lifetime reliability impact of executing state-of-the-art machine learning tasks on neuromorphic hardware, considering failure models such as negative bias temperature instability (NBTI) and time-dependent dielectric breakdown (TDDB). Based on this formulation, we show the reliability-performance trade-off obtained due to periodic relaxation of neuromorphic circuits, i.e., a stop-and-go style of neuromorphic computing.
Index Terms—Neuromorphic Computing, Non-Volatile Memory (NVM), Phase-Change Memory (PCM), NBTI, TDDB
I. INTRODUCTION
Spiking neural networks (SNNs) [1] are a machine learning technique designed using spike-based computation and bio-inspired learning algorithms [2]. Neuromorphic hardware such as DYNAP-SE [3], TrueNorth [4], and Loihi [5] can execute SNN-based machine learning tasks in an energy-efficient manner, thanks to low-power neuron circuits [6], distributed implementation of computing and storage as crossbars [7], and the integration of non-volatile memory (NVM) for synaptic storage [8], [9]. Several techniques have recently been proposed to map and execute SNNs on neuromorphic hardware [10]–[15]. These techniques mostly target performance (e.g., accuracy) and energy of neuromorphic computing. Unfortunately, neuromorphic hardware is prone to reliability issues such as limited programming endurance, read disturbance of NVM cells, and aging of CMOS-based neuron circuits [16]–[18]. In this work, we focus on circuit aging due to the negative bias temperature instability (NBTI) and time-dependent dielectric breakdown (TDDB) failure mechanisms [19]–[21].

Due to the high operating voltage requirement of NVM, CMOS devices in a neuron circuit are exposed to high-voltage-induced stress when propagating excitation (i.e., current) through an NVM synapse. This impacts the long-term, i.e., lifetime reliability of neuromorphic hardware. As memory process technology scales down to smaller dimensions, reliability issues are expected to be exacerbated for the following three reasons. First, the electric field and power density increase at scaled nodes, exceeding their corresponding maximum values for reliable operation. Second, increasing power density also leads to higher chip temperatures and, consequently, an even faster acceleration of the degradation mechanisms.
Third, new materials like high-k dielectrics and novel devices such as multi-gate field-effect transistors (FETs) that are commonly used for the neuron circuit in neuromorphic hardware have unknown reliability behavior and introduce new failure mechanisms at scaled nodes. In our recent work [22], we analyzed NBTI failure in neuromorphic computing. This work extends our earlier work in the following three directions. First, we consider other failure mechanisms such as TDDB and show the impact of system-level design decisions on circuit aging in neuromorphic hardware. Second, we consider aging in a neuron circuit, which drives current into a crossbar to read the synaptic weights stored in its NVM cells. Third, we show the performance-reliability trade-off of periodic relaxation of neuron excitations in neuromorphic hardware using state-of-the-art machine learning applications.

II. MODELING RELIABILITY OF CROSSBARS
A. NBTI Issues in Neuromorphic Computing
NBTI is a failure mechanism of the CMOS devices inside a neuron, in which positive charges are trapped at the oxide-semiconductor boundary underneath the gate of a CMOS device [23]. NBTI manifests as 1) a decrease in drain current and transconductance, and 2) an increase in off current and threshold voltage. The lifetime of a CMOS device is measured in terms of its mean time to failure (MTTF) as

MTTF_NBTI = A · V^(−γ) · e^(E_a / (K·T)),

where A and γ are material-related constants, E_a is the activation energy, K is the Boltzmann constant, T is the temperature, and V is the overdrive gate voltage of the CMOS device. Recent studies suggest that a portion of the threshold voltage shift can be recovered by annealing at high temperatures if the NBTI stress voltage is removed. Figure 1 illustrates the stress and recovery of the threshold voltage of a CMOS device in a neuron circuit due to the NBTI failure mechanism on application of a high (V_read = 1.8V) and a low (V_idle = 1.2V) voltage. We observe that both stress and recovery depend on the time of exposure to the corresponding voltage [24].

Fig. 1. Demonstration of degradation due to NBTI.
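The stress-and-recovery behavior in Fig. 1 can be sketched with a toy model; the rate constants below are illustrative assumptions, not fitted device parameters:

```python
# Toy model of NBTI stress and recovery on a neuron's CMOS device.
# k_stress and k_recovery are illustrative assumptions, not device data.

def vth_shift(dvth, voltage, dt, v_idle=1.2, k_stress=0.02, k_recovery=0.01):
    """Update the threshold-voltage shift dvth after dt ms at `voltage`."""
    if voltage > v_idle:
        # High read voltage: stress accumulates with exposure time.
        return dvth + k_stress * (voltage - v_idle) * dt
    # Stress voltage removed: part of the shift recovers (annealing).
    return dvth * (1.0 - k_recovery * dt)

dvth = 0.0
for _ in range(100):   # 50 ms of stress at V_read = 1.8 V
    dvth = vth_shift(dvth, 1.8, 0.5)
peak = dvth
for _ in range(100):   # 50 ms of relaxation at V_idle = 1.2 V
    dvth = vth_shift(dvth, 1.2, 0.5)
print(f"peak shift: {peak:.3f}, after recovery: {dvth:.3f}")
```

As in Fig. 1, the shift grows with exposure time under V_read and only partially recovers under V_idle.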
B. TDDB Issues in Neuromorphic Computing
TDDB is a failure mechanism in a CMOS device in which the gate oxide breaks down as a result of long-term application of a relatively low electric field (as opposed to immediate breakdown, which is caused by a strong electric field) [25]. The TDDB lifetime of a CMOS device is

MTTF_TDDB = A · e^(−γ·√V),

where A and γ are material-related constants, and V is the overdrive gate voltage of the CMOS device [26].

C. Circuit Aging in Neuromorphic Computing
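The two lifetime models above (MTTF_NBTI and MTTF_TDDB) can be evaluated with a short sketch; A, γ, and E_a are placeholder values, not calibrated to any real process:

```python
import math

# Illustrative evaluation of the two MTTF models; a, gamma, and ea are
# placeholder constants, not calibrated to a real process technology.
K_BOLTZMANN = 8.617e-5  # Boltzmann constant in eV/K

def mttf_nbti(v, temp, a=1.0, gamma=3.0, ea=0.5):
    """MTTF_NBTI = A * V^(-gamma) * exp(Ea / (K*T))."""
    return a * v ** (-gamma) * math.exp(ea / (K_BOLTZMANN * temp))

def mttf_tddb(v, a=1.0, gamma=10.0):
    """MTTF_TDDB = A * exp(-gamma * sqrt(V))."""
    return a * math.exp(-gamma * math.sqrt(v))

# Both models predict a shorter lifetime at higher overdrive voltage,
# and the NBTI model also predicts a shorter lifetime when hotter.
print(mttf_nbti(1.2, 300) > mttf_nbti(1.8, 300))  # True
print(mttf_nbti(1.8, 300) > mttf_nbti(1.8, 350))  # True
print(mttf_tddb(1.2) > mttf_tddb(1.8))            # True
```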
To illustrate the degradation caused by these failure mechanisms, we take the example of a single neuron of the LeNet convolutional neural network (CNN) [27] used for handwritten digit recognition and illustrate its spike times in Figure 2.

Fig. 2. (a) Spike times of a neuron in LeNet, (b) voltages needed to propagate these spikes through its fanout synapses, (c) NBTI degradation (in arbitrary units, log scale), and (d) TDDB degradation (in arbitrary units) of CMOS devices.

To de-stress a neuron, all CMOS devices in the neuron must be programmed with a voltage lower than the threshold voltage V_th, which forces them to operate in the sub-threshold region, relieving their stress. Once discharged, a neuron requires several clock cycles to boost its voltage back to the required level before it can safely be used to generate spikes again. This introduces performance overhead.

III. PERIODIC RELAXATION OF NEUROMORPHIC CIRCUITS
To improve the long-term, i.e., lifetime reliability of neuromorphic computing, we propose periodic relaxation of a neuromorphic architecture, where we de-stress all neurons in the hardware at fixed intervals. To compute the overhead due to such de-stress operations, we assume that the controller issues a de-stress command to a crossbar once every tDSI, which is known as the de-stress interval. Each de-stress operation completes within a time interval tDSC, known as the de-stress cycle time. Hence, the performance overhead (i.e., spike throughput loss) due to periodic de-stress is

de-stress overhead = tDSC / tDSI.   (1)

Figure 3 shows an example where four spikes (S1, S2, S3, and S4) are generated by a neuron. These spikes have some idle time between them. The neuron circuits are de-stressed after every tDSI, such that the aging due to NBTI and TDDB (indicated by A_NBTI and A_TDDB, respectively) remains lower than 1000 units. Using this approach, a de-stress operation is initiated upon generating S3, which increases the latency of S4 due to the non-zero latency of the de-stress operation (indicated by tDSC). An increase in spike latency can lead to information loss in SNNs and degrade the quality of response.
Fig. 3. Performance impact due to periodic relaxation.
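Equation 1 and the latency effect illustrated in Fig. 3 can be sketched as follows; the spike times and de-stress parameters are hypothetical:

```python
# Sketch of the de-stress overhead model (Equation 1) and its effect on
# spike latency; all times below are hypothetical, in milliseconds.

def destress_overhead(t_dsc, t_dsi):
    """Fraction of spike throughput lost to periodic de-stressing."""
    return t_dsc / t_dsi

def delayed_spike_times(spike_times, t_dsi, t_dsc):
    """Shift each spike by the de-stress cycles issued before it fires."""
    out = []
    for t in spike_times:
        n_destress = int(t // t_dsi)      # de-stress ops issued so far
        out.append(t + n_destress * t_dsc)
    return out

spikes = [2.0, 7.0, 12.0, 14.0]           # hypothetical S1..S4 times
print(destress_overhead(1.0, 10.0))       # 0.1 -> 10% throughput loss
print(delayed_spike_times(spikes, 10.0, 1.0))
```

With tDSI = 10 ms and tDSC = 1 ms, the spikes after the first de-stress interval (here, S3 and S4) arrive 1 ms late, mirroring the latency increase of S4 in Fig. 3.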
We introduce two key performance metrics of SNNs that are affected by periodic de-stressing of neuromorphic architectures: inter-spike interval (ISI) distortion and disorder spike count. These are defined as follows.

• Inter-spike interval distortion:
Performance of supervised machine learning is measured in terms of accuracy, which can be assessed from inter-spike intervals (ISIs) [28]. To define ISI, we let {t_1, t_2, · · · , t_K} be a neuron's firing times in the time interval [0, T]. The average ISI of this spike train is given by [28]:

I = Σ_{i=2}^{K} (t_i − t_{i−1}) / (K − 1).   (2)

• Disorder spike count:
This is defined for SNNs where information is encoded in terms of spike rate. We formulate spike disorder as follows. Let F^i = {F^i_1, · · · , F^i_{n_i}} be the expected spike arrival rates at neuron i and F̂^i = {F̂^i_1, · · · , F̂^i_{n_i}} be the actual spike rates considering de-stress latencies. The spike disorder is computed as

spike disorder = Σ_{j=1}^{n_i} (F^i_j − F̂^i_j)^2 / n_i.   (3)
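A minimal sketch of the two metrics above, using hypothetical spike data:

```python
# Sketch of the two SNN performance metrics: average inter-spike
# interval (Eq. 2) and spike disorder (Eq. 3). Spike data are made up.

def average_isi(times):
    """Mean inter-spike interval of a firing-time sequence t_1..t_K."""
    k = len(times)
    return sum(times[i] - times[i - 1] for i in range(1, k)) / (k - 1)

def spike_disorder(expected, actual):
    """Mean squared deviation between expected and actual spike rates."""
    n = len(expected)
    return sum((f - g) ** 2 for f, g in zip(expected, actual)) / n

print(average_isi([0.0, 2.0, 5.0, 9.0]))         # (2 + 3 + 4) / 3 = 3.0
print(spike_disorder([10.0, 20.0], [9.0, 22.0])) # (1 + 4) / 2 = 2.5
```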
IV. EVALUATION
We evaluate 10 standard machine learning applications, which are listed in Table I.
TABLE I
APPLICATIONS USED TO EVALUATE OUR APPROACH [10].
Class | Application            | Synapses  | Neurons | Topology                             | Accuracy
MLP   | EdgeDet                | 272,628   | 1,372   | FeedForward (4096, 1024, 1024, 1024) | 100%
MLP   | ImgSmooth              | 136,314   | 980     | FeedForward (4096, 1024)             | 100%
MLP   | MLP-MNIST              | 79,400    | 984     | FeedForward (784, 100, 10)           | 95.5%
CNN   | CNN-MNIST              | 159,553   | 5,576   | CNN                                  | 96.7%
CNN   | LeNet-MNIST            | 1,029,286 | 4,634   | CNN                                  | 99.1%
CNN   | LeNet-CIFAR            | 2,136,560 | 18,472  | CNN                                  | 84.0%
CNN   | HeartClass [29], [30]  | 2,396,521 | 24,732  | CNN                                  | 85.12%
RNN   | HeartEstm [31]         | 636,578   | 6,952   | Recurrent Reservoir                  | 99.2%
RNN   | SpeechRecog            | 636,578   | 6,952   | Recurrent Reservoir                  | 96.8%
RNN   | VisualPursuit          | 636,578   | 6,952   | Recurrent Reservoir                  | 89.0%
A. Reliability
Figures 4a and 4b plot, respectively, the NBTI and TDDB aging of the 10 machine learning applications when increasing tDSI from 10ms to 50ms. We make the following three key observations. First, both NBTI and TDDB aging increase with increase in tDSI. This is because a neuron accrues higher aging when its CMOS devices are kept active for a longer duration (i.e., for a higher tDSI). Second, the increase in aging is application dependent. For CNN-MNIST, increasing tDSI from 10ms to 50ms leads to a 50% increase in NBTI aging, compared to VisualPursuit, where the NBTI aging increases by 5x. This is because the number of spikes generated in CNN-MNIST is far fewer than in VisualPursuit, which leads to lower aging in neuron circuits. Therefore, the impact of increasing tDSI for CNN-MNIST is less significant compared to VisualPursuit. Third, compared to NBTI, the increase in TDDB aging is consistent across different applications for the same range of tDSI. This is due to the difference between the two mechanisms: NBTI-induced stress (e.g., V_th shift) recovers partially when the neuron is idle, whereas a CMOS device encounters low-voltage TDDB stress even when idle.

Fig. 4. (a) Normalized NBTI aging, and (b) normalized TDDB aging for tDSI of 10ms, 20ms, 30ms, 40ms, and 50ms.

B. Performance
Figures 5a and 5b plot, respectively, the ISI distortion and disorder spike count (DSC) of the 10 machine learning applications when increasing tDSI from 10ms to 50ms. We observe that both ISI distortion and DSC reduce with increase in tDSI. This reduction is due to the reduction of the de-stress overhead (Equation 1) with an increase in tDSI.

Fig. 5. (a) ISI distortion, and (b) disorder spike count for tDSI of 10ms, 20ms, 30ms, 40ms, and 50ms.

C. Thermal Impact
The results of Section IV-A are obtained at a nominal temperature of 300K. Prior works such as [32] show the impact of temperature on the reliability of conventional multiprocessor systems. Figure 6 shows the increase of aging with temperature. Average circuit aging at 325K and 350K is higher than that at 300K by an average of 7% and 26%, respectively.

Fig. 6. Average circuit aging at 325K and 350K normalized to aging at 300K.
V. CONCLUSION
We evaluate circuit aging in the neurons of neuromorphic architectures considering the NBTI and TDDB failure mechanisms. We then propose a simple approach to improve reliability by periodically de-stressing the neurons. This introduces latency, which degrades key performance metrics such as inter-spike interval and disorder spike count, which correlate directly with the performance of machine learning models. We evaluate reliability-performance trade-offs for 10 state-of-the-art machine learning applications. We conclude that the proposed work will enable intelligent reliability optimization strategies in neuromorphic computing.

ACKNOWLEDGMENT
This work is supported by the National Science Foundation Faculty Early Career Development Award CCF-1942697 (CAREER: Facilitating Dependable Neuromorphic Computing: Vision, Architecture, and Impact on Programmability).

REFERENCES
[1] W. Maass, "Networks of spiking neurons: the third generation of neural network models," Neural Networks, 1997.
[2] Y. Dan et al., "Spike timing-dependent plasticity of neural circuits," Neuron, 2004.
[3] S. Moradi et al., "A scalable multicore architecture with heterogeneous memory structures for dynamic neuromorphic asynchronous processors (DYNAPs)," TBCAS, 2017.
[4] M. V. DeBole et al., "TrueNorth: Accelerating from zero to 64 million neurons in 10 years," Computer, 2019.
[5] M. Davies et al., "Loihi: A neuromorphic manycore processor with on-chip learning," IEEE Micro, 2018.
[6] G. Indiveri, "A low-power adaptive integrate-and-fire neuron circuit," in ISCAS, 2003.
[7] P. Merolla et al., "A digital neurosynaptic core using embedded crossbar memory with 45pJ per spike in 45nm," in CICC, 2011.
[8] G. W. Burr et al., "Neuromorphic computing using non-volatile memory," Advances in Physics: X, 2017.
[9] A. Mallik et al., "Design-technology co-optimization for OxRRAM-based synaptic processing unit," in VLSIT, 2017.
[10] S. Song et al., "Compiling spiking neural networks to neuromorphic hardware," in LCTES, 2020.
[11] A. Balaji et al., "PyCARL: A PyNN interface for hardware-software co-simulation of spiking neural network," in IJCNN, 2020.
[12] A. Balaji et al., "Mapping spiking neural networks to neuromorphic hardware," TVLSI, 2019.
[13] A. Das et al., "Mapping of local and global synapses on spiking neuromorphic hardware," in DATE, 2018.
[14] A. Das et al., "Dataflow-based mapping of spiking neural networks on neuromorphic hardware," in GLSVLSI, 2018.
[15] A. Balaji et al., "Run-time mapping of spiking neural networks to neuromorphic hardware," JSPS, 2020.
[16] P.-Y. Chen et al., "Reliability perspective of resistive synaptic devices on the neuromorphic system performance," in IRPS, 2018.
[17] B. Gleixner et al., "Reliability characterization of phase change memory," in NVMTS, 2009.
[18] A. Pirovano et al., "Reliability study of phase-change nonvolatile memories," TDMR, 2004.
[19] C. Hu, "Future CMOS scaling and reliability," Proc. of the IEEE, 1993.
[20] A. Das et al., "Aging-aware hardware-software task partitioning for reliable reconfigurable multiprocessor systems," in Compilers, Architecture and Synthesis for Embedded Systems (CASES), 2013.
[21] S. Song et al., "Exploiting inter- and intra-memory asymmetries for data mapping in hybrid tiered-memories," in ISMM, 2020.
[22] A. Balaji et al., "A framework to explore workload-specific performance and lifetime trade-offs in neuromorphic computing," CAL, 2019.
[23] R. Gao et al., "NBTI-generated defects in nanoscaled devices: fast characterization methodology and modeling," TED, 2017.
[24] S. Song et al., "Improving dependability of neuromorphic computing with non-volatile memory," in EDCC, 2020.
[25] P. Roussel et al., "New methodology for modelling MOL TDDB coping with variability," in IRPS, 2018.
[26] A. Das et al., "Communication and migration energy aware task mapping for reliable multiprocessor systems," FGCS, 2014.
[27] Y. LeCun et al., "LeNet-5, convolutional neural networks," 2015.
[28] S. Grün et al., Analysis of parallel spike trains. Springer, 2010.
[29] A. Balaji et al., "Power-accuracy trade-offs for heartbeat classification on neural networks hardware," JOLPE, 2018.
[30] A. Das et al., "Heartbeat classification in wearables using multi-layer perceptron and time-freq joint distribution of ECG," in CHASE, 2019.
[31] A. Das et al., "Unsupervised heart-rate estimation in wearables with Liquid states and a probabilistic readout," Neural Networks, 2018.
[32] A. Das et al., "Reliability and energy-aware mapping and scheduling of multimedia applications on multiprocessor systems,"