DNN-Life: An Energy-Efficient Aging Mitigation Framework for Improving the Lifetime of On-Chip Weight Memories in Deep Neural Network Hardware Architectures

Abstract

Negative Biased Temperature Instability (NBTI)-induced aging is one of the critical reliability threats in nano-scale devices. This paper makes the first attempt to study the NBTI aging in the on-chip weight memories of deep neural network (DNN) hardware accelerators, subjected to complex DNN workloads. We propose DNN-Life, a specialized aging analysis and mitigation framework for DNNs, which jointly exploits hardware- and software-level knowledge to improve the lifetime of a DNN weight memory with reduced energy overhead. At the software-level, we analyze the effects of different DNN quantization methods on the distribution of the bits of weight values. Based on the insights gained from this analysis, we propose a micro-architecture that employs low-cost memory-write (and read) transducers to achieve an optimal duty-cycle at run time in the weight memory cells, thereby balancing their aging. As a result, our DNN-Life framework enables efficient aging mitigation of weight memory of the given DNN hardware at minimal energy overhead during the inference process.


To appear at the Design, Automation, and Test in Europe (DATE 2021)


Muhammad Abdullah Hanif, Muhammad Shafique
Faculty of Informatics, Technische Universität Wien (TU Wien), Vienna, Austria
Division of Engineering, New York University Abu Dhabi (NYUAD), Abu Dhabi, United Arab Emirates


I. INTRODUCTION

DNN accelerators have already become an essential part of various machine learning systems [1] [2]. DNNs usually require a large number of parameters to offer high accuracy, which comes at the cost of high memory requirements; see Fig. 1a. Dedicated memory hierarchies are designed to trade off between the low-cost storage offered by the off-chip DRAMs and the energy-/performance-efficient access offered by the on-chip SRAMs [1]; see Fig. 1b for access energy. This has led to an increasing trend towards the use of larger on-chip memory in the state-of-the-art DNN accelerators [3] [4], with the recent wafer-scale chips having up to 18 GB of on-chip memory [5]. However, due to continuous technology scaling, the on-chip SRAM-based memories are becoming increasingly vulnerable to different reliability threats, for example, soft errors and aging [6] [7] [8]. Studies have shown that even a single fault in the weights of critical neurons can result in significant degradation of application-level accuracy [9]. State-of-the-art works have focused on analyzing and mitigating the effects of faults in DNN accelerators w.r.t. DNN accuracy [10].

However, to the best of our knowledge, no prior work has analyzed and optimized the aging of the on-chip weight memories of DNN accelerators, especially when considering diverse dataflows of different DNNs and the impact of different types of quantization on the weight distributions.

Aging due to NBTI:

In PMOS transistors, when a negative gate-to-source voltage is applied, it can break down the Si-H bonds at the oxide interface, thereby causing a gradual increase in the threshold voltage (V_th) over the device lifetime, which results in poor drive current and a reduction in the noise margin [11]. To overcome this V_th shift, the operating frequency of the device has to be reduced by more than 20% over its entire lifetime [12]. However, due to strict performance and energy constraints (specifically for embedded applications), the V_th shift cannot be addressed just by design-time delay margins or adaptive operating frequency adjustments [13], as this leads to a significant loss in the system’s performance and energy efficiency. Therefore, in traditional computing systems, alternate opportunities have to be exploited to overcome this challenge [12]. One such opportunity lies in the fact that the NBTI aging phenomenon is partially reversed by removing the stress.

Footnote: A similar phenomenon called PBTI happens in NMOS transistors, though NBTI has been considered relatively more serious compared to PBTI [6].

Fig. 1: (a) Accuracy (Top-1 and Top-5) and size [MB] comparison of a few of the state-of-the-art DNNs (AlexNet, GoogleNet, VGG-16, ResNet-152); (b) access energy [pJ, log scale] comparison of SRAM with DRAM (data source: [1]).

Fig. 2: (a) A 6T-SRAM cell, in which the PMOS transistors are mainly affected by NBTI aging; and (b) its SNM degradation after 7 years as a function of the percentage of time that the cell stores zero [15]. The NBTI effect is minimal at the balanced point because the NBTI stress is equally distributed between the two PMOS transistors of the SRAM cell.

NBTI Aging of On-chip Memories:

On-chip memories are typically built using 6T-SRAM cells to achieve high area and power efficiency. A 6T-cell is composed of two inverters coupled with two access transistors (see Fig. 2a). The inverters store complementary values to store a single bit. Each inverter has a PMOS transistor and an NMOS transistor. Depending on whether the cell is storing ‘0’ or ‘1’, one of the PMOS transistors is always under stress, namely the one that is on. As the aging of a cell is defined by its most-aged transistor, the lowest aging is achieved when both the PMOS transistors receive on average the same amount of stress over the entire lifetime of the device, i.e., the percentage of the entire lifetime for which the cell stores a ‘1’ (duty-cycle) is 50%, as shown in Fig. 2b. Note that NBTI aging strongly depends on average long-term stress and weakly on short-term statistics [14].

Therefore, the key challenge in aging mitigation of on-chip memories is to balance their duty-cycle over the entire lifetime without affecting system-level performance.
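The duty-cycle notion above can be made concrete with a minimal Python sketch. It assumes that every stored value occupies the cell for an equally long time slot, and it uses the larger of the two stress fractions as a simple proxy for the most-aged PMOS transistor; the function names are illustrative and do not come from the paper.

```python
def duty_cycle(stored_bits):
    """Fraction of (equally long) time slots in which the cell stores '1'."""
    return sum(stored_bits) / len(stored_bits)


def worst_pmos_stress(d):
    """Stress fraction of the more-stressed PMOS in a symmetric 6T cell.

    One PMOS is stressed while the cell stores '1' (fraction d), the other
    while it stores '0' (fraction 1 - d). The cell ages with the worse one.
    """
    return max(d, 1.0 - d)


# Hypothetical history of values held by one cell.
history = [1, 1, 1, 0, 1, 1, 0, 1]
d = duty_cycle(history)
print(d, worst_pmos_stress(d))
```

Under this proxy, a duty-cycle of 0.5 minimizes the stress of the worse transistor, which matches the balanced point described for Fig. 2b.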

State-of-the-art techniques and their limitations:

Various techniques have been proposed at circuit level and at architecture level. At circuit level, the structure of SRAM cells is modified to reduce the aging rate [16] [13]. For example, Ricketts et al. [16] proposed an asymmetric SRAM structure for workloads having a biased bit distribution, but due to its high data dependence, it is applicable only in specific scenarios. Recovery boosting through a dedicated recovery-accelerating circuit is another method for enhancing the lifetime of the SRAM cells [17], but it increases power/energy consumption due to additional transistors per cell, and therefore cannot be used in energy-constrained large-sized memories [18]. At architecture level, periodic inversion of data is used to reduce the aging rate of on-chip caches [19]. However, it cannot guarantee an optimal duty-cycle, specifically in cases where the same data is periodically reused, e.g., in DNN-based systems where the same set of parameters is reused for processing each input sample. Calimera et al. [20] improved recovery of unutilized portions of memory, but at the high area and energy cost of expensive online monitoring. The technique also suffers from serious performance degradation in dynamic workload scenarios. Another set of techniques uses bit rotations to address NBTI aging in registers [15], but they work only in cases where the overall distribution of bits is relatively balanced. Moreover, they use barrel shifters that incur high area and power overheads. The work in [21] proposed a configurable micro-architecture for reducing the aging rate of video memories, but it only works for streaming video applications.

In summary, the state-of-the-art techniques either incur high overheads in terms of area and power/energy or rely on certain specific workloads, and cannot be employed in DNN accelerators due to the unique properties of DNN hardware and workloads, as we will illustrate later in this paper.

Additional Challenges from the Deep Learning Perspective:

The dataflow (i.e., computation scheduling) for a given DNN on a specific hardware is defined as per the DNN architecture and the hardware implementation to achieve maximum energy-/performance-efficiency. Altering the dataflow to balance the duty-cycle in on-chip SRAM cells can result in significant degradation of system-level efficiency.

Therefore, an aging mitigation technique that does not require any alteration to the dataflow or the mapping of the data in on-chip SRAM is desired.

Our Novel Contributions:

Towards this, we propose DNN-Life, an aging analysis and mitigation framework for on-chip memories of DNN hardware (see Fig. 3). Our framework employs two key features:

1) Aging Analysis [Section III]: We analyze the impact of using different data representation formats and quantization methods for weights of a DNN on the probability distribution of weight-bits, as this can provide useful insights for designing an effective and low-overhead aging mitigation technique.
2) Aging Mitigation [Section IV]: We propose a scheme and supporting micro-architecture for mitigating the NBTI aging of 6T-SRAM-based on-chip weight memory of DNN accelerators with minimal energy overhead. Notably, our scheme does not require any alteration to the dataflow of DNN inference or on-chip data mapping, and thereby maintains the energy and performance benefits of the system. The micro-architectural extensions for aging mitigation are integrated in the DNN accelerator before and after the on-chip weight memory in the form of aging-aware write and read transducers, as shown in Fig. 4.

Fig. 3: Overview of the design-time and run-time steps involved in our DNN-Life framework. Our novel contributions are highlighted in colored boxes.

II. OVERVIEW OF OUR DNN-LIFE FRAMEWORK

In this work, we propose DNN-Life, a novel aging analysis and mitigation framework for weight memories of DNN hardware accelerators. It employs a low-cost data encoding scheme that accounts for diverse DNN workloads and adapts over time to balance the duty-cycle in each on-chip weight memory cell, thereby alleviating the NBTI-aging effects. Towards this, the two key features of our framework are:

1) Analysis: We analyze the probability distribution of weight-bits of different pre-trained DNNs to find key insights that help in developing a low-cost aging mitigation scheme. To capture the variations in the distribution across number representation formats and the methods used to transform the weights to those formats, we consider different number representation formats and different commonly used conversion methods. The detailed analysis and insights are presented in Section III.
2) Architecture: Based on the gathered insights, we design a data encoding module and an aging controller. The encoder is responsible for encoding the weights before writing the values to the weight memory, and the aging controller is responsible for generating the encoding information required to encode the data such that the duty-cycle is balanced. The encoding information is then stored to be used by the corresponding decoder module. The data encoder is deployed inside the DNN hardware accelerator right before the weight memory, and the corresponding decoder is installed after the memory, to decode the weights before passing them on for computations. The integration of the encoder and the decoder modules in a DNN accelerator is illustrated in Fig. 4a. The details of the micro-architecture are presented in Section IV.

A. DNN Hardware Architecture

Our DNN hardware architecture is based on well-established DNN accelerator models, such as [22] for dense DNNs. Our accelerator is composed of an Activation Buffer, a Weight Buffer, a Processing Array, and an Accumulation Unit; see Fig. 4a. Our proposed weight-memory aging mitigation modules integrated in the architecture are also shown in the figure (see details in Section IV). The activation and weight buffers provide intermediate storage for the activations and weights, respectively, to reduce the costly off-chip memory accesses. The buffers provide data to the processing array for performing the computations. For this work, we assume a memory hierarchy similar to Bit-Tactical [22], DaDianNao [3] and TPU [4], according to which: 1) the activation buffer is large enough to store the activations of a single layer of a DNN; 2) the activation memory can provide N activation values to the processing array at a time; and 3) the weight memory can provide f × N weights to the processing array simultaneously. The processing array (see Fig. 4b) is composed of f Processing Elements (PEs) that share the activations, and therefore can perform N multiplications for f different filters at the same time. Each PE has an adder tree to compute the sum of the multiplications. The computed sum is passed to the accumulation unit, where it is added to the corresponding partial sums to generate the output activation value. Note, as the filters can be significantly large, the computation of each output activation can take several cycles, depending on the filter size.

B. Dataflow in the DNN Accelerator
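Before turning to the tiling, the per-cycle behaviour of the processing array and accumulation unit described in Section II-A can be sketched in Python. This is only a functional illustration under simple assumptions (plain integers, no buffering or timing); the function names and the toy values of N and f are hypothetical.

```python
def processing_array_cycle(activations, weights):
    """One cycle: f PEs share the same N activations.

    weights[j] holds the N weights of filter j for this cycle. Each PE
    multiplies element-wise and reduces with its adder tree.
    """
    n = len(activations)
    assert all(len(pe_weights) == n for pe_weights in weights)
    return [sum(a * w for a, w in zip(activations, pe_weights))
            for pe_weights in weights]


def accumulate(partial_sums, cycle_sums):
    """Accumulation unit: add the new adder-tree sums to the partial sums."""
    return [p + s for p, s in zip(partial_sums, cycle_sums)]


# Toy example with N = 4 activations per cycle and f = 2 filters; each
# filter has 8 weights, so one output activation takes two cycles.
acts = [[1, 2, 3, 4], [5, 6, 7, 8]]
wts = [[[1, 0, 0, 0], [0, 1, 0, 0]],
       [[0, 0, 0, 1], [1, 1, 1, 1]]]
psum = [0, 0]
for a, w in zip(acts, wts):
    psum = accumulate(psum, processing_array_cycle(a, w))
print(psum)  # [9, 28]
```

The sketch also shows why larger filters need several cycles: each cycle consumes only N weights per filter, and the accumulation unit carries the partial sums across cycles.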

To perform the computations of a DNN layer using the above accelerator, the weights have to be partitioned into blocks that can be accommodated in the on-chip memory. The goal of partitioning is to maximize the use of available PEs. The input/output feature maps and the filters/neurons are all divided into so-called tiles, depending on the available on-chip storage for the corresponding data type. Works like SmartShuttle [23] provide methods to find an optimal tiling configuration and computation scheduling policy for a layer of a DNN for a given memory hierarchy.

Fig. 5 illustrates the policy that we employ for partitioning the filters of a CONV layer. Note, we support the well-established tiling technique so that we can demonstrate that our technique can benefit a wide range of existing DNN hardware accelerators. The figure also illustrates the sequence in which the blocks are moved to the on-chip weight memory and the corresponding computations are scheduled. The filters are first divided into sets, where each set contains f filters. Note, f is mainly defined based on the number of filters that the hardware accelerator can process in parallel. Afterwards, a chunk of data (grey boxes in Fig. 5) from a set is selected to be moved to the on-chip memory. The selected chunk contains a block of data of size r × c × ch from the same location of each filter in the set. The sequence in which the grey boxes are traversed in the filters defines the rest of the dataflow. The used sequence is shown as steps in Fig. 5.

III. ANALYSIS OF THE DISTRIBUTION OF WEIGHT-BITS FOR DIFFERENT DNNS & THEIR IMPACT ON DUTY-CYCLE

Before presenting the design of the proposed aging mitigation modules in Section IV, here we first present an analysis which highlights the rationale behind the proposed design.

Fig. 4: (a) Architecture of the baseline DNN accelerator. The highlighted boxes, i.e., Write Data Encoder (WDE), Read Data Decoder (RDD) and Aging Controller, are the proposed modules for mitigating NBTI aging of weight memory. (b) A detailed view of the processing array and the accumulation unit.

Fig. 5: Division of filters of a CONV layer of a DNN into smaller blocks that can be accommodated in the on-chip weight memory. Different colors correspond to different sets of filters/blocks. The gray colored boxes define one block of r × c × ch × f size. The steps show the sequence in which the blocks are moved to the on-chip fabric for scheduling their computations.

A. Analyzing the Distribution of Weight-Bits

For this analysis, we consider the AlexNet and the VGG-16 networks, trained on the ImageNet dataset. As different data representations for weights, we consider the 32-bit floating-point representation (IEEE 754 standard) and the 8-bit integer format achieved using range-linear symmetric and asymmetric quantization techniques [24]. Fig. 6 illustrates the ratio of observing a ‘1’ to the total number of observations (which corresponds to the probability of observing a ‘1’) at each bit-location of a word for all three data representation formats for both the networks.

By analyzing the distributions, the following key observations are made:

1) The probability of getting a ‘1’ value at a particular bit-location of a randomly selected weight depends on the network, the data representation format, and the method used to transform the data to the particular data representation format. For example, the probability of getting a ‘1’ at a particular bit-location in the symmetric 8-bit representation is almost the same across bit-locations within a network for both the considered DNNs; however, it varies across networks. Similarly, the probability of getting a ‘1’ at the lower bit-locations in the 32-bit floating-point representation is around 0.5; however, the distribution of bits at higher bit-locations varies across bit-locations as well as across DNNs.
2) Representation of weights using a specific format cannot guarantee a distribution that offers 0.5 probability at each bit-location, i.e., a distribution that can potentially lead to a balanced duty-cycle. For example, out of all the studied cases, only the distribution of the AlexNet when represented using the 8-bit integer format achieved using symmetric range-linear quantization offers close to 0.5 probability for all the bit-locations.
3) The average probability of getting a ‘1’ across bit-locations in a specific format is also not guaranteed to be equal to 0.5. For example, see the distributions of 8-bit asymmetrically quantized DNNs. Therefore, barrel shifter-based balancing techniques would not produce desirable results in such cases.
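The per-bit statistics behind these observations can be reproduced in spirit with a short Python sketch. It uses synthetic zero-mean Gaussian weights rather than the actual AlexNet or VGG-16 parameters, and the two quantizers follow one common formulation of range-linear quantization; the exact scaling and rounding rules used in [24] are an assumption here, as are all function names.

```python
import random
import struct


def float32_bits(x):
    """32-bit IEEE 754 pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def quantize_symmetric(weights, n_bits=8):
    """Assumed symmetric range-linear quantizer (two's complement storage).

    Assumes at least one non-zero weight.
    """
    scale = (2 ** (n_bits - 1) - 1) / max(abs(w) for w in weights)
    mask = (1 << n_bits) - 1
    return [round(w * scale) & mask for w in weights]


def quantize_asymmetric(weights, n_bits=8):
    """Assumed asymmetric range-linear quantizer: [min, max] -> [0, 2^n - 1]."""
    lo, hi = min(weights), max(weights)
    scale = (2 ** n_bits - 1) / (hi - lo)
    return [round((w - lo) * scale) for w in weights]


def ones_probability(words, n_bits):
    """Probability of a '1' at each bit-location (index 0 is the LSB)."""
    return [sum((w >> b) & 1 for w in words) / len(words)
            for b in range(n_bits)]


rng = random.Random(0)
weights = [rng.gauss(0.0, 0.05) for _ in range(10000)]  # synthetic stand-in

for name, words, n in [
    ("float32", [float32_bits(w) for w in weights], 32),
    ("int8 symmetric", quantize_symmetric(weights), 8),
    ("int8 asymmetric", quantize_asymmetric(weights), 8),
]:
    probs = ones_probability(words, n)
    print(name, [round(p, 2) for p in reversed(probs)])  # MSB first
```

Running the same `ones_probability` function on real pre-trained weights would give the kind of per-bit-location profile that Fig. 6 reports.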

Fig. 6: Distribution of bits of weights of different DNNs when represented in different data representation formats: (a) probability of getting a ‘1’ at a specific bit position of a weight in AlexNet; (b) probability of getting a ‘1’ at a specific bit position of a weight in VGG-16. Each panel plots probability against bit-location (MSB to LSB) for the 32-bit floating-point format (sign, exponent, mantissa) and for the symmetric and asymmetric 8-bit formats. Symmetric and asymmetric indicate which post-training quantization method is used to transform the data for the corresponding distribution.

B. A Probabilistic Model-based Analysis for Aging of 6T-SRAM On-chip Weight Memory of a DNN Accelerator

In the following, we develop a probabilistic model to analyze the effectiveness of different aging mitigation techniques.

1) Probabilistic Model:

Assume the on-chip memory of a given DNN accelerator is composed of I × J cells. For mapping the weights of a DNN onto the memory, we assume: (a) the same dataflow as presented in Fig. 5; (b) each block of weights is kept in the on-chip memory for an equal amount of time, and it is fetched only once during a single inference (similar to the dataflow for the DNN accelerator proposed in [22]); (c) each block of data mapped onto the on-chip memory fits perfectly in it. Based on the aforementioned conditions and the given DNN size, we can divide the DNN into K blocks, which translates to K data mappings onto the on-chip weight memory. Now, if the same DNN is used repeatedly for inference with the same dataflow, a single on-chip memory cell is mapped with only K different bits. If the probability of getting a ‘1’ for all the bits is given by ρ, the probability of getting a duty-cycle less than or equal to b/K, or greater than or equal to 1 − b/K, can be computed using the following equation, except when b/K = 0.5, where the probability is 1.

$$P_{b/K} = \sum_{i=0}^{b} \binom{K}{i} \rho^{i} (1-\rho)^{K-i} + \sum_{i=K-b}^{K} \binom{K}{i} \rho^{i} (1-\rho)^{K-i} \tag{1}$$

Here, b is an arbitrary variable with the range from 0 to ⌊K/2⌋. Note that we combine (i) the cases in which the duty-cycle is less than or equal to b/K and (ii) the cases in which the duty-cycle is greater than or equal to 1 − b/K, because in a symmetric 6T-SRAM cell both the cases cause the same level of stress in one of the two PMOS transistors.

Fig. 7: Probability of occurrence of duty-cycle ≤ b/K or duty-cycle ≥ 1 − b/K when (a) K = 20, and (b) K = 160. In (a), P_{b/K} > 0.1 for b/K = 0.3; in (b), with the increase in the value of K the probability starts decreasing.

Assuming the above computed probability to be the same for all the cells of the on-chip memory, the probability of at least n cells (out of I × J) experiencing a duty-cycle less than or equal to b/K, or greater than or equal to 1 − b/K, can be computed using the following equation.

$$P_{n} = \sum_{i=n}^{I \times J} \binom{I \times J}{i} P_{b/K}^{\,i} \left(1 - P_{b/K}\right)^{I \times J - i} \tag{2}$$
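Both equations are straightforward to evaluate numerically. The following Python sketch is one possible implementation, not the authors' code: Eq. 1 is computed directly with exact binomial coefficients, while Eq. 2 is summed in log-space (an implementation choice made here to avoid overflow for large I × J). The values K = 20, ρ = 0.5, b = 6 and I × J = 8192 are used purely as illustrative inputs.

```python
from math import comb, exp, lgamma, log


def p_duty_extreme(b, K, rho):
    """Eq. 1: probability that a cell's duty-cycle is <= b/K or >= 1 - b/K."""
    if not 0 <= b <= K // 2:
        raise ValueError("b must lie in [0, floor(K/2)]")
    if 2 * b == K:  # b/K = 0.5: the two ranges cover every outcome
        return 1.0

    def pmf(i):
        return comb(K, i) * rho ** i * (1 - rho) ** (K - i)

    low = sum(pmf(i) for i in range(0, b + 1))
    high = sum(pmf(i) for i in range(K - b, K + 1))
    return low + high


def p_at_least_n_cells(n, cells, p):
    """Eq. 2: probability that at least n of `cells` cells are affected."""
    if p <= 0.0:
        return 1.0 if n == 0 else 0.0
    if p >= 1.0:
        return 1.0

    def log_pmf(i):
        return (lgamma(cells + 1) - lgamma(i + 1) - lgamma(cells - i + 1)
                + i * log(p) + (cells - i) * log(1 - p))

    return min(1.0, sum(exp(log_pmf(i)) for i in range(n, cells + 1)))


p = p_duty_extreme(6, 20, 0.5)           # b/K = 0.3 with K = 20
print(p)                                  # about 0.115
print(8192 * p)                           # expected number of affected cells
print(p_at_least_n_cells(800, 8192, p))
```

For these inputs Eq. 1 evaluates to about 0.115, so roughly 945 of the 8192 cells would be expected to see a duty-cycle of at most 0.3 or at least 0.7.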

2) An Example Case-Study:

Let us consider a scenario where K = 20 and ρ = 0.5 (i.e., the best case with a balanced bit distribution), and I × J = 8192. Fig. 7a shows the probability for each possible value of b computed using Eq. 1. Note, even for b/K = 0.3, the probability is over 0.1, i.e., more than 10% of the cells are expected to experience a duty-cycle of at most 0.3 or at least 0.7.

Now, if we employ a given aging mitigation technique that offers up to 7 shifts to increase the number of different bits that are mapped to a single cell, we can theoretically increase the value of K to 160, assuming the bits to be independent from each other and an ideal shifting policy. Putting K = 160 in the above-mentioned example, Fig. 7b shows the probabilities for different b/K values. As can be seen from Fig. 7b, the probabilities at lower b/K values have dropped significantly. The above analysis implies that by significantly increasing K and having ρ = 0.5, we can achieve a close to ideal duty-cycle for all the cells.

Now, instead of a barrel shifter, if we employ an inversion-based duty-cycle balancing technique where every other write to the same location is inverted, for the given scenario, the value of K remains the same, as it is even. Moreover, as ρ is defined to be 0.5, the inversion-based policy has no impact on ρ either. Therefore, we get the same probabilities as presented in Fig. 7a. However, note that the inversion-based policy is mainly useful for achieving ρ = 0.5 in cases where the distribution of bits is biased either towards ‘0’ or ‘1’.

C. Challenges in Designing an Efficient Aging Mitigation System

Based on the above analysis, we outline the following key challenges in designing a generic aging mitigation system.

1) The probability of occurrence of a non-ideal duty-cycle is considerable even with the state-of-the-art fixed aging mitigation techniques. Therefore, a more robust method has to be designed by exploiting the fact that NBTI aging is more dependent on the average duty-cycle over the lifetime of the device [14].
2) The distribution of bits and the duty-cycle are significantly affected by the datatype used for representing the weights. Therefore, the mitigation technique should be generic and independent of the datatype used so that it is beneficial for various DNN accelerators.

Moreover, in practical scenarios, each layer of a DNN can have a different size. Therefore, the processing time can vary significantly across layers. Also, different DNNs can have different numbers of layers. Therefore, a method that keeps track of all these factors at a fine granularity can help in significantly reducing the aging rates. However, such methods are very costly. This makes it very challenging to develop a generic method that offers effective aging mitigation at reasonable overheads.

IV. A MICRO-ARCHITECTURE FOR MITIGATING AGING OF THE ON-CHIP WEIGHT MEMORY OF DNN ACCELERATORS

To address the above challenges, we propose a Write Data Encoder (WDE) for encoding the weights before writing them to the on-chip weight memory, and a Read Data Decoder (RDD) which performs the inverse function of the WDE while reading the data from the on-chip memory and before passing it to the processing array. The integration of the proposed modules in the DNN accelerator is shown in Fig. 4a. Moreover, we propose an aging mitigation controller which generates the control signals (metadata) for the write (and read) transducer. The proposed micro-architectures of the WDE and the aging mitigation controller are shown in Fig. 8.

Fig. 8: Proposed micro-architecture for effective aging mitigation of 6T-SRAM weight memory of DNN accelerators.

Write Data Encoder (WDE):

It leverages inversion logic, which besides its low overhead also enables perfectly balancing out the distribution of bits in the cells of the memory when the distribution is originally biased towards either ‘0’ or ‘1’, as highlighted in Section III. The inversion logic in the proposed micro-architecture is implemented using XOR gates, as they allow the aging mitigation controller to enable or disable it using just a single enable (E) signal. Another key advantage of this design is that the micro-architecture of the RDD is the same as that of the WDE, where the same E signal (metadata) that is used to encode the weights is used (at a later point in time) for decoding them before passing them to the processing array. Moreover, the proposed WDE and RDD modules are highly scalable, as increasing the width of the modules requires only a linear increase in the number of XOR gates. Therefore, the widths of these modules can be defined directly based on the DNN accelerator configuration without affecting the energy efficiency of the system.

Aging Mitigation Controller:

The controller is the core part of the proposed micro-architecture, as it is responsible for generating the enable signal (E) that enables/disables the inversion logic in the WDE. The design is based on the observation made in Section III that the higher the number of different bits written to an SRAM cell during its lifetime (i.e., K in Eq. 1), provided they are generated from a uniform distribution, the lower the chances of observing a deviation in its duty-cycle from 0.5 (see Fig. 7), i.e., the ideal point shown in Fig. 2b. Therefore, to increase the number of different bits to be written to an SRAM cell, we employ a True Random Bit Generator (TRBG) to generate the enable signal and decide whether the upcoming data should be written with or without inversion in the memory cell. The TRBG adds randomness to the bits to be written in the memory and thereby leads to a larger K value and lower aging.

Note that in practical scenarios, the output of TRBGs can be biased towards either ‘0’ or ‘1’, which can eventually affect the duty-cycle. Therefore, to mitigate this, we periodically invert the output of the TRBG after a defined number of iterations with the help of an M-bit register before using it as the enable signal, which balances the bias.

V. RESULTS AND DISCUSSION

A. Experimental Setup

Fig. 10 illustrates the overall experimental setup used for evaluation.The setup consists of hardware synthesis for estimating the power, areaand delay characteristics of the proposed modules, and simulations foraging estimation of the 6T-SRAM on-chip weight memory of differentDNN hardware accelerators. For hardware synthesis, we implemented low overhead compared to other techniques such as shifting, which requirescostly barrel shifters (as shown later in Section V)

4o appear at the th Design, Automation, and Test in Europe (DATE 2021)

Without Aging

Mitigation Inverter-based

Barrel Shifter- based

DNN-Life when

Bias = 0.5

DNN-Life with

Bias Balancing when Bias = 0.732-bit Floating Point Format

Format (Symmetric

Quantization)

Format (Asymmetric

Quantization)

DNN-Life without

Bias Balancing when Bias = 0.7 Baseline DNN

Accelerator with the AlexNet a Fig. 9: SNM degradation of 6T-SRAM on-chip weight memory cells of the baseline DNN accelerator when used for performing inferences only using theAlexNet network. Each bar graph shows the percentage of the number of cells (Y-axis) experiencing different level of SNM degradation (X-axis).

Memory Simulator

Duty-Cycle Generation Test dataset Verilog

Files

Verilog

Files

Verilog Files

Cadence Genus

Logic SynthesisPower Estimation

Area, Delay,

Estimation

Verilog Files

Verilog FilesTechnology LibraryDevice-Level NBTI AgingAging ModelAging AnalysisSource Files

Scripts (.m)Pre-trained Deep

Fig. 10: Overall experimental setup used for evaluation. (Inputs shown in the figure: neural networks, DNN hardware configurations and dataflow.)

different aging balancing circuits and our DNN-Life architecture in Verilog. The circuits are synthesized for the TSMC 65nm technology using Cadence Genus.

For aging estimation, we use Static Noise Margin (SNM) to quantify the NBTI-aging of 6T-SRAM cells, similar to [21] [25]. The SNM defines the tolerance to noise that directly affects the read stability of a cell [26], i.e., if the SNM of a cell is low, the cell is highly susceptible to read failures. As per [15] [21] [25], SNM mainly depends on the duty-cycle over the entire lifetime of the cell, and the least SNM degradation is achieved at 50% duty-cycle. To obtain SNM results, we employ a similar device aging model as used in state-of-the-art studies like [21] [25]. However, due to its duty-cycle optimization focus, our proposed technique is orthogonal to the given device aging models, and other device-level models can easily be integrated in our framework. Based on the models, the SNM degradation of a 6T-SRAM cell can be computed using the duty-cycle. From the analysis, the lowest SNM degradation for a 6T-SRAM cell after 7 years is 10.82% (at 50% duty-cycle), and the highest is 26.12% (at 0% and 100% duty-cycle).

For large-scale simulations, we integrated the output of these models into a memory simulator of the baseline DNN hardware (described in Section II-A). The simulator takes the DNN hardware configuration, dataflow, pre-trained DNN architecture and test samples as inputs. We also built a memory simulator for a TPU-like hardware architecture [4] to validate the proposed aging-mitigation technique across DNN hardware accelerators. The hardware configurations used for the evaluation are presented in Table I. The DNNs used are the AlexNet and the VGG-16 with the ImageNet dataset and a custom network with the MNIST dataset. The custom network is composed of two CONV layers and two FC layers, i.e., CONV(16,1,5,5), CONV(50,16,5,5), FC(256,800) and FC(10,256). For each setting, the duty-cycles are estimated based on the values observed in 100 inferences. The bias balancing register is defined to be a 4-bit register (i.e., M=4) for all the corresponding cases.
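As a minimal sketch of the duty-cycle estimation step described above, the following Python function computes the per-bit duty-cycle of a single memory word from the sequence of values written to it. It is not the authors' simulator: the function name, the `residency` argument (how long each value stays in the cell) and the 8-bit word width are assumptions made for illustration.

```python
from typing import List, Sequence


def duty_cycles(writes: Sequence[int],
                residency: Sequence[float],
                width: int = 8) -> List[float]:
    """Return the fraction of time each bit cell of one word holds a 1.

    writes    -- values stored in the word, in write order
    residency -- time each value remains stored (hypothetical input)
    width     -- word width in bits (assumed, 8-bit weights here)
    The result is indexed by bit position, LSB first.
    """
    if len(writes) != len(residency):
        raise ValueError("writes and residency must have the same length")
    total = float(sum(residency))
    if total <= 0:
        raise ValueError("total residency time must be positive")

    ones_time = [0.0] * width
    for value, time_stored in zip(writes, residency):
        for bit in range(width):
            if (value >> bit) & 1:
                ones_time[bit] += time_stored
    return [t / total for t in ones_time]


def worst_imbalance(duty: Sequence[float]) -> float:
    """Largest distance of any cell's duty-cycle from the ideal 50%."""
    return max(abs(d - 0.5) for d in duty)
```

A duty-cycle of 0.5 corresponds to the least SNM degradation reported above, while 0.0 or 1.0 corresponds to the worst case, so `worst_imbalance` gives a quick indication of how far a word is from the balanced state.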

B. Aging Estimation Results and Comparisons

In this subsection, we analyze the impact of using different aging mitigation policies on the SNM degradation of the 6T-SRAM on-chip weight memory cells after 7 years. We mainly considered four different policies: (1) No aging mitigation, (2) Inversion-based, (3) Barrel shifter-based, and (4) DNN-Life. For the proposed DNN-Life, we consider three different cases: (i) TRBG is not biased and it generates 0s and 1s with equal probability (referred to in the results as Bias=0.5); (ii) TRBG is biased and it generates 1s with 0.7 probability, and the aging controller does not have a bias balancing register (referred to in the results as without bias balancing with Bias=0.7); and (iii) TRBG is biased and it generates 1s with 0.7 probability and the aging controller has a 4-bit bias balancing register (referred to in the results as with bias balancing with Bias=0.7).

Moreover, we performed experiments considering three different data representation formats for weights: (1) 32-bit floating point format; (2) 8-bit integer format when weights are quantized using the symmetric quantization method; and (3) 8-bit integer format when weights are quantized using the asymmetric quantization method.

TABLE I: Hardware configurations and settings used in evaluation

| Setting | Baseline Accelerator (Section II-A) | TPU-like NPU [4] |
|---|---|---|
| Weight memory size | 512KB | 256KB |
| Activation memory size | 4MB | 24MB |
| PE array size | 8 PEs (1 PE = 8 Multipliers) | 256 x 256 PEs (1 PE = 1 MAC) |
| Networks | AlexNet | AlexNet, VGG-16 and Custom |

Fig. 11: SNM degradation of 6T-SRAM on-chip weight memory cells of a TPU-like NPU when used for performing inferences using the AlexNet, the VGG-16 and the custom DNN, individually. The networks are quantized to 8-bit format using the symmetric range-linear quantization method. (Panel columns: Without Aging Mitigation, Inverter-based, Barrel Shifter-based, DNN-Life with Bias Balancing when Bias = 0.7. Panel rows: AlexNet for ImageNet Dataset, VGG-16 for ImageNet Dataset, Custom Network for MNIST Dataset.)

Fig. 9 shows the distributions of SNM degradation in the memory cells obtained using different aging mitigation policies and a pre-trained AlexNet model. The Y-axis of each bar graph shows the percentage of the number of cells and the X-axis shows SNM degradation levels. Note that, for these experiments, we assumed the baseline DNN accelerator configuration presented in Table I and the dataflow shown in Fig. 5 with f = 8. Also, we assumed that only a single DNN (i.e., the AlexNet) is used for data inference throughout the lifetime of the device. As can be seen in the figure, the inversion-based and barrel shifter-based aging balancing reduce the SNM degradation of the SRAM cells; however, they do not offer minimum SNM degradation (see markers 2 and 3 in comparison with marker 1 in Fig. 9). This behavior is observed to be consistent across all the data representation formats (see markers 2 to 7 in comparison with their respective without-aging-mitigation graphs in Fig. 9). Specifically, the inversion-based aging balancing offers sub-optimal aging mitigation in case of the 32-bit floating point format (see marker 2 in Fig. 9), where most of the cells experience around 10.8% SNM degradation (see marker a in Fig. 9). However, this is not the ideal scenario, as there are 4% cells that experience the highest level of SNM degradation (see marker b in Fig. 9) and a few that experience a moderate level of SNM degradation (see marker c in Fig. 9). Now, if we analyze the results of the proposed DNN-Life with bias balancing, it offers maximum aging-mitigation (i.e., all the cells experience around 10.8% SNM degradation) in all the cases (see markers 8, 9 and 10 in Fig. 9).
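To illustrate why a deterministic inversion policy can leave some cells unbalanced while a randomized one does not, the following Python sketch compares three policies on two constructed write streams. It is a toy model and not the paper's memory simulator: the streams, the equal residency time per write, the use of `random.Random` as a stand-in for an unbiased TRBG, and all names are assumptions chosen for illustration. The read-side decoding is omitted.

```python
import random
from typing import Dict, List, Sequence

WIDTH = 8                   # assumed word width (8-bit weights)
MASK = (1 << WIDTH) - 1     # all-ones mask used to invert a word


def stored_duty_cycles(stream: Sequence[int],
                       invert_flags: Sequence[bool]) -> List[float]:
    """Per-bit fraction of writes that leave a 1 in the cell.

    Every write is assumed to stay in the cell for the same time.
    """
    ones = [0] * WIDTH
    for value, invert in zip(stream, invert_flags):
        stored = (value ^ MASK) if invert else value
        for bit in range(WIDTH):
            ones[bit] += (stored >> bit) & 1
    return [count / len(stream) for count in ones]


def worst_imbalance(duty: Sequence[float]) -> float:
    """Largest distance of any cell's duty-cycle from 50%."""
    return max(abs(d - 0.5) for d in duty)


def compare_policies(stream: Sequence[int], seed: int = 0) -> Dict[str, float]:
    """Worst-case duty-cycle imbalance of one word under three policies."""
    rng = random.Random(seed)  # software stand-in for an unbiased TRBG
    n = len(stream)
    policies = {
        # data is stored as-is
        "none": [False] * n,
        # every second write is stored inverted
        "alternating_inversion": [i % 2 == 1 for i in range(n)],
        # each write is inverted with probability 0.5
        "random_inversion": [rng.random() < 0.5 for _ in range(n)],
    }
    return {name: worst_imbalance(stored_duty_cycles(stream, flags))
            for name, flags in policies.items()}
```

For a constant stream such as `[0x0F] * 10000`, alternating inversion balances every cell. For a stream that alternates between `0x0F` and `0xF0`, the same policy stores `0x0F` on every write and leaves every cell at a 0% or 100% duty-cycle. Random inversion stays close to 50% in both cases, which is the property the TRBG-based design relies on.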

Impact of biased TRBG on aging balancing of 6T-SRAM on-chip weight memory: Fig. 9 also illustrates the impact of using the proposed design without bias correction when the duty-cycle of the TRBG is 0.7. As can be seen in the figure, for all the data representation formats, having a biased TRBG and no bias correction leads to less reduction in SNM degradation of the 6T-SRAM cells (e.g., see marker 11 in comparison with marker 8 in Fig. 9). This behavior is consistent across all the data representation formats.
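The effect of a biased TRBG can be made concrete with a short derivation. It assumes, as a simplification that this section does not state, that the write transducer stores $s = b \oplus r$, where $b$ is the data bit with $P(b=1)=p$, $r$ is the TRBG-controlled inversion bit with $P(r=1)=q$, and $b$ and $r$ are independent. The symbols $p$, $q$, $s$, $b$ and $r$ are introduced here for illustration only, and the bias balancing register is not modeled.

```latex
\begin{align}
P(s = 1) &= P(b = 1)\,P(r = 0) + P(b = 0)\,P(r = 1) \\
         &= p\,(1 - q) + (1 - p)\,q \\
         &= q + p\,(1 - 2q).
\end{align}
% Unbiased TRBG, q = 0.5:  P(s = 1) = 0.5 for every p.
% Biased TRBG,   q = 0.7:  P(s = 1) = 0.7 - 0.4 p, which lies in [0.3, 0.7].
```

Under these assumptions, an unbiased TRBG gives a 50% duty-cycle regardless of the data, whereas a bias of 0.7 can leave a cell up to 20 percentage points away from 50%. This is consistent with the smaller reduction in SNM degradation observed without bias correction.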

Impact across different hardware accelerators: Fig. 11 shows the impact of using the proposed aging-mitigation technique for a TPU-like [4] Neural Processing Unit (NPU) architecture that has an on-chip weight FIFO which is four tiles deep, where one tile is equivalent to weights for 256 × 256 PEs. Each PE has a single MAC unit that can perform 8-bit multiplication. For our implementation, we assumed the weight FIFO to be a circular buffer-based design. We performed analysis using the three different networks mentioned earlier. All the DNNs are quantized to 8 bits using post-training symmetric quantization. Considering the dataflow of the NPU, the parameter f was set to 256. As can be seen in Fig. 11, the inversion-based aging mitigation policy offers optimal results for the AlexNet and the VGG-16 networks (see markers 1 and 2 in Fig. 11). However, when used for the custom DNN, almost all the memory cells experience significant SNM degradation (see marker 3 in Fig. 11). The barrel shifter-based approach also offers sub-optimal results (see markers 4 to 6 in Fig. 11). However, the proposed DNN-Life with bias balancing offers maximum aging mitigation (see markers 7 to 9 in Fig. 11). This shows that DNN-Life can be used for a wide range of DNN accelerators.

C. Area and Power Results

The area, power and delay characteristics of three different WDEs composed of different aging balancing units are shown in Table II. All three WDEs are designed for a 64-bit width. The barrel shifter-based WDE consumes the most area and power. The proposed design consumes slightly more power and area than the inversion-based WDE. However, as shown in the previous subsection, it offers the best aging mitigation in all the possible scenarios regardless of the size of the given DNN, the data representation format and the on-chip weight memory size. Note that, at the hardware level, we realized the TRBG using a 5-stage ring oscillator.

TABLE II: Hardware results of different Write Data Encoders (WDEs)

| WDE design | Delay [ps] | Power [nW] | Area [cell area] |
|---|---|---|---|
| Barrel Shifter based WDE | 977.7 | 345190 | 9035 |
| Inversion based WDE | 811.6 | 10716 | 195 |
| Proposed WDE with Aging Mitigation Controller | 581.8 | 13747 | 295 |
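The relative overheads implied by Table II can be checked with a few lines of Python. The numbers are copied from the table; the dictionary layout and the `overhead` helper are illustrative names, not part of the paper.

```python
# Values copied from Table II (delay in ps, power in nW, area in cell area).
WDE_RESULTS = {
    "barrel_shifter": {"delay_ps": 977.7, "power_nw": 345190, "area": 9035},
    "inversion":      {"delay_ps": 811.6, "power_nw": 10716,  "area": 195},
    "proposed":       {"delay_ps": 581.8, "power_nw": 13747,  "area": 295},
}


def overhead(metric: str, design: str, reference: str) -> float:
    """Ratio of one design's metric to the reference design's metric."""
    return WDE_RESULTS[design][metric] / WDE_RESULTS[reference][metric]


if __name__ == "__main__":
    for metric in ("delay_ps", "power_nw", "area"):
        vs_inversion = overhead(metric, "proposed", "inversion")
        vs_barrel = overhead(metric, "barrel_shifter", "proposed")
        print(f"{metric}: proposed/inversion = {vs_inversion:.2f}, "
              f"barrel_shifter/proposed = {vs_barrel:.2f}")
```

From these ratios, the proposed WDE uses roughly 1.28x the power and 1.51x the area of the inversion-based WDE, while the barrel shifter-based WDE uses roughly 25x the power and 31x the area of the proposed design.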

VI. CONCLUSION

In this paper, we proposed DNN-Life, an aging-mitigation framework that employs read and write transducers to reduce NBTI-induced aging of 6T-SRAM on-chip weight memory in DNN hardware accelerators. We analyzed different DNN data representation formats at the software-level and their potential for balancing the duty-cycle in SRAM cells. Based on the analysis, we proposed a micro-architecture that makes use of a True Random Bit Generator (TRBG) to ensure optimal duty-cycle at runtime, thereby balancing the aging of complementary parts in 6T-SRAM cells of the weight memory. As a result, our DNN-Life enables efficient aging mitigation of weight memory of a given DNN hardware with minimal energy overhead.

ACKNOWLEDGMENT

This work is partially supported by Intel Corporation through Gift funding for the project “Cost-Effective Dependability for Deep Neural Networks and Spiking Neural Networks.”

REFERENCES

[1] V. Sze et al., “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[2] M. Capra et al., “Hardware and software optimizations for accelerating deep neural networks: Survey of current trends, challenges, and the road ahead,” IEEE Access, 2020.
[3] Y. Chen et al., “Dadiannao: A machine-learning supercomputer,” in IEEE/ACM MICRO Symposium, 2014, pp. 609–622.
[4] N. P. Jouppi et al., “In-datacenter performance analysis of a tensor processing unit,” in ACM/IEEE ISCA, 2017, pp. 1–12.
[5] P. McLellan. (2019) Hot chips: The biggest chip in the world. Accessed: 2019-09-10. [Online]. Available: https://community.cadence.com/cadence blogs8/b/breakfast-bytes/posts/the-biggest-chip-in-the-world
[6] J. Henkel et al., “Reliable on-chip systems in the nano-era: Lessons learnt and future trends,” in ACM/ESDA/IEEE DAC, 2013, p. 99.
[7] M. Shafique et al., “Robust machine learning systems: Challenges, current trends, perspectives, and the road ahead,” IEEE Design & Test, vol. 37, no. 2, pp. 30–57, 2020.
[8] J. Henkel et al., “Thermal management for dependable on-chip systems,” in IEEE ASP-DAC, 2013, pp. 113–118.
[9] M. A. Hanif et al., “Robust machine learning systems: Reliability and security for deep neural networks,” in IEEE IOLTS, 2018, pp. 257–260.
[10] S. Kim et al., “Matic: Learning around errors for efficient low-voltage neural network accelerators,” in IEEE DATE, 2018, pp. 1–6.
[11] K. Kang et al., “NBTI induced performance degradation in logic and memory circuits: How effectively can we approach a reliability solution?” in IEEE ASP-DAC, 2008, pp. 726–731.
[12] D. Gnad et al., “Hayat: Harnessing dark silicon and variability for aging deceleration and balancing,” in , 2015.
[13] J. Shin et al., “A proactive wearout recovery approach for exploiting microarchitectural redundancy to extend cache SRAM lifetime,” in ACM/IEEE SIGARCH Computer Arch. News, vol. 36, no. 3, 2008, pp. 353–362.
[14] J. Abella et al., “Penelope: The NBTI-aware processor,” in IEEE/ACM MICRO Symposium. IEEE Computer Society, 2007, pp. 85–96.
[15] S. Kothawade et al., “Analysis and mitigation of NBTI aging in register file: An end-to-end approach,” in IEEE ISQED, 2011, pp. 1–7.
[16] A. Ricketts et al., “Investigating the impact of NBTI on different power saving cache strategies,” in IEEE DATE, 2010, pp. 592–597.
[17] T. Siddiqua et al., “Enhancing NBTI recovery in SRAM arrays through recovery boosting,” IEEE TVLSI, vol. 20, no. 4, pp. 616–629, 2011.
[18] B. Zatt et al., “A low-power memory architecture with application-aware power management for motion disparity estimation in multiview video coding,” in IEEE/ACM ICCAD, 2011, pp. 40–47.
[19] T. Jin et al., “Aging-aware instruction cache design by duty cycle balancing,” in IEEE IVLSI, 2012, pp. 195–200.
[20] A. Calimera et al., “Partitioned cache architectures for reduced NBTI-induced aging,” in IEEE DATE, 2011, pp. 1–6.
[21] M. Shafique et al., “Enaam: Energy-efficient anti-aging for on-chip video memories,” in ACM/IEEE DAC, 2015, pp. 101:1–101:6.
[22] A. Delmas et al., “Bit-tactical: Exploiting ineffectual computations in convolutional neural networks: Which, why, and how,” preprint arXiv:1803.03688, 2018.
[23] J. Li et al., “Smartshuttle: Optimizing off-chip memory accesses for deep learning accelerators,” in IEEE DATE, 2018, pp. 343–348.
[24] D. Lin et al., “Fixed point quantization of deep convolutional networks,” in ICML, 2016, pp. 2849–2858.
[25] M. Shafique et al., “Content-aware low-power configurable aging mitigation for SRAM memories,” IEEE Transactions on Computers, vol. 65, no. 12, pp. 3617–3630, 2016.
[26] K. Agarwal et al., “Statistical analysis of SRAM cell stability,” in ACM/IEEE DAC, 2006, pp. 57–62.